
Simplifying Machine Learning Lifecycle with DVC

Aditya Wardianto
DevOps Engineer @ Cube Asia
Read this article at ditwrd.dev/en/posts/dvc-workshop

Prerequisites
#

Tools
#

  • Laptop/PC
  • Internet

Step by step
#

  • GitHub account (register here)
  • Fork the repository for the demo here
  • Log in to Gitpod here
  • Create a new workspace
    • Choose the forked repository
    • Ensure you are using the in-browser VS Code

We’ll get back to it later

Machine Learning Crash Course
#

If you want to know more about Python, you can check my old “Introduction to Python” article (free and publicly accessible) that I haven’t imported to this website yet, but for now it can be accessed here

What is machine learning?
#

An algorithm that iteratively learns by identifying patterns in data and making sense of them over time, using the magic of math 🔢✨

How can a machine “learn”?
#

I think we need to start with

How can humans learn?

🧠🧠🧠

We have a brain that contains billions of neurons

So, do machines have a brain?

Yes

Kinda

Not really

It is inspired by how the human brain works

How do we build this machine “brain”?

Machine Learning Building Blocks

To build an ML model, we need to know what it is made of. It is made up of neurons, simple placeholders for a number.

When we have a bunch of neurons that “talk” to each other, they create a “Neural Network”

A neural network consists of layers that are simply called:

  • Input
  • Hidden
  • Output

The number of layers and the number of neurons per layer can be tuned to fit a problem. Increasing the layer count creates a “deeper” model, which is where the term “Deep Learning” comes from.
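
To make the idea of layers concrete, here is a minimal sketch using the same TensorFlow/Keras library we will install later in the hands-on; the layer sizes are arbitrary and purely illustrative.

import tensorflow as tf

# A tiny network: input layer -> one hidden layer -> output layer
model = tf.keras.models.Sequential([
    tf.keras.Input(shape=(4,)),                   # input layer: 4 numbers go in
    tf.keras.layers.Dense(8, activation="relu"),  # hidden layer: 8 neurons
    tf.keras.layers.Dense(1),                     # output layer: 1 neuron comes out
])
model.summary()  # prints the layers and how many tunable parameters they hold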

How does a Neural Network actually learn?

Using clever math. It all started in 1943 in the Bulletin of Mathematical Biophysics with the article titled “A Logical Calculus of the Ideas Immanent in Nervous Activity” by W. McCulloch and W. Pitts.

Perceptron

The perceptron introduced a concept called “Forward Propagation”, a mathematical model of the neuron as shown below

$$ \hat{y} = f\left(\sum_{n=1}^{N} w_n x_n + b\right) $$

where:

  • \(\hat{y}\) = output
  • \(f\) = non-linear function
  • \(w\) = weight
  • \(x\) = input
  • \(b\) = bias (or sometimes \(w_0\))

The “learning” part of the model back then was done either manually or using a simple weight update rule like the one below

$$ w_{n+1} = w_n + \alpha \, (y - \hat{y}) \, x $$

where:

  • \(y\) = desired output
  • \(\hat{y}\) = model output
  • \(\alpha\) = learning rate
  • \(w\) = weight
  • \(x\) = input
  • \(b\) = bias (or sometimes \(w_0\)), updated the same way

This approach of updating the weights based on the error, in this case the difference between the model output and the desired output, is what helps the model learn
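
As an illustration only (this is not part of the hands-on project later), here is a tiny NumPy sketch of both formulas, using a step function as the non-linear \(f\) and the simple weight update rule to learn the logical AND of two inputs.

import numpy as np

def step(z):
    # the non-linear function f: outputs 1 if z >= 0, else 0
    return np.where(z >= 0, 1, 0)

# Toy dataset: the logical AND of two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1  # weights, bias and learning rate

for _ in range(20):  # a few passes over the data are enough here
    for x_i, y_i in zip(X, y):
        y_hat = step(np.dot(w, x_i) + b)   # forward propagation
        w = w + lr * (y_i - y_hat) * x_i   # weight update based on the error
        b = b + lr * (y_i - y_hat)         # bias, updated the same way

print(step(X @ w + b))  # [0 0 0 1] -> the perceptron has learned AND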

Researchers took a long time going down the rabbit hole, and finally in 1970 (S. Linnainmaa, in his master’s thesis on the reverse mode of automatic differentiation) the modern method of “learning” was introduced, which helped scale the “learning” portion

I’ll save you from reading too much math, because it’s a lot of math to dumb down. Instead, I recommend these playlists to get both the math and the intuition behind it

Playlist references:

  • 3blue1brown
  • Andrew Ng - DeepLearningAI

For now, I’ll just give you the intuition behind the magic of the method called “Backpropagation”

Backpropagation

In essence, it’s calculating the gradient of the loss (or error) with respect to the weights (and biases). The gradient is then used to do “gradient descent”, where we update the weights and biases based on the gradient. Picture the loss as a landscape: the gradient helps guide the parameters toward the lowest loss. Yes, this is an oversimplification, because the search for the lowest loss doesn’t exactly happen in two dimensions as we tend to picture it. The number of dimensions the model searches grows with its parameter count, so if you have a million parameters, the model’s search space is a million dimensions in size
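
To make that intuition concrete, here is a purely illustrative sketch of gradient descent on a model with a single weight, so the “landscape” is just a curve.

# Illustrative only: fit a single weight w so that w * x is close to y
x, y = 2.0, 8.0      # one training example; the ideal weight here is 4
w, lr = 0.0, 0.05    # start from w = 0 with a small learning rate

for _ in range(30):
    y_hat = w * x                  # forward pass
    grad = -2 * x * (y - y_hat)    # gradient of the squared error (y - y_hat)^2 with respect to w
    w = w - lr * grad              # gradient descent: step against the gradient

print(w)  # ends up close to 4, the weight with the lowest loss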

Machine Learning Lifecycle
#

This is going to be a short one: a breakdown of a simple machine learning model lifecycle. In a production environment at a larger company, the techniques used for each of these steps would differ and some steps can even be broken down further, but at their core the steps and goals remain the same.

Data Fetching
#

There are multiple ways to get data: we can do our own data collection/scraping, or use secondary data from a data repository such as HuggingFace or Kaggle

Data Analysis
#

It’s going to be helpful to do data analysis so we at least know the characteristics of our data. This will help us prepare for the next step

Data Preparation
#

Once we know the characteristics of our data, we can use techniques to sanitize it (undersampling, augmentation, noise reduction, etc.) so it’s ready for model training. The data can be split into three parts: training, testing and validation. The definitions of testing and validation data often get mixed up, so don’t get confused by it. Training data is the data the model uses to learn. Testing data is used during training to check whether the model can generalize to data it has never seen; this is where we tune our model. Validation data is the data the model never sees and that we never use to tune the model; it serves to test the final model’s real capability.
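
As a rough illustration of the idea (the hands-on later uses MNIST, which already comes pre-split into train and test), a plain NumPy three-way split could look like this.

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 8))     # 1000 made-up samples with 8 features each

idx = rng.permutation(len(data))      # shuffle before splitting
train_end = int(0.70 * len(data))     # 70% for training
test_end = int(0.85 * len(data))      # 15% for testing, the rest for validation

train = data[idx[:train_end]]            # the model learns from this
test = data[idx[train_end:test_end]]     # used during training to tune the model
validation = data[idx[test_end:]]        # held back until the very end

print(len(train), len(test), len(validation))  # 700 150 150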

Model Training
#

This is the core of the machine learning lifecycle; 99% of your time will be spent here. This is where you fine-tune the model, change parameters and even go back a few steps to ensure you are using the best version of your data. The training process uses the training and testing parts of your data

Model Validation
#

This is where the model is tested with validation data

Deployment
#

If everything is good, we can deploy the machine learning model to serve our purpose. For research purposes, we can create a GUI or CLI that helps us use the model.

Hands On: ML Project with DVC
#

In this hands-on project, we are going to use the famous MNIST dataset, which consists of handwritten digits and their labels. It’s a cool first project because at the end we can see that the model we built has actually learned to label the data

Initial Setup
#

Dependency Manager - uv
#

We start by installing uv, a pretty new Python dependency manager written in Rust that is definitely blazingly fast.

A dependency manager is a tool we can use to manage our dependencies or, to put it simply, the libraries in our project. There are a lot of dependency managers on the market right now, and Python has its own built-in one called pip, but we are using uv as it is way faster.

To install it, we can run this command in our Gitpod terminal

curl -LsSf https://astral.sh/uv/install.sh | sh

Continue by initializing uv in our project

uv init

We can then start installing the libraries for our project; in this case they are dvc, dvc-gdrive, dvclive, ruamel.yaml, python-box, tensorflow, numpy and ipykernel

uv add dvc dvc-gdrive dvclive \
ruamel.yaml python-box \
tensorflow numpy \
ipykernel

DVC - ML Pipeline Tools
#

DVC (Data Version Control) is a tool to manage the versions of our data. Beyond data management, it can be used to create a machine learning pipeline that ties together multiple Python scripts and data, automatically running only the specific parts of the pipeline affected by the current changes in our code and minimizing errors such as forgetting to run a specific script

To use dvc in our project we must first initialize it

uv run dvc init

We can then set up dvc to use a Google Drive folder to store our data. Create a Google Drive folder and save the UNIQUE_ID written in the folder URL

uv run dvc remote add dvc-demo gdrive://UNIQUE_ID

There is currently an issue with dvc access being blocked when using a personal Gmail account to connect. The issue arose just two weeks before the writing of this post; you can track it here

Due to the issue above, the next best thing is to use a service account. To do that, you need a GCP (Google Cloud Platform) account with a project; you can then create a Google service account through this guide here and, once done, create a service account key using this guide here

We can then save the JSON to accountService.json in our project and add accountService.json to .gitignore to prevent it from being committed to the repository. Ensure that you’ve shared the Google Drive folder with the service account email as an Editor

Once finished, we set up dvc to use the service account and let it know where the JSON file is

uv run dvc remote modify dvc-demo gdrive_use_service_account true
uv run dvc remote modify dvc-demo gdrive_service_account_json_file_path accountService.json
uv run dvc remote default dvc-demo

Data Fetching
#

We are going to use the MNIST dataset, which we can easily download from Google. But first, we need to create the folders where we will save our data. Create a folder called data and create 3 subfolders inside it named raw, interim and processed, each with an additional .gitkeep inside. raw is where we place our raw data, interim is where we keep cleaned versions of our data, and processed is where the data that will be used for the experiments ends up.

Go to data/raw and download the data manually from an external source

cd data/raw
wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz

mnist.npz will be right inside our data/raw folder. Now we can use dvc to save the data to our Google Drive folder.

uv run dvc add mnist.npz
uv run dvc commit
uv run dvc push

This creates a mnist.npz.dvc file that dvc uses to track our data. This small file is the only one pushed to GitHub, so there is no need to push a lot of data to GitHub, which would definitely be a hassle to manage once we have multiple GB of files to work with.

Data Exploration
#

This exploration will be simple, just to keep this hands-on short. To do it, we can use a Jupyter notebook to play around with our data. For now, the goal is simply to check how to open mnist.npz.

Create a notebooks folder and create 1 Data Exploration.ipynb inside it. Open the file and we can start coding

import numpy as np

path = "data/raw/mnist.npz"

with np.load(path) as f:
  x_train, y_train = f["x_train"], f["y_train"]
  x_test, y_test = f["x_test"], f["y_test"]

print(f"{x_train.shape=}")
print(f"{x_test.shape=}")
print(f"{y_train.shape=}")
print(f"{y_test.shape=}")
print(f"{x_train[0]}")

This code loads the file with np.load, and from there we can check how much data we have and what it looks like: 60,000 training images and 10,000 test images, each a 28×28 array of pixel values.
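
Optionally, if you add matplotlib to the project (it is not in the dependency list above, so run uv add matplotlib first), you can also peek at one digit and its label.

import matplotlib.pyplot as plt

plt.imshow(x_train[0], cmap="gray")   # the first training image, a 28x28 array
plt.title(f"label = {y_train[0]}")    # its label from y_train
plt.show()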

Data Preparation
#

From the previous step we know the amount of data and what it looks like. Now we are going to create a data pipeline that normalizes the data values from the range 0-255 to 0-1 and saves the result into a new file in processed. To keep this hands-on short we won’t use the interim folder, but if your data is scattered around, it definitely helps.

Now we’ll create the script that does the things mentioned above, but first we need to restructure our folders. uv init has created src/dvc_demo with an __init__.py inside it. Move dvc_demo out of src, then delete the src folder. This is done to reduce folder depth that may cause headaches in the future once the code grows bigger

Let’s start our script at dvc_demo/preprocessing/normalize.py. The code for the script is as shown below

import numpy as np
from box import ConfigBox
from ruamel.yaml import YAML


INPUT_PATH = "data/raw/mnist.npz"
OUTPUT_PATH = "data/processed/mnist_clean.npz"

yaml = YAML(typ="safe")


def open_mnist():
    with np.load(INPUT_PATH) as f:
        x_train, y_train = f["x_train"], f["y_train"]
        x_test, y_test = f["x_test"], f["y_test"]
    return x_train,y_train, x_test, y_test

def normalize(arr,num):
    return arr/num

def save_data(data_dict):
    np.savez_compressed(OUTPUT_PATH,**data_dict)

if __name__ == "__main__":
    params = ConfigBox(yaml.load(open("params.yaml", encoding="utf-8")))

    x_train,y_train, x_test, y_test = open_mnist()

    x_train_norm = normalize(x_train, params.preprocess.normalize.num)
    x_test_norm = normalize(x_test, params.preprocess.normalize.num)

    data_dict = {
        "x_train": x_train_norm,
        "y_train": y_train,
        "x_test": x_test_norm,
        "y_test": y_test,
    }

    save_data(data_dict)

and an additional params.yaml

preprocess:
  normalize:
    num: 255

This script takes the raw data as input and writes the cleaned data to data/processed/mnist_clean.npz. To run the script, go back to the root folder and use uv run

cd ../..
uv run dvc_demo/preprocessing/normalize.py

This will generate the normalized mnist data

This approach is okay, but running it by hand every time is not great; that’s why we use dvc. Now we are going to create a stage, in some sense the start of a pipeline, where we tell dvc every input and output dependency. For the script above, we can do this

uv run dvc stage add \
-n preprocess-normalize \
-p preprocess \
-d dvc_demo/preprocessing/normalize.py \
-d data/raw/mnist.npz \
-o data/processed/mnist_clean.npz \
uv run dvc_demo/preprocessing/normalize.py

The -n flag corresponds to the stage name, -p is the parameter that we can use for experimenting later, -d marks a dependency of the process (this can be any file or script needed for the process to work) and -o is the output of the process.

Now you can run these commands to see the connections between the stages and to run all of them

uv run dvc dag
uv run dvc repro

Model Training
#

For the last two steps of the ML lifecycle, we are going to follow the same approach as before:

  • Experimenting with the code in a notebook
  • Putting it into a script
  • Creating a stage with its dependencies and parameters

This is the code for dvc_demo/model/train.py

from dvclive import Live
from dvclive.keras import DVCLiveCallback
import tensorflow as tf
from ruamel.yaml import YAML
from box import ConfigBox
import numpy as np

yaml = YAML(typ="safe")
INPUT_PATH = "data/processed/mnist_clean.npz"


def train():
    params = ConfigBox(yaml.load(open("params.yaml", encoding="utf-8")))
    print(params)

    # Open data
    with np.load(INPUT_PATH) as f:
        x_train, y_train = f["x_train"], f["y_train"]
        x_test, y_test = f["x_test"], f["y_test"]

    layer = [tf.keras.layers.Flatten(input_shape=(28, 28))]

    for node_num in params.train.node:
        layer.append(tf.keras.layers.Dense(node_num, activation="relu"))
    layer.append(tf.keras.layers.Dropout(0.2))
    layer.append(tf.keras.layers.Dense(10))

    model = tf.keras.models.Sequential(layer)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
    with Live("results/train") as live:
        model.fit(
            x_train,
            y_train,
            validation_data=(x_test, y_test),
            callbacks=[DVCLiveCallback(live=live)],
            epochs=params.train.epochs,
        )
        model.save("models/model.keras")
        live.log_artifact("models/model.keras", type="model")


if __name__ == "__main__":
    train()

To ensure that dvc has data for each training epoch, we use Live from dvclive together with DVCLiveCallback to track the training process

To ensure everything works, we need to add this to params.yaml

train:
  epochs: 5
  node:
    - 32
    - 8

Now we can add a stage for training like this

uv run dvc stage add \
-n train \
-p train \
-d data/processed/mnist_clean.npz \
-d dvc_demo/model/train.py \
-o models/model.keras \
uv run dvc_demo/model/train.py

Model Evaluation
#

The model evaluation for this hands-on is a bit redundant, because we are just repeating the exact validation the model already did during training; it is added for the sake of completeness and can serve as boilerplate for the many other evaluations you might want to do. The code below goes in dvc_demo/model/evaluate.py

from dvclive import Live
import tensorflow as tf
from ruamel.yaml import YAML
from box import ConfigBox
import numpy as np

yaml = YAML(typ="safe")
INPUT_PATH = "data/processed/mnist_clean.npz"
INPUT_MODEL = "models/model.keras"


def eval():
    params = ConfigBox(yaml.load(open("params.yaml", encoding="utf-8")))
    print(params)

    # Open data
    with np.load(INPUT_PATH) as f:
        x_test, y_test = f["x_test"], f["y_test"]

    model = tf.keras.models.load_model(INPUT_MODEL)

    with Live("results/evaluate") as live:
        evaluation = model.evaluate(
            x_test,
            y_test,
        )
        print(evaluation)
        live.summary["evaluation"] = evaluation



if __name__ == "__main__":
    eval()

We then add a stage for evaluation:

uv run dvc stage add \
-n evaluate \
-d data/processed/mnist_clean.npz \
-d dvc_demo/model/evaluate.py \
-d models/model.keras \
uv run dvc_demo/model/evaluate.py

Experimentation
#

This is where things become fun. We have now created a trackable and idempotent ML pipeline where we can start our tuning.

This is the main command that you’ll use for tuning

uv run dvc exp run \
-n experiment-1 \
-S "train.node=[128,64,32]" \
-S "train.epochs=10"

-n corresponds to the name of the experiment (if we don’t use it, the experiment name is autogenerated) and -S sets the params that we are changing

DVC VSCode Extension

In VS Code we can install the DVC extension to help visualize everything; it lets us see the experiments and plot everything. Alternatively, we can run uv run dvc exp show to see everything via the terminal

Now, for every change you make, whether in your program, an additional stage or the search for a better param, your ML pipeline is already settled and you can focus on the things that matter

For more in-depth documentation you can check out DVC’s own website

Wait, what about deployment?

Deployment is a vast topic, especially in machine learning. It can be simple or super complex, whether you are scaling for millions of users or just serving research purposes; for now, it is left as an exercise for the reader

Thank you!
#

Thank you for reading this article. The topics taught here just scratch the surface of machine learning, with a touch of imperfection and oversimplification here and there. I hope this brings you closer to your goal, whatever it is

For any inquiries, feel free to contact me via email at [email protected] or through my LinkedIn DMs

Have a good day!