ML Data Versioning with DVC: How to manage machine learning data

08 October 2021

Machine learning projects are a beautiful medley of the code used to build models and the data used to train them. ML models are complex beasts that change in many ways: through the datasets used, the way those datasets are transformed, and the code itself. Keeping your head on straight amid all the models you’ve created, or could have created, with each version is no simple task. How do you track what you’ve done? Share it? Reproduce it? What you’re looking for is data versioning, and specifically ML data versioning with DVC.

Data versioning with DVC isn’t the light at the end of the tunnel; it’s the headlight beams keeping you on track. At Appsilon we work with a lot of machine learning data, models, and parameters. We know the importance of keeping everything well organized. By keeping our data under version control, we’re able to deliver high-quality ML solutions for complex, fast-paced projects.

Continue reading as we share how to start organizing your data in a project using DVC.

Common problems with ML data versioning
What is DVC?
How does ML data versioning work with DVC?
Use case example
A solution to ML data versioning with DVC
Integrating ML data versioning with DVC in your workflow

Common problems with ML data versioning

There’s a trap I often see when a project starts as a small PoC. The process usually begins with gathering some data and creating a model in a Jupyter Notebook. Some resampling here and a dash of data preprocessing there and sure enough you’ve made some decent models. As time progresses, this PoC turns out to be a stepping stone into a project worth pursuing. And so we put in more time and more effort. And we create more models, better models. But soon enough this so-far-so-smooth process hits a snag.

Stop me if you’ve heard these before:

“Was it in model_3final.pth or model_3last.pth that I used a bigger learning rate?”

“When did I start using data preprocessing, during model_2a.pth or model_2aa.pth?”

“Is model_7.pth trained on the new dataset or on the old one?”

“Oh, gosh, which set of parameters and data have I used to train model_2.pth? It was pretty good in the end…”

If you feel attacked, just know you’re not alone. We all have these problems, and they only snowball as more people join your team.


But fear not. There are solutions to these problems. Git alone can solve a few of them. However, pushing a 2 GB+ dataset into your Git repository has its hangups. You need a tool that extends Git from traditional software version control to ML projects. And that’s exactly what DVC does.

What is DVC?

DVC (Data Version Control) is an open-source tool that makes data science and machine learning projects easy to reproduce and share. It handles large datasets and ML models, and it lets ML engineers build best practices into their workflow. You can use it with Git to track data, parameters, and other aspects of your ML project.

With DVC you can store code in a Git repository and data and models in remote storage such as AWS, GCP, or Azure. Git is fairly flexible, but it doesn’t manage large ML datasets very well. DVC instead creates small metafiles that are committed to Git, while the data itself lives in a cache that can be synced to external, remote storage. You can push the data to that storage and switch between versions of your datasets. Basically, DVC stores copies of the data in the backend, while Git tracks the changes to that data through the metafiles. DVC is not a replacement for Git, but rather a way to enable smoother cooperation between tools.

How does ML data versioning work with DVC?

Under the hood, when you run dvc add data, DVC hashes every file in the data directory, adds the directory to .gitignore, and creates a small data.dvc file that you commit to Git. By comparing hashes, DVC knows when files change and which version to restore.
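To get a feel for why hashing is enough to detect changes, here is a minimal sketch that uses plain md5sum rather than DVC itself. The scratch directory, the file names, and the dir_fingerprint helper are all made up for illustration, and DVC's real implementation differs in its details. The idea is the same though: change one row in one file and the fingerprint of the whole directory changes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch directory standing in for the project's data/ folder.
workdir="$(mktemp -d)"
trap 'rm -rf "$workdir"' EXIT
mkdir -p "$workdir/data"
printf 'id,value\n1,10\n' > "$workdir/data/part_001.csv"
printf 'id,value\n2,20\n' > "$workdir/data/part_002.csv"

# Illustrative helper (not a DVC command): hash every file in sorted
# path order, then hash that listing to get one fingerprint per directory.
dir_fingerprint() {
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 md5sum | md5sum | cut -d' ' -f1)
}

before="$(dir_fingerprint "$workdir/data")"

# Change a single value in one file.
printf 'id,value\n2,21\n' > "$workdir/data/part_002.csv"
after="$(dir_fingerprint "$workdir/data")"

echo "before: $before"
echo "after:  $after"
if [ "$before" != "$after" ]; then
  echo "data changed"
fi
```

Because the fingerprint depends only on file contents and paths, a small text file holding it is enough for Git to record which version of the data a commit refers to.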


Use case example

Let’s take a look at an example. Suppose you have the following files in your directory:

├── README.md
├── data
│   ├── secret_appsilon_data_p001.csv
│   ├── secret_appsilon_data_p002.csv
│   └── secret_appsilon_data_p100.csv
├── model.py
├── params.yaml
├── test.py
├── train.py
└── utils.py

Everything seems to be working fine. But after a few iterations your directory might look more like this:

├── README.md
├── data
│   ├── secret_appsilon_data_p001.csv
│   ├── secret_appsilon_data_p002.csv
│   ├── secret_appsilon_data_p100.csv
│   ├── secret_appsilon_data_p001_2.csv
│   ├── secret_appsilon_data_p002_2.csv
│   └── secret_appsilon_data_p100_2.csv
├── model.py
├── models
│   ├── good_params_model3.yaml
│   ├── model.pt
│   ├── model2.pt
│   ├── model3.pt
│   ├── model3a.pt
│   └── model5.pt
├── params.yaml
├── test.py
├── train.py
└── utils.py

Even if you prefer to work with organized chaos, any additional team members will likely have no clue which model has been trained when, on which set of parameters, which version of data, and so on. Frankly, it’s a mess and we could use some help. Although DVC has a few features dedicated to experiments, let’s start with the easiest way to utilize DVC.


A solution to ML data versioning with DVC

Going back to the original, clean project layout, we first initialize Git and DVC in the project root:


git init
dvc init

Let’s begin by adding the data directory.

dvc add data

The output:

100% Adding...|████████████████████████████████████████|1/1 [00:00, 2.09file/s]

To track the changes with Git, run:

git add data.dvc .gitignore

DVC guides you through tracking the files in Git. Instead of adding the whole data directory to the repository, you only add the data.dvc file. You can check that the data directory has been added to .gitignore.

Next, we add the rest of the files to Git.


git add data.dvc .gitignore
git add README.md model.py params.yaml test.py train.py utils.py
git commit -m "Init repo"
git push

The code has now been uploaded to the repo, but the data is still on your machine. For DVC remote storage, you can use either a dedicated directory on a disk or a cloud backend such as S3, Google Cloud Storage, or Google Drive. To configure an S3 bucket as the default remote, run:


dvc remote add -d myremote s3://mybucket/path
git add .dvc/config
git commit -m "Configure remote storage"

Now you can run:


git push
dvc push

Congratulations! Your code and data are now versioned in the repository.

If you change data, just run:


dvc add data
git add data.dvc
git commit -m "New data added"
git push
dvc push

You can treat models the same way you treat data. It is preferable to have a single file, model.pt, that changes with new data and different parameters. If you need to go back in time and check how the data looked in a previous version, run git checkout XXX, where XXX is the commit, tag, or branch you want, followed by dvc checkout.
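As a rough sketch of the Git half of that step, the script below builds a throwaway repository with two versions of a pointer file and restores the older one. Note that it is a variation on the command above: it restores only data.dvc from the previous commit instead of checking out the whole commit. The file contents are placeholders rather than real DVC output, and the repository, user name, and email are invented for the example. In a real project you would follow it with dvc checkout to sync the data directory.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository (hypothetical, for illustration only).
repo="$(mktemp -d)"
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git config user.email "dev@example.com"
git config user.name "Example Dev"

# Stand-in for the pointer file that `dvc add data` would write
# (placeholder contents, not real DVC output).
echo "pointer to data v1" > data.dvc
git add data.dvc
git commit -q -m "Data v1"

echo "pointer to data v2" > data.dvc
git commit -q -am "New data added"

# Restore only the pointer file from the previous commit...
git checkout -q HEAD~1 -- data.dvc
restored="$(cat data.dvc)"
echo "$restored"

# ...then, in a real DVC project, sync the workspace to match it:
#   dvc checkout data.dvc
```

Restoring just the pointer file keeps your current code in place while the data goes back in time, which is handy when you want to rerun today's training script on an older dataset.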

That said, checking out old commits is not an ideal way to track your experiments. For tracking experiments specifically, DVC provides an entire interface. You can learn more in the DVC documentation for experiments.


Integrating ML data versioning with DVC in your workflow

Working with large datasets can be challenging. Whether you’re moving, cleaning, or tracking them, they’re difficult to keep on top of, especially in the context of an ML project. But by using DVC alongside Git to version your data, along with other DataOps best practices, you don’t have to risk losing productivity and quality.

Keep your team on track and your deadlines on target by establishing workflows, version controlling models, and creating reproducible experiments. Make life simpler for both you and your team by incorporating DVC into your ML projects. If you need help, consider reaching out to Appsilon. Our ML and Computer Vision team has experience building custom ML solutions and can adapt existing solutions to fit your needs. Explore some of our case studies and discover the possibilities of ML and computer vision.