CI/CD with image datasets: best practices?

I am currently fascinated by how easy it is to set up Continuous Integration with nbdev.
Many of the operations I run in my notebooks manipulate image or video files that live on my local machine, so whenever I push the repo, the GitHub tests obviously fail because they cannot find these files.

I have been looking around the internet for best practices in MLOps with data, but either there is too much information, so I don’t know where to start, or the advice is tailored to other CI/CD frameworks.

What are your recommendations on the following points?

  • Where do you store your test data (image/video files)?
  • How do you version this data, and how do you relate it to commit versions?
  • Do you usually prefer to generate dummy data automatically or to work with a real dataset?

Thank you!

While I cannot help with those specific questions, I can tell you how I approach a similar situation that arises when I am developing my own libraries.

I usually have two options: skip the tests on those specific notebooks, or use dummy data / keep a sample folder with a small subset, say 5 or 10 images, that you can commit to git.

Scenario a: To skip tests, just add a file called `.notest` to the folder containing the notebooks you want to skip. You could also define your own flags in `settings.ini`. I have several, including one I call `research`, which skips all notebooks in the research folder. This approach has the benefit that you can still test those notebooks locally before you push to GitHub, with `nbdev_test --flags "research notest"`. Alternatively, if you do not want to change your folder structure, you can place a raw cell at the top of the notebooks whose tests you want to skip; see: nbdev - nbdev1 migration
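For reference, the flag setup described above looks roughly like this. This is a sketch from memory; I believe the relevant key in nbdev's `settings.ini` is `tst_flags`, but check the docs for your nbdev version:

```ini
; settings.ini (excerpt): declare a custom test flag named "research".
; Notebooks/cells marked with this flag are skipped in CI by default.
[DEFAULT]
tst_flags = research
```

Locally you can then still run the flagged notebooks on demand by passing the flag to `nbdev_test`.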

Scenario b: I have a bunch of sample files committed to git that I use to develop the library, and they resemble the files the code is actually supposed to work on. Switching to the ‘real’ files is then easy, since the two sets match in format and folder structure. Alternatively, you could generate dummy data and make it as close as possible to the real data, so that your tests pass on GitHub.
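A minimal sketch of the dummy-data option. The function name `make_dummy_images` and the layout are my own invention; I use the plain-text PPM image format here only because it needs no imaging library, but the same idea works with Pillow or any other generator, as long as the output mirrors your real dataset's folder structure:

```python
from pathlib import Path

def make_dummy_images(dest: Path, n: int = 5, size: int = 8) -> list[Path]:
    """Write n tiny solid-color PPM images into dest, mirroring the real
    dataset's layout so tests can switch between dummy and real data."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(n):
        # Plain-text PPM: a small header, then one "R G B" triple per pixel.
        header = f"P3\n{size} {size}\n255\n"
        pixel = f"{(i * 40) % 256} 0 0\n"   # give each image its own shade
        path = dest / f"img_{i}.ppm"
        path.write_text(header + pixel * (size * size))
        paths.append(path)
    return paths
```

Because the dummy files live in the same tree shape as the real ones, pointing the notebooks at one folder or the other is a one-line switch.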


FYI, the fastai library is a good place to look for ideas on stuff like this. We create tiny versions of datasets for our tests, put them on AWS, and have our tests download them as needed using `untar_data`.
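The download-once-and-cache pattern behind this can be sketched with only the standard library. To be clear, `untar_data` is fastai's actual helper; the `fetch_test_data` function below is a hypothetical stand-in showing the same idea, assuming the test data is published as a tarball at some URL:

```python
import tarfile
import urllib.request
from pathlib import Path

def fetch_test_data(url: str, dest: Path) -> Path:
    """Download and extract a tarball of test data once; later calls
    find the extracted folder and reuse it instead of re-downloading."""
    dest = Path(dest)
    if dest.exists():                            # already cached: nothing to do
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    archive = dest.with_suffix(".tgz")
    urllib.request.urlretrieve(url, archive)     # fetch the archive
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(dest)                     # unpack into the cache dir
    archive.unlink()                             # keep only the extracted files
    return dest
```

In CI this means the first test run pays the download cost and every later run (or every notebook in the same run) just reads the cached folder.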


Storing and versioning of data and models can be done using