How to store a training dataset with tens of millions of images

When solving certain tasks, I have to deal with datasets consisting of tens of millions of images.

One of the difficulties I want to bring up here is how to lay them out in a filesystem. My initial attempts to store all images in one folder failed miserably: writing and reading files slowed down rapidly as the folder grew. I tried ext4, XFS, and NTFS. Splitting the images into subfolders helps somewhat on some of these filesystems.

I am not an expert in filesystems and my conclusions may be wrong, but I ended up with the opinion that using a general-purpose filesystem as a container for a dataset is overkill: the filesystem has to spend a lot of time and space supporting use cases that a training dataset does not need. Specifically, for a training dataset I only need to read the bytes corresponding to an image, and that's it. A general-purpose filesystem additionally manages things such as creation time, last modification time, directory listings, and plenty of other things I am not even aware of.

Not only is reading and writing files slow on a filesystem with tens of millions of files, but managing such a dataset is extremely slow as well. What typically happens: I store the dataset in low-bandwidth, low-IOPS storage (meaning S3, a spinning hard drive, or some kind of network-attached storage) and copy it to an SSD or attached storage when I need to use it. This is painfully slow if the dataset contains millions of files.

My initial solution was simple: put every image into a tar file and keep a separate index file. The index file contains the starting and ending byte offset of every image. From the underlying filesystem's perspective these are just two files: easy to move around, and causing no slowdown.
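
Roughly, it looked like this (a minimal sketch, assuming an uncompressed tar and a JSON index of start/end byte offsets; the exact index format doesn't matter):

```python
import json
import tarfile

def build_index(tar_path, index_path):
    """Scan an existing, uncompressed tar once and record where every
    member's raw bytes start and end inside the archive."""
    index = {}
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            # offset_data points at the start of the member's file content in the tar.
            index[member.name] = [member.offset_data, member.offset_data + member.size]
    with open(index_path, "w") as f:
        json.dump(index, f)

def read_image_bytes(tar_path, index, name):
    """Random access at training time: one seek and one read, no tar parsing."""
    start, end = index[name]
    with open(tar_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```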

Over time, I realized that managing the index file is a chore. That’s when I switched to SQLite. SQLite is awesome! Each of my huge datasets is now a single SQLite file with a single table and two columns: a key/name, and a binary blob containing the file. SQLite manages the “index” for me and allows easy and reliable updates to the dataset. From a performance perspective, I see no noticeable difference when reading random images compared to my initial tar-container method or the traditional images-as-files approach.
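
For the curious, the container is about as simple as it sounds; something along these lines (table and column names here are just illustrative):

```python
import sqlite3

def open_dataset(db_path):
    con = sqlite3.connect(db_path)
    # One table, two columns: a key/name and the raw image bytes.
    con.execute(
        "CREATE TABLE IF NOT EXISTS images (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return con

def add_image(con, name, image_path):
    with open(image_path, "rb") as f:
        con.execute(
            "INSERT OR REPLACE INTO images (name, data) VALUES (?, ?)", (name, f.read())
        )
    con.commit()

def get_image_bytes(con, name):
    row = con.execute("SELECT data FROM images WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```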

Additionally, there are SQLite viewers/clients that support blobs, recognize them as images, and can display them, meaning that I can still explore the dataset manually if I want to.

Now, my goal with this post is threefold:

  1. Being no filesystem expert, am I missing something? Has the community figured out how to store millions of files on a filesystem without slowdown?
  2. If not, I suggest trying SQLite. My experience using SQLite as a container for a dataset is extremely positive.
  3. I am completely new to the fastai library, but I am immediately a huge fan. Would it be useful if it had code supporting datasets stored in SQLite, similar to how fastai works beautifully with datasets stored as individual files?

Yes – it’s slightly clunky, but you can do this by using a random hash (or similar) for your filenames and storing them as (e.g., if the filename is “5e37a91ec.jpg”)

5/e/3/7/5e37a91ec.jpg

That way, each of your folders will only have a few files in it.
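
Something like this, for instance (just a sketch, assuming the filename is already a hex hash as above):

```python
from pathlib import Path

def sharded_path(root, filename, depth=4):
    """Map '5e37a91ec.jpg' to root/5/e/3/7/5e37a91ec.jpg so that no single
    directory ever holds millions of entries."""
    return Path(root, *filename[:depth], filename)

def store(root, filename, data):
    path = sharded_path(root, filename)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
```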

You could also consider the HDF5 file format. Having said that, your SQLite approach is just fine.
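
In case it helps, HDF5 can store variable-length byte blobs, so the PNGs can stay compressed. A rough h5py sketch, with placeholder dataset names:

```python
import h5py
import numpy as np

def write_hdf5(h5_path, names, png_paths):
    # A variable-length uint8 dtype lets every entry hold a blob of a different size.
    vlen_bytes = h5py.vlen_dtype(np.dtype("uint8"))
    with h5py.File(h5_path, "w") as f:
        images = f.create_dataset("images", shape=(len(png_paths),), dtype=vlen_bytes)
        f.create_dataset("names", data=names, dtype=h5py.string_dtype())
        for i, path in enumerate(png_paths):
            with open(path, "rb") as img:
                images[i] = np.frombuffer(img.read(), dtype=np.uint8)

def read_image_bytes(h5_path, i):
    with h5py.File(h5_path, "r") as f:
        return f["images"][i].tobytes()
```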

  3. I am completely new to the fastai library, but I am immediately a huge fan. Would it be useful if it had code supporting datasets stored in SQLite, similar to how fastai works beautifully with datasets stored as individual files?

Yes, definitely! Especially if you can do it with just the Python stdlib, or at least avoid adding large new dependencies.
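
sqlite3 is already in the stdlib, so the only extra piece is decoding the blob. A rough sketch of what the read side could look like (names are illustrative; this isn’t an existing fastai API):

```python
import io
import sqlite3
from PIL import Image

class SQLiteImageSource:
    """Read-only wrapper around a 'one table, key + blob' SQLite dataset."""

    def __init__(self, db_path, table="images", key_col="name", blob_col="data"):
        self.con = sqlite3.connect(db_path)
        self.table, self.key_col, self.blob_col = table, key_col, blob_col
        # Load all keys up front; blobs are fetched lazily, one at a time.
        self.keys = [r[0] for r in self.con.execute(f"SELECT {key_col} FROM {table}")]

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, i):
        query = f"SELECT {self.blob_col} FROM {self.table} WHERE {self.key_col} = ?"
        (blob,) = self.con.execute(query, (self.keys[i],)).fetchone()
        return Image.open(io.BytesIO(blob))
```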


You can have a look at the TFRecord format. It stores data in the protobuf format, so you get very fast deserialization. I am not sure how well it is supported by fastai; it is generally used in the TensorFlow/JAX/Keras world.
An advantage over individual files is faster read performance.
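
For reference, a rough sketch of writing already-encoded image bytes into a TFRecord and parsing them back (names and paths are placeholders):

```python
import tensorflow as tf

def write_tfrecord(out_path, names, png_paths):
    with tf.io.TFRecordWriter(out_path) as writer:
        for name, path in zip(names, png_paths):
            with open(path, "rb") as f:
                png_bytes = f.read()
            # A bytes_list feature holds blobs of any length, so the PNGs
            # can be stored as-is, still compressed.
            example = tf.train.Example(features=tf.train.Features(feature={
                "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[name.encode()])),
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[png_bytes])),
            }))
            writer.write(example.SerializeToString())

def parse_example(serialized):
    features = {
        "name": tf.io.FixedLenFeature([], tf.string),
        "image": tf.io.FixedLenFeature([], tf.string),
    }
    parsed = tf.io.parse_single_example(serialized, features)
    # Decode only at training time.
    return parsed["name"], tf.io.decode_png(parsed["image"])
```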


@marcemile I did look at TFRecord, and it’s great when the data is pre-shuffled. I can then train off a spinning hard drive because IOPS are no longer a problem, given that I can read the dataset sequentially.

For my use cases the data is usually visual/images, and I keep it compressed as PNGs, not as bitmaps/matrices. The last time I looked, neither TFRecord nor HDF5 worked well with non-uniformly shaped data (such as binary image blobs). Do you know if this has changed, or am I missing something?

I don’t understand why “non-uniformly shaped data” would be an issue. If you still have the source, I would like to have a look at it.

Regarding shuffling, I can suggest this notebook; it does a good job of explaining the different strategies: Google Colab
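
One common approach (not necessarily exactly what the notebook does) is to shuffle at two levels: the order of the shard files and a buffer of records read from several shards at once. Buffer sizes below are placeholders:

```python
import tensorflow as tf

# Shuffle shard order, interleave reads across shards, then shuffle records in a buffer.
files = tf.data.Dataset.list_files("shards/*.tfrecord", shuffle=True)
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.AUTOTUNE,
)
dataset = dataset.shuffle(buffer_size=10_000).batch(64).prefetch(tf.data.AUTOTUNE)
```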