When solving certain tasks, I have to deal with datasets consisting of tens of millions of images.
One of the difficulties I want to bring up here is how to lay them out in a filesystem. My initial attempts to store all images in one folder failed miserably: I experienced rapid slowdowns when writing and reading files from it. I tried ext4, XFS, and NTFS. Splitting the images into subfolders helps somewhat on some of these filesystems.
I am not an expert in filesystems and my conclusions may be wrong, but I ended up with the opinion that using a general-purpose filesystem as a container for a dataset is overkill, because the filesystem has to spend a lot of time and space supporting use cases that are not necessary for a training dataset. Specifically, for a training dataset I only need to read the bytes corresponding to an image - and that’s it. A general-purpose filesystem additionally manages things such as creation time, last modification time, listing files in a directory, and a lot of other things I am not even aware of.
Not only is reading and writing files slow on a filesystem with tens of millions of files, but managing such a dataset is extremely slow as well. What typically happens: I keep the dataset in low-bandwidth, low-IOPS storage (S3, a spinning hard drive, or some kind of network-attached storage) and copy it to an SSD or locally attached storage when I need to use it. This copy is painfully slow if the dataset consists of millions of individual files.
My initial solution was simple: put every image into a tar file and keep a separate index file. The index file contains the starting and ending byte offset of every image inside the tar. From the underlying filesystem’s perspective these are just two files: easy to move around, and causing no slowdown.
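For anyone curious what that looks like in practice, here is a minimal sketch of the tar-plus-index idea. The file names, the JSON index format, and the `read_image_bytes` helper are illustrative choices of mine, not the exact code I use:

```python
import json
import tarfile
from pathlib import Path

# Hypothetical paths, for illustration only.
image_dir = Path("images")
tar_path = Path("dataset.tar")
index_path = Path("dataset.index.json")

# 1) Pack every image into a single tar archive.
with tarfile.open(tar_path, "w") as tar:
    for img in sorted(image_dir.glob("*.jpg")):
        tar.add(img, arcname=img.name)

# 2) Scan the archive once and record (start, end) byte offsets per image.
index = {}
with tarfile.open(tar_path, "r") as tar:
    for member in tar.getmembers():
        index[member.name] = (member.offset_data, member.offset_data + member.size)
index_path.write_text(json.dumps(index))

# 3) Reading one image back is a single seek + read on one big file.
def read_image_bytes(name: str) -> bytes:
    start, end = index[name]
    with open(tar_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```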
Over time, I realized that managing the index file is a chore. That’s when I switched to SQLite. SQLite is awesome! Each of my huge datasets is now a single SQLite file with a single table and two columns: a key/name, and a binary blob containing the file. SQLite manages the “index” for me and allows easy and reliable updates to the dataset. From a performance perspective, I see no noticeable difference when reading random images compared to my initial tar-container method or the traditional photos-as-files method.
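Here is a minimal sketch of that layout and the read/write path, using only Python’s built-in `sqlite3` module. The `images` table and its `name`/`data` column names are just my illustrative choices:

```python
import sqlite3
from pathlib import Path

db_path = "dataset.sqlite"  # hypothetical filename

# One table, two columns: a key/name and the raw image bytes.
con = sqlite3.connect(db_path)
con.execute(
    "CREATE TABLE IF NOT EXISTS images (name TEXT PRIMARY KEY, data BLOB NOT NULL)"
)

# Ingest: insert every image as a blob, keyed by filename, inside one transaction.
with con:
    for img in Path("images").glob("*.jpg"):
        con.execute(
            "INSERT OR REPLACE INTO images (name, data) VALUES (?, ?)",
            (img.name, img.read_bytes()),
        )

# Random access: fetch the bytes of one image by key.
row = con.execute(
    "SELECT data FROM images WHERE name = ?", ("some_image.jpg",)
).fetchone()
image_bytes = row[0] if row else None

con.close()
```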
Additionally, there are SQLite viewers/clients that support blobs, recognize them as images, and can display them - meaning that I can still manually explore the dataset if I want to.
Now, my goal with this post is threefold:
- Being no filesystem expert, am I missing something? Has the community figured out how to store millions of files on a filesystem without slowdowns?
- If not, I suggest trying SQLite. My experience using SQLite as a container for a dataset is extremely positive.
- I am completely new to the fastai library but I am immediately a huge fan. Would it be useful if it had code supporting datasets stored in SQLite, similar to how fastai works beautifully with datasets stored as individual files? (A rough sketch of what I have in mind follows after this list.)
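I do not know fastai’s internals well enough to propose an actual API, so purely as an illustration of the idea, here is a rough sketch of a plain PyTorch `Dataset` that serves images out of the SQLite layout described above. The `SQLiteImageDataset` class and the `images`/`name`/`data` schema are my own hypothetical names:

```python
import io
import sqlite3

from PIL import Image
from torch.utils.data import Dataset


class SQLiteImageDataset(Dataset):
    """Read-only dataset over a single-table SQLite file (name TEXT, data BLOB)."""

    def __init__(self, db_path, transform=None):
        self.db_path = db_path
        self.transform = transform
        # Materialize the list of keys once; blobs are fetched lazily per item.
        con = sqlite3.connect(db_path)
        self.names = [r[0] for r in con.execute("SELECT name FROM images ORDER BY name")]
        con.close()
        self._con = None  # opened lazily, see _connection()

    def _connection(self):
        # Open the connection on first use so that each DataLoader worker
        # process creates its own; sqlite3 connections should not be shared
        # across processes.
        if self._con is None:
            self._con = sqlite3.connect(self.db_path)
        return self._con

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        (blob,) = self._connection().execute(
            "SELECT data FROM images WHERE name = ?", (name,)
        ).fetchone()
        img = Image.open(io.BytesIO(blob)).convert("RGB")
        if self.transform is not None:
            img = self.transform(img)
        return img, name
```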