Hi!
Sometimes one needs to feed large (as in larger-than-RAM) datasets into the learner. For images or CSVs there are solutions in fastai. However, if the instances are n-dimensional float arrays, things get fuzzy: they are too big to spill out as CSVs, and there are no routines to load large chunks of floats out of core. Saving one instance per file is again not something I would do (e.g. 10M or more files, each a few KB to a few MB).
Following the threads Data block api for hdf5 and data streaming and Can DataBunch or data_block api work with hdf5?, and of course my own needs, I spent some quality hours investigating this issue.
HDF5
It is a nice data format! Out of core by default; I have played with it for the past 5 years. But multiprocessing is a no-go for it: I wrote an hdf5-backed ItemList and it silently corrupted the data.
Memory mapped binary files
This is a feature from numpy: the OS loads (maps) only the relevant parts of a large file into memory, and afaik it does so on demand. Completely transparent to the user.
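Here is a minimal sketch of what numpy memory mapping looks like; the file name and shape are just illustrative:

```python
import numpy as np

# One-time: write a big float array to disk as a raw binary file.
shape = (1000, 3, 64, 64)
mm = np.memmap("train_data.bin", dtype=np.float32, mode="w+", shape=shape)
mm[:] = np.random.rand(*shape).astype(np.float32)
mm.flush()
del mm

# Later (e.g. inside ItemList.get): re-open read-only. The OS maps the file
# and pulls in only the pages you actually index, on demand.
data = np.memmap("train_data.bin", dtype=np.float32, mode="r", shape=shape)
x = np.array(data[42])  # copies a single instance out of the mapping
```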
Here is a standalone code for such an ItemList: https://pastebin.com/WiEQuFnE
Caveats:
- One needs to carry around the shape of the data, since numpy memmap does not store it. (See the sketch after this list for one way to handle that.)
- For me, debugging from PyCharm with multiprocessing active is a no-go: the main thread gets killed, probably something to do with the pytorch multiprocessing library. Use `num_workers=0` when creating the DataBunch for debugging purposes.
- No NLP support. I don't have production experience with NLP, so I don't really know the tiny little details about the data format there.
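For the shape caveat, a minimal sketch of one way to carry it around; the helper names and the `.meta` sidecar convention are mine, not part of numpy:

```python
import json
import numpy as np

def save_memmap(path, array):
    # Write the raw data plus a tiny JSON sidecar holding dtype and shape.
    mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
    mm[:] = array
    mm.flush()
    with open(path + ".meta", "w") as f:
        json.dump({"dtype": str(array.dtype), "shape": list(array.shape)}, f)

def load_memmap(path):
    # Re-open read-only using the sidecar, so callers never hard-code shapes.
    with open(path + ".meta") as f:
        meta = json.load(f)
    return np.memmap(path, dtype=meta["dtype"],
                     mode="r", shape=tuple(meta["shape"]))
```

For the debugging caveat, something like `data = item_lists.databunch(bs=64, num_workers=0)` (names illustrative) keeps all loading in the main process.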
If you find this useful, feedback is more than welcome!