Hi!
Sometimes one needs to feed large (as in larger-than-RAM) datasets into the learner. For images or CSVs there are solutions in fastai. However, if the instances are n-dimensional float arrays, things get fuzzy: they are too big to spill out as CSVs, and there are no routines to load large chunks of floats out of core. Saving one instance per file is again not something I would do (e.g. 10M or more files, each a few KB to a few MB).
Following the threads Data block api for hdf5 and data streaming and Can DataBunch or data_block api work with hdf5?, and of course my own needs, I spent some quality hours investigating this issue.
HDF5
It is a nice data format! Out of core by default; I have played with it for the past 5 years. But multiprocessing is a no-go for it: I wrote an hdf5-backed ItemList and it silently corrupted the data.
Memory mapped binary files
This is a feature from numpy: the OS loads (maps) only the relevant parts of a large file into memory, and afaik it does so on demand. Completely transparent to the user.
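Here is a minimal sketch of what numpy memory mapping looks like; the file name and shape are just illustrative:

```python
import numpy as np

# One-time: write a big float array to disk as a raw binary file.
shape = (1000, 3, 64, 64)
mm = np.memmap("train_data.bin", dtype=np.float32, mode="w+", shape=shape)
mm[:] = np.random.rand(*shape).astype(np.float32)
mm.flush()
del mm

# Later (e.g. inside ItemList.get): re-open read-only. The OS maps the file
# and pulls in only the pages you actually index, on demand.
data = np.memmap("train_data.bin", dtype=np.float32, mode="r", shape=shape)
x = np.array(data[42])  # copies a single instance out of the mapping
```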
Here is a standalone code for such an ItemList: https://pastebin.com/WiEQuFnE
Caveats:
- One needs to carry around the shape of the data, since numpy memmap does not store it. (See the sketch after this list for one way to handle that.)
- For me, debugging from PyCharm with multiprocessing active is a no-go: the main thread gets killed, probably something to do with the pytorch multiprocessing library. Use `num_workers=0` when creating the DataBunch for debugging purposes.
- No NLP support. I don't have production experience with NLP, so I don't really know the tiny little details about the data format there.
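For the shape caveat, a minimal sketch of one way to carry it around; the helper names and the `.meta` sidecar convention are mine, not part of numpy:

```python
import json
import numpy as np

def save_memmap(path, array):
    # Write the raw data plus a tiny JSON sidecar holding dtype and shape.
    mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
    mm[:] = array
    mm.flush()
    with open(path + ".meta", "w") as f:
        json.dump({"dtype": str(array.dtype), "shape": list(array.shape)}, f)

def load_memmap(path):
    # Re-open read-only using the sidecar, so callers never hard-code shapes.
    with open(path + ".meta") as f:
        meta = json.load(f)
    return np.memmap(path, dtype=meta["dtype"],
                     mode="r", shape=tuple(meta["shape"]))
```

For the debugging caveat, something like `data = item_lists.databunch(bs=64, num_workers=0)` (names illustrative) keeps all loading in the main process.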
If you find this useful, feedback is more than welcome!