Out-of-core data block ItemList backed by memmap files

Hi!

Sometimes one needs to feed large (larger-than-RAM) datasets into the learner. For images or CSVs there are solutions in fastai. However, if the instances are n-dimensional float arrays, things get fuzzy: the data is too big to spill to CSVs, and there are no routines to load large chunks of floats out of core. Saving one instance per file is again not something I would do (e.g. 10M or more files, each a few KB to a few MB).

Following the threads "Data block api for hdf5 and data streaming" and "Can DataBunch or data_block api work with hdf5?", and of course my own needs, I spent some quality hours investigating this issue.

HDF5
It is a nice data format, and out of core by default. I have played with it for the past 5 years, but multiprocessing is a no-go for it. I wrote an HDF5-backed ItemList and it silently corrupted the data.

Memory-mapped binary files
This is a NumPy feature (numpy memmap). The OS basically loads (maps) only the relevant parts of a large file into memory and, as far as I know, does so on demand. It is completely transparent to the user.
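To make the idea concrete, here is a minimal sketch of writing and then reading a memory-mapped float array with numpy.memmap. The file name, shape and chunk size are made up for illustration, and this is not the code from the pastebin link.

```python
import os
import tempfile

import numpy as np

# Hypothetical file name and shape, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "features.dat")
shape = (1000, 16)   # (n_instances, n_features)
dtype = np.float32

# Write phase: create the file on disk and fill it chunk by chunk,
# so the whole array never has to be built in RAM at once.
writer = np.memmap(path, dtype=dtype, mode="w+", shape=shape)
for start in range(0, shape[0], 250):
    stop = start + 250
    # Each row is filled with its own index (broadcast over the 16 columns).
    writer[start:stop] = np.arange(start, stop, dtype=dtype)[:, None]
writer.flush()
del writer

# Read phase: map the file read-only. The raw file has no header,
# so dtype and shape must be supplied again.
reader = np.memmap(path, dtype=dtype, mode="r", shape=shape)
row = np.array(reader[123])  # copy one instance into regular memory
```

Indexing `reader` only touches the parts of the file that are needed for that index, which is what makes this usable for datasets that do not fit in RAM.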

Here is standalone code for such an ItemList: https://pastebin.com/WiEQuFnE
Caveats:

  • One needs to carry around the shape of the data (a limitation of numpy memmap).
  • For me, debugging from PyCharm with multiprocessing active is a no-go: the main thread gets killed. It probably has something to do with PyTorch's multiprocessing library. Use num_workers=0 when creating the DataBunch for debugging purposes.
  • No NLP support. I don't have production experience with NLP, so I don't really know the fine details of the data format there.
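One possible way to soften the first caveat is to store the shape and dtype in a small sidecar file next to the raw data. The sketch below assumes a JSON sidecar; the helper `save_memmap` and the class `MemmapItems` are hypothetical names for illustration only, not part of fastai and not the ItemList from the pastebin.

```python
import json
import os
import tempfile

import numpy as np


def save_memmap(path, array):
    """Write `array` to a raw binary file plus a JSON sidecar (hypothetical layout)."""
    mm = np.memmap(path, dtype=array.dtype, mode="w+", shape=array.shape)
    mm[:] = array
    mm.flush()
    del mm
    with open(path + ".json", "w") as f:
        json.dump({"shape": list(array.shape), "dtype": str(array.dtype)}, f)


class MemmapItems:
    """Minimal indexable container; the file is mapped lazily on first access."""

    def __init__(self, path):
        with open(path + ".json") as f:
            meta = json.load(f)
        self.path = path
        self.shape = tuple(meta["shape"])
        self.dtype = np.dtype(meta["dtype"])
        self._mm = None  # not opened yet

    def __len__(self):
        return self.shape[0]

    def __getitem__(self, i):
        if self._mm is None:
            self._mm = np.memmap(self.path, dtype=self.dtype, mode="r", shape=self.shape)
        return np.array(self._mm[i])  # copy a single instance out of the map


# Usage with made-up data: 100 instances of shape (3, 4), stored as float64.
demo_path = os.path.join(tempfile.mkdtemp(), "items.dat")
data = np.random.rand(100, 3, 4).astype(np.float64)
save_memmap(demo_path, data)
items = MemmapItems(demo_path)
```

Opening the map lazily is a design choice, not a requirement: the intent is that an object created in the main process and accessed only in workers lets each worker open its own map. An alternative that avoids the sidecar is the .npy format, whose header already records shape and dtype; np.load(path, mmap_mode="r") returns a memory-mapped array for such files.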

If you find this useful, feedback is more than welcome!

Apparently I can’t edit the previous post.

Updates:

There was a bug in indexing, now corrected.
Data types other than float32 are now supported.

What this tells me is NOT to try, since you seem to have worked on HDF5 for a long time but still haven’t got it working. Fastai isn’t really good for those with some but not a lot of experience; I have to use PyTorch instead.