Data block API for HDF5 and data streaming

I looked through the data block API and searched around, but couldn’t find anything related to HDF5.
If it is absent, this could be a feature proposal.

"It is widely used in the scientific community for everything from NASA’s Earth Observing System to the storage of data from laboratory experiments and simulations." -

Instead of creating another post, I will add a quick note here. Feedback and corrections are welcome.

  • The data block API should have a means of taking in a “data stream”, i.e. an endless supply of inputs (not data augmentation). This is especially useful for synthetically generated data, or for “feedback from the environment” as in an RL setting.
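
To make the "data stream" idea concrete, here is a minimal sketch in plain Python. It is not a fastai API: `synthetic_stream` and `batched` are hypothetical names, and the target `y = 2x + 1` is made up for illustration. The point is only that an endless generator can be cut into batches, and an "epoch" becomes a fixed number of batches.

```python
import itertools
import random
from typing import Iterator, List, Tuple

Sample = Tuple[float, float]

def synthetic_stream(seed: int = 0) -> Iterator[Sample]:
    """Yield (x, y) pairs forever. The target y = 2x + 1 is a made-up example."""
    rng = random.Random(seed)
    while True:
        x = rng.uniform(-1.0, 1.0)
        yield x, 2.0 * x + 1.0

def batched(stream: Iterator[Sample], batch_size: int) -> Iterator[List[Sample]]:
    """Group a (possibly endless) stream into lists of batch_size samples."""
    while True:
        batch = list(itertools.islice(stream, batch_size))
        if not batch:  # only reached if the stream is finite and exhausted
            return
        yield batch

# With an endless stream, an "epoch" is just a fixed number of batches.
batches = list(itertools.islice(batched(synthetic_stream(), batch_size=4), 3))
```

The same shape would apply to an RL environment: replace the body of the generator with whatever produces the next observation.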

No, there is nothing yet; this would probably require a custom ItemList. It isn’t something we plan to add in the mid-term but, as always, we’re happy to accept a PR.

I want something along these lines, too. I can generate hundreds of GB of data and I need to feed them through a DataBunch somehow. Out of core, of course.

AFAIK there is no easy way to do that unless I save my data as CSV or as a pandas DataFrame. I was thinking about saving a bunch of numpy arrays and loading them on the fly. But apparently I can load them using data_block [without much coding] if every npz file is one instance.
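
For reference, the "one npz file per instance, loaded on the fly" idea could look roughly like this. This is a sketch with plain numpy, not fastai code: `NpzFolder` and the array keys `x` and `y` are hypothetical names chosen for the example. Only the file list lives in memory; each sample is read from disk when it is indexed.

```python
import tempfile
from pathlib import Path

import numpy as np

class NpzFolder:
    """Hypothetical lazy dataset: one .npz file per sample, read on access."""

    def __init__(self, folder):
        # Sorting gives a stable index -> file mapping.
        self.files = sorted(Path(folder).glob("*.npz"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        # The file is opened, read and closed for every access.
        with np.load(self.files[i]) as f:
            return f["x"], int(f["y"])

# Demo data: five tiny samples, each stored together with its label.
tmp = tempfile.mkdtemp()
for i in range(5):
    np.savez(Path(tmp) / f"sample_{i:03d}.npz",
             x=np.full((2, 3), i, dtype=np.float32), y=i)

ds = NpzFolder(tmp)
x, y = ds[3]
```

Keeping the item and its label in the same file means they cannot drift apart, at the cost of many small files on disk.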

I will keep digging, but after @sgugger’s answer, coding a custom ItemList plus saving to HDF5 looks like the easiest way.

HDF5 does not play well with multiprocessing.
Opening the same file in separate processes is not reliable.
I coded a small ItemList subclass that loads two tables (one for items and one for labels) and, pretty often, the items are NOT aligned with the labels.

Check line 20 on this file:

For more experiments, check this thread: Get meaningful error messages out of data_bunch loaders

[spoiler: it works-ish with NumPy memory-mapped files]

LE: It works well with memory mapped files! Out of core data block ItemList backed up by memmap files
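
For anyone landing here, a rough sketch of the memmap idea with plain numpy. This is not the code from the linked thread: `MemmapPairs`, the file names, dtypes and shapes are all assumptions made for the example. Items and labels live in two flat files and are read with the same index; the files are opened lazily, so each process that touches the dataset opens its own read-only view.

```python
import os
import tempfile

import numpy as np

class MemmapPairs:
    """Hypothetical out-of-core dataset backed by two memmap files."""

    def __init__(self, x_path, y_path, n, item_shape):
        self.x_path, self.y_path = x_path, y_path
        self.n, self.item_shape = n, tuple(item_shape)
        self._x = self._y = None  # opened on first access, per process

    def _open(self):
        if self._x is None:
            self._x = np.memmap(self.x_path, dtype=np.float32, mode="r",
                                shape=(self.n, *self.item_shape))
            self._y = np.memmap(self.y_path, dtype=np.int64, mode="r",
                                shape=(self.n,))

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        self._open()
        # Copy the row out so the caller gets a regular in-memory array.
        return np.array(self._x[i]), int(self._y[i])

# Demo: write 100 items of shape (8,) and their labels, then read them back.
tmp = tempfile.mkdtemp()
x_path, y_path = os.path.join(tmp, "x.dat"), os.path.join(tmp, "y.dat")
n, shape = 100, (8,)
x = np.memmap(x_path, dtype=np.float32, mode="w+", shape=(n, *shape))
y = np.memmap(y_path, dtype=np.int64, mode="w+", shape=(n,))
for i in range(n):
    x[i] = i  # item i is filled with the value i
    y[i] = i  # so the label can be checked against the item
x.flush()
y.flush()
del x, y

ds = MemmapPairs(x_path, y_path, n, shape)
item, label = ds[42]
```

Because the raw files carry no metadata, the dtype and shape have to be stored somewhere else (here they are simply passed to the constructor).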