With many datasets consisting of very large image libraries, I still haven’t found a good way to manage and store the data. In a perfect world I’d like the following:
- No bottleneck during training
- Stored on disk instead of in memory
- One file each for training, testing and validation
Jeremy mentions bcolz, but it doesn’t seem to store everything in one file; instead it writes a folder containing many small files. This is a little problematic when copying and syncing data, because lots of small files result in bad transfer speeds. Many datasets are provided in a pickled format, which seems to bottleneck the training process because the whole pickle has to be read from disk and deserialized at once. Using h5py also seems a little slow, and the data isn’t compressed very well with LZF.
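For reference, here’s a minimal sketch of the h5py approach I’m describing: everything in a single chunked, LZF-compressed file, with batches read straight from disk during training. All filenames, shapes and dtypes here are made up, so adjust them to your dataset:

```python
import numpy as np
import h5py

n_train, h, w, c = 1000, 256, 256, 3  # made-up dataset dimensions

# Write: one HDF5 file holding all training images and labels.
with h5py.File('train.h5', 'w') as f:
    # One chunk per image, so reading a single sample doesn't
    # decompress its neighbours. LZF is fast but compresses weakly.
    images = f.create_dataset('images', shape=(n_train, h, w, c),
                              dtype='uint8', chunks=(1, h, w, c),
                              compression='lzf')
    labels = f.create_dataset('labels', shape=(n_train,), dtype='int64')
    for i in range(n_train):
        img = np.zeros((h, w, c), dtype='uint8')  # stand-in for a real image
        images[i] = img
        labels[i] = 0

# Read: slice batches directly from the file; only the requested
# chunks are decompressed into memory.
with h5py.File('train.h5', 'r') as f:
    x_batch = f['images'][:64]
    y_batch = f['labels'][:64]
```

(Swapping `compression='lzf'` for `compression='gzip', compression_opts=4` compresses better, but gzip is slower to decode, so it may just trade disk space for read speed.)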
So how are you managing large datasets that might not fit into memory without affecting training performance?
Thanks,
Pietz