Many datasets consist of very large image libraries, and I still haven’t found a good way to store and manage them. In a perfect world I’d like the following:
- No bottleneck during training
- Stored on disk instead of memory
- One file each for training, testing and validation
Jeremy mentions bcolz, but it doesn’t store everything in one file — it writes one folder containing many small files. That’s a problem when copying and syncing data, because lots of small files lead to poor transfer speeds. Many datasets are provided in a pickled format, which seems to bottleneck training because the data has to be read from disk. Using H5PY also seems a little slow, and the data doesn’t compress very well with LZF.
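For reference, here’s roughly what my current h5py attempt looks like — one file per split, chunked per mini-batch so training can read from disk without loading everything into memory. The dataset names, shapes, and chunk size are just placeholders for the example:

```python
import numpy as np
import h5py

n_samples, h, w, c = 10000, 224, 224, 3  # made-up sizes

with h5py.File('train.h5', 'w') as f:
    images = f.create_dataset(
        'images', shape=(n_samples, h, w, c), dtype='uint8',
        chunks=(32, h, w, c),      # one chunk roughly equals one mini-batch
        compression='lzf')         # fast, but compression ratio is weak
    labels = f.create_dataset('labels', shape=(n_samples,), dtype='int64')

    # write in batches instead of holding the whole array in memory
    for start in range(0, n_samples, 1000):
        end = min(start + 1000, n_samples)
        images[start:end] = np.random.randint(0, 256, (end - start, h, w, c), dtype='uint8')
        labels[start:end] = np.random.randint(0, 10, end - start)

# reading a mini-batch later only touches the chunks it needs
with h5py.File('train.h5', 'r') as f:
    x_batch, y_batch = f['images'][:32], f['labels'][:32]
```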
So how are you managing large datasets that might not fit into memory without affecting training performance?