Managing large datasets in memory and on disk

Many datasets consist of very large image libraries, and I still haven't found a good way to manage and store them. In a perfect world I'd like the following:

  1. No bottleneck during training
  2. Stored on disk instead of memory
  3. One file each for training, testing and validation

Jeremy mentions bcolz; however, it doesn't seem to store everything in one file, but rather in one folder consisting of many files. This is a little problematic when copying and syncing data, because many small files result in bad transfer speeds. Many datasets are provided in a pickled format, which seems to bottleneck the training process because it's stored on disk. Using h5py also seems a little slow, and the data isn't compressed very well when using LZF.
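For reference, here's roughly how I've been writing the h5py files, in case I'm holding it wrong. This is a sketch with made-up shapes and filenames, using gzip instead of LZF and chunking along the sample axis:

```python
import numpy as np
import h5py

# toy data standing in for an image library (shapes/filename are made up)
X = np.random.randint(0, 255, size=(256, 32, 32, 3)).astype(np.uint8)

with h5py.File('train.h5', 'w') as f:
    # chunk along the sample axis so one read pulls in 32 whole images;
    # gzip usually compresses better than LZF, at some CPU cost
    f.create_dataset('X', data=X, chunks=(32, 32, 32, 3),
                     compression='gzip', compression_opts=4)

with h5py.File('train.h5', 'r') as f:
    batch = f['X'][:32]  # slicing reads just the needed chunk(s) from disk
```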

So how are you managing large datasets that might not fit into memory without affecting training performance?



You can use bcolz with a tarpipe to copy quickly. Although rsync is fast with small files nowadays assuming you have a decent filesystem.
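Something like this, for example, using only the stdlib (directory and file names here are made up; the shell tarpipe is in the comment):

```python
import os
import tarfile

# stand-in for a bcolz rootdir: a directory of many small chunk files
os.makedirs('trn.bcolz/data', exist_ok=True)
for i in range(4):
    with open('trn.bcolz/data/chunk_%d.blp' % i, 'wb') as f:
        f.write(b'\x00' * 1024)

# bundle the whole directory into one archive, so the copy becomes a
# single sequential transfer instead of thousands of tiny ones
with tarfile.open('trn.bcolz.tar', 'w') as tar:
    tar.add('trn.bcolz')

# shell equivalent of the tarpipe, straight to a remote machine:
#   tar cf - trn.bcolz | ssh remote 'tar xf - -C /data'
```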

Thanks for your answer, Jeremy. I'll give it another shot. The more I'm experiencing bottlenecks with other formats, the more I feel like bcolz is the best solution.

Hi Pietz,

The standard load commands for bcolz arrays do not seem to support streaming; however, take a look at:

I found a comment somewhere on Kaggle by someone who made a bcolz streaming iterator, which gives your training set a much smaller memory footprint. Ideal for when you save intermediate layer data, which can be quite a memory hog.
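I can't find the link right now, but the idea is roughly this. A sketch, not the actual Kaggle code; it works with anything sliceable, and for a bcolz carray you'd pass arr.chunklen as chunklen:

```python
import numpy as np

def chunk_iter(arr, chunklen, shuffle=True, seed=None):
    """Yield one chunk of `arr` at a time, optionally in random chunk order.

    Only `chunklen` rows are ever resident in memory, and shuffling chunk
    order (rather than single rows) keeps the disk reads sequential.
    A bcolz carray can be passed directly, since it supports slicing.
    """
    rng = np.random.RandomState(seed)
    starts = np.arange(0, len(arr), chunklen)
    while True:  # loop forever, like a Keras generator
        if shuffle:
            rng.shuffle(starts)
        for s in starts:
            yield arr[s:s + chunklen]

# tiny demo with an in-memory array standing in for an on-disk one
data = np.arange(10)
it = chunk_iter(data, chunklen=4, shuffle=False)
first = next(it)  # array([0, 1, 2, 3])
```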

Let me know if you know of a way to stream the output of a prediction batch to a bcolz format :slight_smile:



You can also try using Dask. It can load data from multiple formats (bcolz, hdf, saved npy files, etc) and can gather data across multiple machines too (although I haven’t tried that).

You can use a Dask array with the normal Keras method, or with a bit of tweaking of the Keras NumpyArrayIterator you can use one with that as well.

If you know the size of your output you should be able to stream the output of your model to a Dask array as well.
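Even without Dask-specific machinery: if you know the output size, you can preallocate a single .npy file with numpy's open_memmap and fill it batch by batch. A sketch with made-up sizes and a stand-in for the model:

```python
import numpy as np
from numpy.lib.format import open_memmap

n_samples, n_features, batch_size = 100, 8, 32  # made-up sizes

# preallocate a single .npy file on disk; predictions never all sit in RAM
out = open_memmap('preds.npy', mode='w+', dtype=np.float32,
                  shape=(n_samples, n_features))

def predict(batch):
    # stand-in for model.predict_on_batch(batch)
    return np.ones((len(batch), n_features), dtype=np.float32)

X = np.zeros((n_samples, 4), dtype=np.float32)  # dummy inputs
for i in range(0, n_samples, batch_size):
    out[i:i + batch_size] = predict(X[i:i + batch_size])
out.flush()
# preds.npy can now be opened lazily with np.load('preds.npy', mmap_mode='r')
```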


Thank you too. I will check them both!

I took a look at this today and it looks great. I’ve suggested some minor changes to the PR and look forward to using it in the course :slight_smile:

@davecg Thinking of Dask, but how well do Dask arrays deal with random element access (shuffle=True)? Seems like you might need some kind of custom iterator like the one for Bcolz?

Just wondering if you knew of something extant.


It works out of the box with shuffle=True, since you’re actually shuffling indices.

The main difference is that you need to use model.fit() instead of model.fit_generator(), since Keras thinks it is just a normal numpy array. Somewhere under the hood, Dask deals with how to actually fetch the different pieces of the array.

I’m sure there are a bunch of ways to optimize performance of Dask that I haven’t figured out yet so if anyone else wants to try it and comes across anything useful let me know. I haven’t tried messing around with the scheduler or chunk sizes or anything like that yet, and sadly do not have access to a cluster at the moment so no need for dask.distributed although that looks really cool too…

Right, Dask arrays will definitely work out of the box, but because it's shuffling indexes you'll end up loading things from disk in no particular order, which is likely to be inefficient, especially if it's doing something like bcolz and loading data a chunk at a time.

As Jeremy mentioned in class, the reason for creating a bcolz array iterator was to avoid the inefficiency that comes from loading random chunks of data into memory for a single item. I'm guessing we'd need something similar for Dask.

Have you ever compared the performance of Dask vs bcolz? I'm a little busy at work these days, but I'll do it as soon as I find the time.

Apparently bcolz will use Dask internally:

bcolz can use numexpr or dask internally (numexpr is used by default if installed, then dask and if these are not found, then the pure Python interpreter) so as to accelerate many internal vector and query operations (although it can use pure NumPy for doing so too). numexpr can optimize memory (cache) usage and uses multithreading for doing the computations, so it is blazing fast. This, in combination with carray/ctable disk-based, compressed containers, can be used for performing out-of-core computations efficiently, but most importantly transparently.

I don’t know exactly what this means (I would need to dig more), but it sounds like they use Dask or numexpr internally to do the computations in parallel. I suspect that bcolz is probably optimized for very fast disk based array and table access, while libraries like Dask are optimized for allowing distributed access to disk based arrays and tables.

If there are any experts that know both reasonably well, your opinion would be greatly appreciated!

Dask array can also read bcolz arrays.

I haven’t done any real benchmarking, just really liked how easy Dask made coding (particularly the delayed decorator) and haven’t noticed a significant delay (although haven’t done a formal comparison).

I’ll try it on ImageNet when I have time.

This is just if you use bcolz.eval - not really relevant to this discussion.

I can't see any way that truly shuffled access could ever be fast if it's coming from spinning disk and the dataset is larger than RAM. Disks have poor random access performance, and there's no getting around that! So randomizing chunks seems like the best approach.


I have been using bcolz since part 1. On revisiting some older code it wouldn’t run smoothly, because an option in bcolz has been replaced and some changes have been made to the compression algorithms.
I used to encapsulate large bcolz arrays in

with bcolz.defaults_ctx(vm='python', cparams=bcolz.cparams(clevel=5, cname='zstd')):

so I could finetune the compression level and speed.
The attribute .defaults_ctx isn’t there anymore (after upgrading to py3 and updating bcolz in the process).
It appears to have been replaced by


which outputs {'clevel': 6, 'cname': 'snappy', 'shuffle': 1}
You can write to this dictionary directly.
Now, I have several problems with that. One being that the bcolz arrays I made in February (using py2.7) can't be opened anymore. Not sure why, but some

Exception ignored in: 'bcolz.carray_ext.chunk._getitem'
RuntimeError: fatal error during Blosc decompression: -1

occurs and images look black.

The manual on page 11 gives a list of possible cnames (i.e. compression algorithms), but doesn’t explain which one to use when. There is a link to the Blosc documentation, which in turn links to respective projects, but I find it hard to make up my mind which one to use when, especially, since the default has changed and my old data isn’t read correctly.

Can someone advise on how to set the parameters in bcolz under py3 to get it working optimally?

Another observation: with the large dataset I used to be able to write into a carray (just barely, I guess), but now I get a MemoryError. Perhaps it is related to the changed default compression algorithm, too. Oh boy. Sorry, can someone comment or help?

Ick that’s not good! :frowning: Can you downgrade bcolz (or just create a new conda env for py2 with older bcolz) and use that to convert your saved arrays to snappy compression? Hopefully then it’ll be compatible…

Otherwise, maybe create a github issue and see if the devs can help - if you do this, please let us know what you find!

OK, I filed an issue on Github. Let’s see what happens.
For my old data I'll have to create a Python 2.7 environment, I think. I did not write down which version the old one was, but I'd bet there are multiple dependencies tied to Python 3.6 now. Maybe that's the cleanest strategy for dealing with legacy code.

Thanks for this tip on Dask! I will have to check this out. I have been running into memory issues on lesson 7 (having to load all the image data into memory for use with the metadata in the multi-input model, for example). It prompted me to go and get more memory for my old Win 7 machine.

It would be much better to be able to utilize Dask for this situation. :slight_smile:

Thanks, Christina

It’s definitely pretty cool and getting it to work is pretty straightforward, but I’m sure there’s a lot of room for optimization.


import numpy as np
from scipy import ndimage
from pathlib import Path

import dask.array as da
from dask.delayed import delayed

shape = (64, 64, 64)  # target volume size (example value)

@delayed
def load_data(fp, shape=shape):
    # load 3D numpy array files, e.g. for Kaggle DSB
    img = np.load(str(fp))
    # rescale to the target size
    img = ndimage.zoom(img, np.array(shape) / img.shape)
    # add channel dim
    return np.expand_dims(img, -1)

p = Path('/path/to/data')

data = da.stack([da.from_delayed(load_data(fp), shape=shape + (1,), dtype=np.int32)
                 for fp in p.rglob('*.npy')])

print(data.shape)  # works without a generator, although it's trivial to wrap with one, which should improve speed

Thanks, this helps a lot.

However, please pardon my ignorance: does this mean I need to convert all the *.jpg files into *.npy files? When I call the da.stack method, I pass *.jpg into rglob, since I want to read the images on my drive:

trn = da.stack([da.from_delayed(load_data(fp), shape=input_shape_full, dtype=np.int32) for fp in train_p.rglob('*.jpg')])

When I then call model.fit(trn, [trn_labels, trn_bbox], batch_size=batch_size, nb_epoch=nb_epoch, validation_data=(val, [val_labels, val_bbox]))

I get this error as it tries to load the first image:

OSError: Failed to interpret file 'train\ALB\img_00043.jpg' as a pickle

i.e. it's being handed the pointer to the file rather than an array containing the file data. What am I missing? The Dask documentation says you can use Dask arrays interchangeably with NumPy arrays.

Thanks, Christina
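np.load only understands .npy/.npz files; anything else it tries to unpickle, which is exactly the error you're seeing. So either convert the JPEGs to .npy once up front, or swap np.load for an image decoder in the loader. A sketch with Pillow (the target size is a made-up example):

```python
import numpy as np
from PIL import Image

def load_jpg(fp, size=(224, 224)):  # target size is a made-up example
    # decode and resize the JPEG; np.load can't do this, it expects .npy
    img = Image.open(fp).convert('RGB').resize(size)
    return np.asarray(img, dtype=np.int32)

# then build the Dask array as before, wrapping the loader in delayed():
# trn = da.stack([da.from_delayed(delayed(load_jpg)(fp), shape=(224, 224, 3),
#                                 dtype=np.int32) for fp in train_p.rglob('*.jpg')])
```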