Issues with get_data, save_array, and load_array

( #1

I am seeing a bizarre issue with the get_data function that we use before calling save_array to persist the pre-processed numpy array to disk.

Here’s the notebook that I am using -

In cell 13, where I call get_data, the memory usage on my P2 instance increases from almost 0 to almost 55GB! The training data on disk is only 550MB (100 times smaller than the memory occupied). After this step, 8/10 times I end up getting OSError: [Errno 12] Cannot allocate memory as you can see in cell 19.

The remaining times when I don’t get the memory error, fit_generator function takes around 600s to run which is the same amount of time it would take to run if I skipped using bcolz. I remember when @jeremy uses load_array to load the preprocessed arrays, and uses fit_generator it takes him only 300s. Why is the universe being unkind to me? :slight_smile:

I have tried calling only load_array, to load the preprocessed arrays from disk, but I still don’t see any improvement. In fact, using get_data and load_array via bcolz seems to be making things worse by occupying way more memory than it needs to.

Anybody else facing the same issue?

(Jeremy Howard) #2

What does ‘du -sh data/dogscats/models/train_data.bc’ show for you? For me, it’s 5.5GB.

It could be just that you don’t have enough RAM to load this array into memory. Because bcolz uses compression, it seems quite possible that the uncompressed data is 55GB. Perhaps you could do a back-of-the-envelope calculation (each float is 32 bits, and you can figure out how many floats are in that array). If this is the case, you’ll need to use batches from disk, or else open the array from disk yourself, which mmap’s the file rather than loading it all.
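As a rough sketch of that back-of-the-envelope calculation: the image count (23,000) and the 3 x 224 x 224 image shape below are assumptions, not figures taken from the notebook, so substitute your own. The helper name array_bytes is made up for illustration.

```python
def array_bytes(n_images, channels=3, height=224, width=224, bytes_per_value=4):
    """Bytes needed to hold the uncompressed image array in RAM."""
    return n_images * channels * height * width * bytes_per_value


n_train = 23000  # assumed training-set size; replace with your own count
size_float32 = array_bytes(n_train)                     # 4 bytes per value
size_float64 = array_bytes(n_train, bytes_per_value=8)  # 8 bytes per value

print("%.1f GiB as float32" % (size_float32 / 2.0 ** 30))
print("%.1f GiB as float64" % (size_float64 / 2.0 ** 30))
```

Comparing the result with the size reported by du gives a feel for how much the compression is hiding. Any temporary copies made while building the array come on top of this figure.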

( #3

train_data.bc is 6.5GB for me.

I will try loading batches from disk, and that will most likely do the trick.

(Shawn) #4

I’m having an issue with get_data as well, or perhaps more specifically np.concatenate is eating around 12GB of RAM for cats-vs-dogs-redux. I used Edge here and closed everything possible; usage went from 1.9GB to ~16GB for 23k files. However, when it crashes you can see that it wasn’t near the maximum. Page file usage is around 30GB. I’m not sure what’s happening here, but I’d like to be able to save my data in stages and append it to the bcolz file to try and fix this issue. Is that possible? Is it normal that I’m crashing while using 16GB of RAM? The valid set takes about 2.5GB of additional RAM for 2k files and does complete successfully, and I am able to save/load it.

TLDR: Is it possible to write get_data in such a way that it uses all of a target batch but saves it in stages?
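One way to save in stages, sketched here with plain NumPy rather than bcolz (so this is a substitute technique, not the course's save_array): preallocate a .npy file on disk and fill it one chunk at a time. The function name save_in_stages, the file name, and the tiny random chunks are all made up for illustration, and the approach assumes the total number of rows is known up front.

```python
import os
import tempfile

import numpy as np


def save_in_stages(chunks, filename, total_rows, row_shape, dtype=np.float32):
    """Write an iterable of array chunks into one .npy file, chunk by chunk."""
    out = np.lib.format.open_memmap(
        filename, mode="w+", dtype=dtype, shape=(total_rows,) + tuple(row_shape))
    start = 0
    for chunk in chunks:
        stop = start + len(chunk)
        out[start:stop] = chunk  # only the current chunk has to sit in RAM
        start = stop
    out.flush()
    del out  # release the memory map
    return start  # number of rows written


# Small random chunks standing in for batches of preprocessed images.
chunks = [np.random.rand(4, 3, 8, 8).astype(np.float32) for _ in range(3)]
path = os.path.join(tempfile.mkdtemp(), "train_data.npy")
written = save_in_stages(chunks, path, total_rows=12, row_shape=(3, 8, 8))

# mmap_mode="r" maps the file instead of reading it all into memory.
reloaded = np.load(path, mmap_mode="r")
```

In real use, chunks would be a generator that produces one batch at a time, so the full array never has to exist in memory, which is the step where np.concatenate appears to run out.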

(Shawn) #5

Still haven’t got past this problem. Can anybody help?


Unfortunately I am having the same issue as @glyph, with no success getting past it.

(Shawn) #7

I bought an extra 16 GB of RAM; we’ll see if that does the trick when it gets here. :stuck_out_tongue:

(Matthijs Jansen) #8

Hi there,

I was struggling with roughly the same issue and found a hidden comment somewhere on Kaggle about an iterator for bcolz carrays. So now, much like with get_batches, you can use bcolz output with fit_generator etc.
If there is interest, I will open a pull request in the course branch.

Usage is like so:

X = + 'train_convlayer_features.bc', mode='r')
y = + 'train_labels.bc', mode='r')
trn_batches = BcolzArrayIterator(X, y, batch_size=X.chunklen * batch_size, shuffle=True)

model.fit_generator(generator=trn_batches, samples_per_epoch=trn_batches.N, nb_epoch=1)

Thanks to R4mon:

This way you avoid having to load the entire dataset into memory when you want to eval the dense layers with pre-computed conv. output.

Hope this helps

Managing large datasets in memory and on disk
(Even Oldridge) #9

I’ve got 12GB of RAM and I’m having trouble even saving the files. When I call save_array after generating the training set, Python just crashes outright. It’s a little frustrating :frowning:. @glyph How did the extra memory work out for you? I’m thinking of upgrading.

(Matthijs Jansen) #10

Hi @Even,

I worked around a problem similar to yours by splitting up the training set into batches that did fit in memory.
Bcolz arrays support appending, so you could merge the final set together again if you’d like.

Of course, having more memory to begin with will help as well, as will increasing your swap area, I believe, if you are on Ubuntu.

If you find a good way of streaming the predicted output to a file, I would be very interested :slight_smile:

Kind regards

(Tuan Nguyen) #11

I tried this. On the last batches it throws an exception, and then Keras complains:

fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose, callbacks, validation_data, nb_val_samples, class_weight, max_q_size, nb_worker, pickle_safe, initial_epoch)
1530 '(x, y, sample_weight) '
1531 'or (x, y). Found: ' +
-> 1532 str(generator_output))
1533 if len(generator_output) == 2:
1534 x, y = generator_output

ValueError: output of generator should be a tuple (x, y, sample_weight) or (x, y). Found: None

Do you have this error?
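One possible cause, offered as a guess rather than a diagnosis: Keras expects the generator passed to fit_generator to loop over its data indefinitely, so a generator that raises or runs out on the final, shorter batch can leave Keras with None instead of a tuple. The sketch below is not the BcolzArrayIterator from post #8. It is a minimal stand-in (the name batch_generator is made up, and there is no shuffling) that handles the short last batch and then starts over.

```python
import numpy as np


def batch_generator(X, y, batch_size):
    """Yield (x, y) batches forever, including the short final batch."""
    n = len(X)
    while True:  # never exhaust: start over after each full pass
        for start in range(0, n, batch_size):
            stop = min(start + batch_size, n)
            # Slicing pulls only this batch into RAM when X and y are
            # disk-backed arrays; np.asarray makes them plain ndarrays.
            yield np.asarray(X[start:stop]), np.asarray(y[start:stop])


# Ten samples with batch_size=4: batch lengths 4, 4, 2, then it wraps around.
X = np.arange(10).reshape(10, 1)
y = np.arange(10)
gen = batch_generator(X, y, batch_size=4)
batches = [next(gen) for _ in range(4)]
lengths = [len(xb) for xb, yb in batches]
```

With the Keras 1 call style used earlier in the thread, this would be passed as model.fit_generator(generator=batch_generator(X, y, 64), samples_per_epoch=len(X), nb_epoch=1).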