Lesson 9: BcolzArrayIterator complaining about chunklen -- how can one change the chunklen of a bcolz carray?

gai · July 16, 2017, 5:08am

Hi,

I am trying to get started with data sets larger than my GPU memory, but one problem I am running into at the moment is that I cannot reduce the batch size, because BcolzArrayIterator complains: batch_size needs to be a multiple of X.chunklen

Is there a way to reduce the chunklen of a bcolz carray?

Thanks!

gai · July 16, 2017, 9:43am

It’s very simple:

arr_hr_16 = bcolz.carray(arr_hr, expectedlen=16)
arr_lr_16 = bcolz.carray(arr_lr, expectedlen=16)

Not sure why arr_lr_16.chunklen returns 4, though

MPJ · July 16, 2017, 11:59am

Hey @gai,

I think you are confusing the expectedlen parameter with the chunklen parameter (both optional)
see: http://bcolz.readthedocs.io/en/latest/reference.html#the-carray-class

The expectedlen parameter is only a hint that helps the bcolz library decide how many objects from your iterable to compress into one chunk, whereas the chunklen parameter sets the number of objects per chunk directly.

For our purposes it is generally a good idea to make the batch size the same as, or a multiple of, the chunklen, since that is what the BcolzArrayIterator error message asks for.
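A minimal sketch of both points, in case it helps. The helper names check_batch_size and rechunk are made up for illustration; the check mirrors the error message quoted at the top of the thread rather than the iterator's actual source, and rechunk assumes bcolz is installed and that the data fits in memory while it is copied.

```python
def check_batch_size(batch_size, chunklen):
    """Hypothetical helper: reproduce the condition named in the error message.

    Returns how many chunks make up one batch when the sizes are compatible.
    """
    if batch_size % chunklen != 0:
        raise ValueError('batch_size needs to be a multiple of X.chunklen')
    return batch_size // chunklen


def rechunk(arr, chunklen, rootdir=None):
    """Hypothetical helper: copy `arr` into a new carray with an explicit chunklen.

    bcolz is imported lazily so that check_batch_size works without it.
    """
    import bcolz

    kwargs = {'chunklen': chunklen}
    if rootdir is not None:
        kwargs['rootdir'] = rootdir  # store the new carray on disk
    new_arr = bcolz.carray(arr, **kwargs)
    if rootdir is not None:
        new_arr.flush()  # write any buffered data to disk
    return new_arr


# A batch size of 16 works with chunklen 4 (four chunks per batch);
# a batch size of 6 does not, which matches the complaint in the first post.
print(check_batch_size(16, 4))
```

After something like arr_lr_16 = rechunk(arr_lr, 16), arr_lr_16.chunklen should report 16, which is worth confirming before building the iterator.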

Good luck!

Borz · September 2, 2017, 7:30pm

I get an error when trying to train on carrays with a resized chunk length. For some reason the computer’s not seeing the same number of inputs and labels, even though the low- and hi-res arrays contain the same number of elements: (19439, 72, 72, 3) and (19439, 288, 288, 3).

Going along with the code in neural-sr, I change arr_lr (& _hr) to arr_lr_c6 (specifying a chunklen of 6) via:

arr_lr_c6 = bcolz.carray(arr_lr, chunklen=6, rootdir=path+'trn_resized_72_c6.bc')
arr_lr_c6.flush()

and updating all mentions of arr_lr to arr_lr_c6. When I get to running train(..) I run:

%time train(6, 3240)

I get:

ValueError

–traceback–

ValueError: Input arrays should have the same number of samples as target arrays. Found 5 input samples and 6 target samples.

This is triggered in the traceback:
/home/wnixalo/miniconda3/envs/FAI3/lib/python3.6/site-packages/keras/engine/training.py in check_array_lengths(inputs, targets, weights)
    191 'the same number of samples as target arrays. '
    192 'Found ' + str(list(set_x)[0]) + ' input samples '
--> 193 'and ' + str(list(set_y)[0]) + ' target samples.')
    194 if list(set_x)[0] != list(set_w)[0]:
    195 raise ValueError('Sample_weight arrays should have '

Slight stream of consciousness:

So, the number of elements of arr_lr_c6 and arr_hr_c6 are the same (19439), the chunklen is 6 for both… I’m wondering if it could be the number of iterations? J.Howard has it set to 18,000, but I don’t know what trn_resized_72_r.bc (with the _r suffix) really contains: it is larger than trn_resized_72.bc:

trn_resized_72.bc contains 19,439 images, but trn_resized_72_r.bc is at least 62277 * 16 = 996,432

Not knowing how 18,000 iterations fits into this, I tried 19439//6 + 1 = 3240 and got the error (the thought being: one iteration for each full multiple of the batch size, plus one more for the remainder), so now I’m running train(6, 3239), and we’ll see how that goes…
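The arithmetic behind the two iteration counts can be checked with a quick sketch. The function name batch_plan is made up for illustration, and it assumes a single pass over the array in order; it says nothing about how BcolzArrayIterator actually orders or wraps around the data.

```python
def batch_plan(n_samples, batch_size):
    """Hypothetical helper: split n_samples into full batches plus a remainder.

    Returns (full_batches, leftover_samples, total_iterations_for_one_pass).
    """
    full_batches, leftover = divmod(n_samples, batch_size)
    total_iterations = full_batches + (1 if leftover else 0)
    return full_batches, leftover, total_iterations


full, leftover, total = batch_plan(19439, 6)
print(full, leftover, total)  # 3239 full batches, 5 samples left over, 3240 iterations
```

So iteration 3240 would be a partial batch of 5 samples, which is consistent with the "Found 5 input samples and 6 target samples" message, and stopping at 3239 leaves those 5 samples out of that pass.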

Update:

Yes, it finally worked.

>>> %time train(6, 3239)

out:
CPU times: user 38min 23s, sys: 28min 45s, total: 1h 7min 8s
Wall time: 1h 37min 6s

But… am I just throwing away the last 5 or 6 images in the array when I do this?