How to use bcolz.carray when data set is too large to load into memory?

I'm looking at the function Jeremy used, but it seems you need to load the entire dataset into memory?

def save_array(fname, arr):
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

If I run this code, it seems to take up all of my RAM and crash my computer:

train_data = np.concatenate([next(x)[0] for _ in range(int(np.ceil(total_samples/r_batch_size)))])
save_array(os.path.join(save_dir, "train_data"), train_data)

Any tips as to how to save it without loading everything into memory?

Thank you!

You can create an empty bcolz array and append to it:

# assuming you have images of 128x128:
bcolz_array = bcolz.carray(np.zeros([0,3,128,128], dtype=np.float32), mode='w', rootdir=path)
for x in your_data:
    bcolz_array.append(x)

bcolz_array.flush()
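
Later, to read it back without pulling everything into RAM, you can open the saved carray lazily and slice out batches as you need them (just a sketch; the batch size here is arbitrary):

import bcolz

# nothing is loaded until you slice the opened carray
carr = bcolz.open(rootdir=path, mode='r')

batch_size = 64
for start in range(0, len(carr), batch_size):
    batch = carr[start:start + batch_size]  # only this slice is read into memory
    # ... feed `batch` to your model here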

@renato

Thanks, I'm testing it out right now. Is there a reason why the channels come first, or is that just a preference due to using the Theano backend? Thank you.

EDIT:

Did some testing, and I'm still running into a memory error.

def save_array(fname, generator_array, batches, data_type="data"):
    if data_type not in ["data", "labels"]:
        raise ValueError("data_type must be 'data' or 'labels'")

    if data_type == "data":
        # empty on-disk array for image batches (channels last)
        bcolz_array = bcolz.carray(np.zeros([0, img_width, img_height, 3], dtype=np.float32), mode='w', rootdir=fname)
    else:
        # empty on-disk array for label batches (one column per class)
        bcolz_array = bcolz.carray(np.zeros([0, len(labels)], dtype=np.float32), mode='w', rootdir=fname)

    # each generator batch is a (data, labels) tuple
    data_dict = {"data": 0, "labels": 1}

    for _ in range(batches):
        bcolz_array.append(next(generator_array)[data_dict[data_type]])
    bcolz_array.flush()

Oh yeah, I forgot to specify the chunklen; this should work :slight_smile::

bcolz_array = bcolz.carray(np.zeros([0, img_width, img_height, 3], dtype=np.float32), chunklen=1, mode='w', rootdir=fname)
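You could then call it along these lines (just a sketch; train_gen is a placeholder for your generator, and total_samples, r_batch_size, and save_dir are the variables from your earlier snippet):

import os
import numpy as np

# train_gen: placeholder name for your generator yielding (data, labels) batches
n_batches = int(np.ceil(total_samples / r_batch_size))
save_array(os.path.join(save_dir, "train_data"), train_gen, n_batches, data_type="data")
# note: a second call for labels would advance the generator further, so you'd
# reset or recreate it first to keep data and labels aligned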

Also, I put the channels first because that's what PyTorch expects.
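
If your generator gives you channels-last batches and you want to store them channels-first instead, you can transpose each batch before appending it (a small sketch; batch stands for one array coming out of your generator, and the carray would need the matching [0, 3, h, w] shape):

import numpy as np

# batch: shape (n, height, width, channels) from a channels-last generator
batch_nchw = np.transpose(batch, (0, 3, 1, 2))  # -> (n, channels, height, width)
bcolz_array.append(batch_nchw)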

@renato Thank you. I will give it a go tomorrow and let you know. Thank you so much =)

Hi Moondra,

Can you please post an update? I am also having trouble with memory. My dataset is about 66 GB, while my RAM is only 64 GB.

Actually, I never got back to it, as I was unfreezing layers often.