How to use bcolz.carray when data set is too large to load into memory?

I'm looking at the function Jeremy used, but it seems you need to load the entire dataset into memory?

def save_array(fname, arr):
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()

If I run this code, it seems to take up all of my RAM and crash my computer:

train_data = np.concatenate([next(x)[0] for _ in range(int(np.ceil(total_samples/r_batch_size)))])
save_array(os.path.join(save_dir, "train_data"), train_data)

Any tips as to how to save it without loading everything into memory?

Thank you!

You can create an empty bcolz array and append to it:

# assuming you have images of 128x128:
bcolz_array = bcolz.carray(np.zeros([0,3,128,128], dtype=np.float32), mode='w', rootdir=path)
for x in your_data:
    bcolz_array.append(x)

bcolz_array.flush()
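
Later, to read it back without pulling everything into RAM, you can open the saved carray lazily and slice out batches as you need them (just a sketch; the batch size here is arbitrary):

import bcolz

# nothing is loaded until you slice the opened carray
carr = bcolz.open(rootdir=path, mode='r')

batch_size = 64
for start in range(0, len(carr), batch_size):
    batch = carr[start:start + batch_size]  # only this slice is read into memory
    # ... feed `batch` to your model here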

@renato

Thanks, I'm testing it out right now. Is there a reason why the channels come first, or is that just a preference due to using the Theano backend? Thank you.

EDIT:

Did some testing, and I'm still running into a memory error.

def save_array(fname, generator_array, batches, data_type="data"):
    if data_type not in ["data", "labels"]:
        raise ValueError("data_type must be 'data' or 'labels'")

    if data_type == "data":
        # empty on-disk array for image batches (channels last)
        bcolz_array = bcolz.carray(np.zeros([0, img_width, img_height, 3], dtype=np.float32), mode='w', rootdir=fname)
    else:
        # empty on-disk array for label batches (one column per class)
        bcolz_array = bcolz.carray(np.zeros([0, len(labels)], dtype=np.float32), mode='w', rootdir=fname)

    # each generator batch is a (data, labels) tuple
    data_dict = {"data": 0, "labels": 1}

    for _ in range(batches):
        bcolz_array.append(next(generator_array)[data_dict[data_type]])
    bcolz_array.flush()

Oh yeah, I forgot to specify the chunklen; this should work :slight_smile::

bcolz_array = bcolz.carray(np.zeros([0, img_width, img_height, 3], dtype=np.float32), chunklen=1, mode='w', rootdir=fname)
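You could then call it along these lines (just a sketch; train_gen is a placeholder for your generator, and total_samples, r_batch_size, and save_dir are the variables from your earlier snippet):

import os
import numpy as np

# train_gen: placeholder name for your generator yielding (data, labels) batches
n_batches = int(np.ceil(total_samples / r_batch_size))
save_array(os.path.join(save_dir, "train_data"), train_gen, n_batches, data_type="data")
# note: a second call for labels would advance the generator further, so you'd
# reset or recreate it first to keep data and labels aligned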

Also, I put the channels first because that's what PyTorch expects.
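
If your generator gives you channels-last batches and you want to store them channels-first instead, you can transpose each batch before appending it (a small sketch; batch stands for one array coming out of your generator, and the carray would need the matching [0, 3, h, w] shape):

import numpy as np

# batch: shape (n, height, width, channels) from a channels-last generator
batch_nchw = np.transpose(batch, (0, 3, 1, 2))  # -> (n, channels, height, width)
bcolz_array.append(batch_nchw)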

@renato Thank you. I will give it a go tomorrow and let you know. Thank you so much =)

Hi Moondra,

Can you please post an update? I am also having trouble with memory. My dataset is about 66 GB, while my RAM is only 64 GB.

Actually, I never got back to it, as I was unfreezing layers often.