How to use bcolz.carray when data set is too large to load into memory?


I'm looking at the function Jeremy used, but it seems you need to load the entire dataset into memory?

def save_array(fname, arr):
    c = bcolz.carray(arr, rootdir=fname, mode='w')
    c.flush()  # write any buffered data to disk

If I run this code, it uses up all of my RAM and crashes my computer:

train_data = np.concatenate([next(x)[0] for _ in range(int(np.ceil(total_samples/r_batch_size)))])
save_array(os.path.join(save_dir, "train_data"), train_data)
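
For a rough sense of why this runs out of RAM, here is a back-of-the-envelope estimate. The sample count and image size are hypothetical (they are not given in the post), and `array_nbytes` is just an illustrative helper. The key point is that `np.concatenate` allocates a new array while the list of batches is still alive, so the peak is roughly twice the size of the final array.

```python
import math


def array_nbytes(n_samples, sample_shape, itemsize=4):
    """Bytes needed for n_samples arrays of sample_shape (float32 = 4 bytes per value)."""
    return n_samples * math.prod(sample_shape) * itemsize


# Hypothetical numbers, not taken from the post above.
total = array_nbytes(20_000, (224, 224, 3))
print(f"concatenated array: {total / 2**30:.1f} GiB")

# The list of batches and the concatenated copy exist at the same time,
# so peak usage is roughly double the size of the final array.
print(f"approximate peak:   {2 * total / 2**30:.1f} GiB")
```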

Any tips as to how to save it without loading everything into memory?

Thank you!

(Renato Hermoza) #2

You can create an empty bcolz array and append to it:

# assuming you have images of 128x128:
bcolz_array = bcolz.carray(np.zeros([0, 3, 128, 128], dtype=np.float32), mode='w', rootdir=path)
for x in your_data:
    bcolz_array.append(x)
bcolz_array.flush()




Thanks, testing it out right now. Is there a reason why the channels come first, or is that just a preference from using the Theano backend? Thank you.


Did some testing, and I'm still running into a memory error.

def save_array(fname, generator_array, batches, data_type="data"):
    if data_type not in ["data", "labels"]:
        raise ValueError("data_type must be 'data' or 'labels'")

    data_dict = {"data": 0, "labels": 1}

    if data_type == "data":
        bcolz_array = bcolz.carray(np.zeros([0, img_width, img_height, 3], dtype=np.float32), mode='w', rootdir=fname)
    else:
        bcolz_array = bcolz.carray(np.zeros([0, len(labels)], dtype=np.float32), mode='w', rootdir=fname)

    for i in range(batches):

(Renato Hermoza) #5

Oh yeah, I forgot to specify the chunklen. This should work 🙂:
bcolz_array = bcolz.carray(np.zeros([0,img_width, img_height,3], dtype=np.float32), chunklen=1, mode='w', rootdir=fname)

Also, I put the channels first because that’s what PyTorch expects.
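
If your batches come out channels-last, as in the `[0, img_width, img_height, 3]` shape used earlier in this thread, and you want channels-first on disk, a transpose before appending is one way to do it. This is only a sketch using NumPy; the function names are made up for the example.

```python
import numpy as np


def to_channels_first(batch):
    """Reorder a batch from (N, H, W, C) to (N, C, H, W)."""
    return np.ascontiguousarray(np.transpose(batch, (0, 3, 1, 2)))


def to_channels_last(batch):
    """Reorder a batch from (N, C, H, W) back to (N, H, W, C)."""
    return np.ascontiguousarray(np.transpose(batch, (0, 2, 3, 1)))


if __name__ == "__main__":
    batch = np.zeros((4, 128, 128, 3), dtype=np.float32)
    print(to_channels_first(batch).shape)
```

Whichever layout you pick, the zero-length array used to create the carray has to use the same row shape as the batches you append.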


@renato Thank you. I will give it a go tomorrow and let you know. Thank you so much =)