I have been trying to create a DataBunch from a large text corpus to train a language model. When I attempt to save the data bunch, it crashes due to memory issues. After some digging, I found this GitHub thread showing that the issue comes from using pickle to save:
The suggested workaround is to save data.train_ds.x.items, data.train_ds.y.items, data.valid_ds.x.items, data.valid_ds.y.items, and data.train_ds.x.vocab individually. Since these are NumPy arrays, they can be saved nicely in a compressed npz file. It looks like the only way to retain the tokenized text is

text_list = [data.train_ds.x[i].text for i in range(len(data.train_ds.x))]

and to save this as a NumPy array.
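For reference, here is a minimal sketch of the save/reload step I have in mind. The placeholder arrays below just stand in for the data.train_ds.x.items etc. pulled from the real DataBunch, and the filename databunch_items.npz is my own choice; note that object arrays (strings) go through pickle internally, so allow_pickle=True is needed when loading with recent NumPy versions:

```python
import numpy as np

# Placeholders standing in for the arrays pulled from the DataBunch
# (data.train_ds.x.items, data.train_ds.y.items, etc.).
train_x = np.array(["xxbos hello world", "xxbos another doc"], dtype=object)
train_y = np.array([0, 1])
valid_x = np.array(["xxbos a validation doc"], dtype=object)
valid_y = np.array([1])
vocab_itos = np.array(["xxunk", "xxpad", "hello", "world"], dtype=object)

# Save everything into one compressed npz file.
np.savez_compressed(
    "databunch_items.npz",
    train_x=train_x, train_y=train_y,
    valid_x=valid_x, valid_y=valid_y,
    vocab=vocab_itos,
)

# Reload: object (string) arrays require allow_pickle=True on load.
loaded = np.load("databunch_items.npz", allow_pickle=True)
print(loaded["train_x"][0])
```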
Unfortunately, I am at a loss as to how to construct a DataBunch after loading these arrays from my saved npz file. Could someone point me in the right direction?
@sgugger - looping you in since you responded to the original GitHub thread.