I have been trying to create a DataBunch from a large text corpus to train a language model. When I attempt to save the data bunch, it crashes due to memory issues. After some digging, I found this GitHub thread showing that the issue comes from using pickle to save:
The suggested workaround is to save data.train_ds.x.items, data.train_ds.y.items, data.valid_ds.x.items, data.valid_ds.y.items, and data.train_ds.x.vocab individually. Since these are NumPy arrays, they can be saved nicely in a compressed npz file. It looks like the only way to retain the tokenized text is

text_list = [data.train_ds.x[i].text for i in range(len(data.train_ds.x))]

and to save this as a NumPy array.
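For reference, here is a minimal sketch of the save/reload step I have in mind. The placeholder arrays below just stand in for the data.train_ds.x.items etc. pulled from the real DataBunch, and the filename databunch_items.npz is my own choice; note that object arrays (strings) go through pickle internally, so allow_pickle=True is needed when loading with recent NumPy versions:

```python
import numpy as np

# Placeholders standing in for the arrays pulled from the DataBunch
# (data.train_ds.x.items, data.train_ds.y.items, etc.).
train_x = np.array(["xxbos hello world", "xxbos another doc"], dtype=object)
train_y = np.array([0, 1])
valid_x = np.array(["xxbos a validation doc"], dtype=object)
valid_y = np.array([1])
vocab_itos = np.array(["xxunk", "xxpad", "hello", "world"], dtype=object)

# Save everything into one compressed npz file.
np.savez_compressed(
    "databunch_items.npz",
    train_x=train_x, train_y=train_y,
    valid_x=valid_x, valid_y=valid_y,
    vocab=vocab_itos,
)

# Reload: object (string) arrays require allow_pickle=True on load.
loaded = np.load("databunch_items.npz", allow_pickle=True)
print(loaded["train_x"][0])
```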
Unfortunately, I am at a loss as to how to construct a DataBunch after loading these arrays from my saved npz file. Could someone point me in the right direction?
@sgugger - looping you in since you responded to the original GitHub thread.