jgtjerry
(Jerry George Thomas)
February 26, 2019, 4:55pm
1
I have a dataset with 45 million rows of data and three GPUs with 6 GB of RAM each. I am trying to train a language model on the data.
For that, I am trying to load the data as a fastai DataBunch, but this step always fails because of a memory issue.
data_lm = TextLMDataBunch.from_df('./', train_df=df_trn,
                                  valid_df=df_val, bs=10)
How do I handle this issue?
Kaspar
(Kaspar Lund)
February 26, 2019, 5:57pm
2
TextLMDataBunch copies the data multiple times and therefore runs out of memory.
I did the tokenization myself before calling TextLMDataBunch.from_ids with the generated tokens. That way it can handle 1e9+ tokens.
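The idea above can be sketched without fastai at all: tokenize and numericalize the text yourself, producing one compact integer array per row, so the raw strings never have to live in memory alongside multiple copies. This is a minimal sketch with a toy whitespace tokenizer (a stand-in for fastai's real Tokenizer); the resulting id arrays plus the vocabulary are what you would then hand to TextLMDataBunch.from_ids.

```python
from collections import Counter

import numpy as np

def tokenize(text):
    # Stand-in for a real tokenizer (e.g. fastai's Tokenizer / spaCy).
    return text.lower().split()

texts = ["the cat sat", "the dog sat on the mat"]

# Pass 1: count token frequencies to build the vocabulary.
freq = Counter(tok for t in texts for tok in tokenize(t))
itos = ["xxunk"] + [tok for tok, _ in freq.most_common()]  # id -> token
stoi = {tok: i for i, tok in enumerate(itos)}              # token -> id

# Pass 2: numericalize each text into a compact integer array
# (uint16 is enough while the vocabulary stays below 65536 tokens).
ids = [np.array([stoi.get(tok, 0) for tok in tokenize(t)], dtype=np.uint16)
       for t in texts]

print(ids[1])  # token ids for "the dog sat on the mat"
```

For 45 million rows you would run both passes in chunks (e.g. with pandas' chunksize) and write the id arrays to disk as you go, rather than building `texts` as one big list as shown here.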
jgtjerry
(Jerry George Thomas)
February 26, 2019, 6:39pm
3
Got it. I have one more query: in what format should I submit the tokens? Is there a reference? Thank you.
Kaspar
(Kaspar Lund)
February 26, 2019, 6:53pm
4
A ragged array (an array of arrays of np.int64 or np.uint16, depending on the size of your vocabulary):
tokens in sentence 1 …
tokens in sentence 2 …
tokens in sentence 3 …
…
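A minimal sketch of that ragged-array layout in NumPy (the id values here are made up for illustration): one integer array per sentence, with the dtype picked from the vocabulary size, since np.uint16 only holds ids 0..65535.

```python
import numpy as np

vocab_size = 30_000
# 2 bytes per token id instead of 8 makes a big difference at 1e9+ tokens.
dtype = np.uint16 if vocab_size < 2**16 else np.int64

tokens = [
    np.array([2, 15, 7, 9], dtype=dtype),     # tokens in sentence 1
    np.array([2, 44, 7], dtype=dtype),        # tokens in sentence 2
    np.array([3, 8, 120, 6, 5], dtype=dtype), # tokens in sentence 3
]

# Rows have different lengths, so store them as an object array
# (an array of arrays) rather than a rectangular 2-D array.
ragged = np.array(tokens, dtype=object)
print([len(row) for row in ragged])  # per-sentence lengths
```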