jgtjerry
(Jerry George Thomas)
February 26, 2019, 4:55pm
1
I have a dataset with 45 million rows of data and three GPUs with 6 GB of RAM each. I am trying to train a language model on the data.
For that, I am trying to load the data as a fastai DataBunch, but this step always fails because of a memory issue.
data_lm = TextLMDataBunch.from_df('./', train_df=df_trn,
                                  valid_df=df_val, bs=10)
How do I handle this issue?
Kaspar
(Kaspar Lund)
February 26, 2019, 5:57pm
2
TextLMDataBunch copies the data multiple times and therefore runs out of memory.
I did the tokenization myself before calling TextLMDataBunch.from_ids with the generated tokens. That way it can handle 1e9+ tokens.
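The idea above can be sketched without fastai at all: tokenize and numericalize the text yourself, producing one compact integer array per row, so the raw strings never have to live in memory alongside multiple copies. This is a minimal sketch with a toy whitespace tokenizer (a stand-in for fastai's real Tokenizer); the resulting id arrays plus the vocabulary are what you would then hand to TextLMDataBunch.from_ids.

```python
from collections import Counter

import numpy as np

def tokenize(text):
    # Stand-in for a real tokenizer (e.g. fastai's Tokenizer / spaCy).
    return text.lower().split()

texts = ["the cat sat", "the dog sat on the mat"]

# Pass 1: count token frequencies to build the vocabulary.
freq = Counter(tok for t in texts for tok in tokenize(t))
itos = ["xxunk"] + [tok for tok, _ in freq.most_common()]  # id -> token
stoi = {tok: i for i, tok in enumerate(itos)}              # token -> id

# Pass 2: numericalize each text into a compact integer array
# (uint16 is enough while the vocabulary stays below 65536 tokens).
ids = [np.array([stoi.get(tok, 0) for tok in tokenize(t)], dtype=np.uint16)
       for t in texts]

print(ids[1])  # token ids for "the dog sat on the mat"
```

For 45 million rows you would run both passes in chunks (e.g. with pandas' chunksize) and write the id arrays to disk as you go, rather than building `texts` as one big list as shown here.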
jgtjerry
(Jerry George Thomas)
February 26, 2019, 6:39pm
3
Got it. I have one more query: in what format should I submit the tokens? Is there a reference? Thank you.
Kaspar
(Kaspar Lund)
February 26, 2019, 6:53pm
4
A ragged array (an array of arrays of np.int64 or np.uint16, depending on the size of your vocabulary):
tokens in sentence 1 …
tokens in sentence 2 …
tokens in sentence 3 …
…
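A minimal sketch of that ragged-array layout in NumPy (the id values here are made up for illustration): one integer array per sentence, with the dtype picked from the vocabulary size, since np.uint16 only holds ids 0..65535.

```python
import numpy as np

vocab_size = 30_000
# 2 bytes per token id instead of 8 makes a big difference at 1e9+ tokens.
dtype = np.uint16 if vocab_size < 2**16 else np.int64

tokens = [
    np.array([2, 15, 7, 9], dtype=dtype),     # tokens in sentence 1
    np.array([2, 44, 7], dtype=dtype),        # tokens in sentence 2
    np.array([3, 8, 120, 6, 5], dtype=dtype), # tokens in sentence 3
]

# Rows have different lengths, so store them as an object array
# (an array of arrays) rather than a rectangular 2-D array.
ragged = np.array(tokens, dtype=object)
print([len(row) for row in ragged])  # per-sentence lengths
```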