Hi all! This question relates to the Fastai NLP course, particularly machine translation. Fastai's seq2seq models, as showcased in this notebook and this notebook, are very good and train very fast, but I'd like to try them on a pretty large dataset (20+ million sentences). However, these models start by loading the dataset into a pandas DataFrame, in the code here:
```python
src = (Seq2SeqTextList
       .from_df(df, path=path, cols='zh', processor=SPProcessor())
       .split_by_rand_pct(seed=42)
       .label_from_df(cols='en', label_cls=TextList))
```
I run out of memory when I try to do this. There is another option, `from_csv`, but that one runs out of memory for me too, and it looks like that function just reads the CSV into a DataFrame anyway. Is there a lazy-loading option, so the full dataset never has to be loaded into memory at once?
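For context, the closest workaround I've found is streaming the CSV in chunks with pandas' `chunksize` argument, so only one slice sits in memory at a time. Here's a minimal sketch (the tiny in-memory CSV and chunk size are just placeholders for illustration) but I don't see how to feed chunks like these into `Seq2SeqTextList`:

```python
import io
import pandas as pd

# Placeholder standing in for the real multi-GB parallel corpus on disk.
csv_data = "zh,en\n你好,hello\n谢谢,thanks\n再见,goodbye\n"

total_rows = 0
# With chunksize set, read_csv returns an iterator of DataFrames
# instead of materializing the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2):
    # Each chunk could be preprocessed/tokenized here, then discarded.
    total_rows += len(chunk)

print(total_rows)  # 3
```

This keeps peak memory bounded by the chunk size, but the fastai data block API seems to want the whole DataFrame up front.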