I'm trying to build a summarizer using the CNN/DM dataset (https://github.com/abisee/cnn-dailymail), but I'm running into RAM issues when converting it into a databunch.
I've cleaned the dataset and converted it into a dataframe that is 1.2 GB when pickled: ~230k rows with a ['text', 'label'] ([article, summary]) structure, where each row contains roughly 2k tokens in total.
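For scale, here's a rough back-of-envelope on the numericalized ids alone (assuming 8-byte int64 ids, which is just my estimate, not something I measured):

```python
# Rough memory estimate for the numericalized dataset alone.
# Assumptions: ~230k rows, ~2k tokens per row, 8 bytes per token id (int64).
rows = 230_000
tokens_per_row = 2_000
bytes_per_id = 8  # int64

total_bytes = rows * tokens_per_row * bytes_per_id
print(f"{total_bytes / 2**30:.1f} GiB")  # ids alone, before tokenization overhead
```

And that's before counting the raw text, the tokenized intermediate copies, and whatever the databunch keeps around while saving, so I can see why 8 GB isn't enough. I just don't know how to work around it.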
I'm using the same techniques as shown here: https://github.com/fastai/course-nlp/blob/master/7-seq2seq-translation.ipynb
First I tried it on an 8 GB machine, which couldn't even create a databunch from 100k rows of the dataset. A 16 GB machine handled the 100k databunch, but on the full dataset it runs out of RAM while saving the databunch to disk.
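One workaround I've been considering is splitting the cleaned dataframe into smaller pieces and pickling each one, so that each piece can be turned into a databunch separately. A minimal sketch of the splitting step (the `df` contents, `n_chunks`, and file names here are placeholders, not my actual code):

```python
import numpy as np
import pandas as pd

# Sketch: split the cleaned dataframe row-wise into smaller pieces and
# pickle each one, to build/save a databunch per piece instead of all at once.
# (The toy df, n_chunks, and file names are placeholders.)
df = pd.DataFrame({"text": [f"article {i}" for i in range(10)],
                   "label": [f"summary {i}" for i in range(10)]})

n_chunks = 4
for i, chunk in enumerate(np.array_split(df, n_chunks)):
    chunk.to_pickle(f"cnndm_part{i}.pkl")  # reload later with pd.read_pickle
```

But I'm not sure how I'd merge the resulting databunches (or their vocabularies) back together afterwards, which is what led me to `from_ids` below.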
I spent a day trying to get https://docs.fast.ai/text.data.html#TextDataBunch.from_ids to work, but I couldn’t really tie it all together properly.
```python
from fastai.text import *
from lorkin_funcs import *  # package containing the Seq2SeqTextList class, among other things

SEED = 1337
defaults.cpus = 4  # to avoid broken-pool issues

# All chained in one statement to keep unnecessary intermediate variables to a minimum
Seq2SeqTextList.from_df(pd.read_pickle("cnndm_cleaned_len20-90.pkl"), path="", cols='text')\
    .split_by_rand_pct(seed=SEED)\
    .label_from_df(cols='label', label_cls=TextList)\
    .databunch()\
    .save("cnndm_full")
```