Need help creating a CNN/DM summary databunch

Hello!

I’m trying to build a summarizer that will use the CNN/DM dataset (https://github.com/abisee/cnn-dailymail), but I’m running into RAM issues when actually converting it into a databunch.

I’ve cleaned and converted the dataset into a dataframe which is 1.2 GB when pickled, ~230k rows with a ['text', 'label'] ([article, summary]) structure, where each row contains roughly 2k tokens in total.

I’m using the same techniques as shown here: https://github.com/fastai/course-nlp/blob/master/7-seq2seq-translation.ipynb

First I tried it on an 8 GB machine, but that couldn’t even handle creating a databunch from 100k rows of the dataset. I then tried another machine with 16 GB, which managed a 100k-row databunch, but on the full dataset it runs out of RAM while saving the databunch to file.

I spent a day trying to get https://docs.fast.ai/text.data.html#TextDataBunch.from_ids to work, but I couldn’t really tie it all together properly.
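
Roughly, this is the kind of pipeline I was attempting (just a sketch; the vocab size, min_freq and the 80/20 split are arbitrary, and the label side is exactly where I got stuck, since train_lbls/valid_lbls don’t seem meant for sequences of ids):

from fastai.text import *

df = pd.read_pickle("cnndm_cleaned_len20-90.pkl")

# Tokenize articles and summaries up front
tokenizer = Tokenizer()
tok_text = tokenizer.process_all(df['text'].tolist())
tok_label = tokenizer.process_all(df['label'].tolist())

# Build one shared vocab from the article tokens, then numericalize both columns
vocab = Vocab.create(tok_text, max_vocab=60000, min_freq=2)
text_ids = [vocab.numericalize(t) for t in tok_text]
label_ids = [vocab.numericalize(t) for t in tok_label]

# 80/20 split, then hand the ids over; this is the part I couldn't tie together
cut = int(0.8 * len(text_ids))
data = TextDataBunch.from_ids(path="", vocab=vocab,
                              train_ids=text_ids[:cut], valid_ids=text_ids[cut:],
                              train_lbls=label_ids[:cut], valid_lbls=label_ids[cut:])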

Code:

from fastai.text import *
from lorkin_funcs import *  # Package containing the Seq2SeqTextList class, among other things

SEED = 1337
defaults.cpus = 4  # To avoid broken pool issues

Seq2SeqTextList.from_df(pd.read_pickle("cnndm_cleaned_len20-90.pkl"), path="", cols='text')\
    .split_by_rand_pct(seed=SEED)\
    .label_from_df(cols='label', label_cls=TextList)\
    .databunch()\
    .save("cnndm_full")  # All chained in one expression to keep unnecessary variables to a minimum

Thanks!