Need help creating a CNN/DM summary databunch


I’m trying to build a summarizer using the CNN/DM dataset, but I’m running into RAM issues when actually converting it into a databunch.

I’ve cleaned and converted the dataset into a dataframe that is 1.2 GB when pickled: ~230k rows with a [‘text’, ‘label’] ([article, summary]) structure, where each row contains roughly 2k tokens in total.
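Note that the pickle size understates the in-RAM footprint, since pandas object columns of Python strings take more memory live than serialized. A minimal sketch of how to measure this (using a tiny synthetic dataframe as a stand-in for the real one, since the actual file is too big to share):

```python
import pandas as pd

# Tiny synthetic stand-in for the real [text, label] dataframe
df = pd.DataFrame({
    "text": ["an article body " * 100] * 1000,
    "label": ["a short summary " * 10] * 1000,
})

# deep=True follows the Python string objects inside object columns,
# so this reflects actual RAM use rather than just 8-byte pointers
bytes_in_ram = df.memory_usage(deep=True).sum()
print(bytes_in_ram / 1e6, "MB")
```

On my real dataframe the deep number comes out considerably larger than the 1.2 GB pickle, which is part of why 8 GB isn’t enough.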

I’m using the same techniques as shown here:

First I tried it on an 8 GB machine, but that couldn’t even handle creating a databunch from 100k rows of the dataset. I then tried a machine with 16 GB, which could build a 100k-row databunch, but on the full dataset it runs out of RAM when saving the databunch to file.

I spent a day trying to get it to work, but I couldn’t really tie it all together properly.


from fastai.text import *
from lorkin_funcs import * #Package containing the Seq2SeqTextList class among other things

SEED = 1337
defaults.cpus=4 #To avoid broken pool issues

Seq2SeqTextList.from_df(pd.read_pickle("cnndm_cleaned_len20-90.pkl"), path = "", cols='text')\
.label_from_df(cols='label', label_cls=TextList)\
.save("cnndm_full") #All done on one line to keep unnecessary variables to a minimum
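One workaround I’ve been considering is to process the dataframe in row slices and build/save a databunch per slice, so peak RAM stays bounded. This is only a sketch of the slicing part (the chunk size of 4 and the helper name `iter_chunks` are made up, and I haven’t verified whether fastai databunches saved in parts can be recombined cleanly afterwards):

```python
import pandas as pd

def iter_chunks(df, chunk_size):
    """Yield successive row slices of df, each at most chunk_size rows."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

# Synthetic stand-in for the real dataframe
df = pd.DataFrame({"text": [f"article {i}" for i in range(10)],
                   "label": [f"summary {i}" for i in range(10)]})

parts = list(iter_chunks(df, 4))
print([len(p) for p in parts])  # → [4, 4, 2]
```

Each slice could then be passed to `Seq2SeqTextList.from_df(...)` and saved under its own name (e.g. "cnndm_part0", "cnndm_part1", …), but I’d appreciate advice on whether that actually plays well with the rest of the fastai pipeline.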