Hi!
I’m getting to know fast.ai, and so far I love it! Really great people, great course, and great API!
I’m trying to train an NLP regression model (or classification; the issue is the same) on Google Colab (paid edition), but no matter what I try, it crashes at the .dataloaders or .summary step because it runs out of working memory. I’ve been stuck on this issue for weeks now, read everything I could find, and tweaked as much as I could, but nothing helped. So, as a last resort, I’m asking for help here; maybe you can spot something I couldn’t! Here is my code, as well as the notebook itself:
the notebook
df.head()
is_valid squishedDifference reviewText
True 1.000000 The case pictured is a soft violet color, but the case cover I received was a dark purple. While I'm sure the quality of the product is fine, the color is very different.
True 1.000000 case fits perfectly and I always gets compliments on it its hasn't cracked when I dropped it. wonderful and protective.
False 1.000000 Best phone case ever . Everywhere I go I get a ton of compliments on it. It was in perfect condition as well.
False 0.333333 It may look cute. This case started off pretty good the first couple of weeks. Then it started to slide off. It slid off one day I was in a parking lot phone fell face down and my glass shattered. TERRIBLE CASE!!!!!
False 0.200000 ITEM NOT SENT from Blue Top Company in Hong Kong and it's been over two months! I will report this. DO NOT use this company. Not happy at all!
dls_nlp.one_batch()
(LMTensorText([[ 2, 8, 10, ..., 4226, 61, 4369],
[ 30, 66, 222, ..., 31, 16, 182],
[ 9, 8, 574, ..., 17, 35, 43],
...,
[ 24, 18, 21, ..., 20, 423, 20],
[ 296, 22, 10, ..., 789, 16, 127],
[1171, 15, 60, ..., 43, 110, 23]], device='cuda:0'),
TensorText([[ 8, 10, 246, ..., 61, 4369, 14],
[ 66, 222, 13, ..., 16, 182, 107],
[ 8, 574, 353, ..., 35, 43, 9],
...,
[ 18, 21, 17, ..., 423, 20, 13947],
[ 22, 10, 138, ..., 16, 127, 501],
[ 15, 60, 10, ..., 110, 23, 141]], device='cuda:0'))
dls_nlp.vocab
['xxunk',
'xxpad',
'xxbos',
'xxeos',
'xxfld',
'xxrep',
'xxwrep',
'xxup',
'xxmaj',
'.',
'the',
'i',
'it',
'and',
',',
'to',
'a',
'is',
'this',
'my'
...
The DataBlock:
dls_class = DataBlock(blocks=(TextBlock.from_df('reviewText', seq_len=72, vocab=dls_nlp.vocab), RegressionBlock),
get_x=ColReader('reviewText'),
get_y=ColReader('squishedDifference'),
splitter=ColSplitter())
I tried this with a CategoryBlock as well, creating a 1/0 column in the df out of squishedDifference; the issue is the same.
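For reference, this is roughly how that 1/0 column was derived (a toy sketch with made-up values; the 0.5 threshold and the column name "label" are just examples, not the exact ones from my notebook):

```python
import pandas as pd

# Toy stand-in for my df: only the squishedDifference column matters here.
df = pd.DataFrame({"squishedDifference": [1.0, 1.0, 0.333333, 0.2]})

# Binarize: values above the (example) threshold become 1, otherwise 0.
threshold = 0.5
df["label"] = (df["squishedDifference"] > threshold).astype(int)

print(df["label"].tolist())  # [1, 1, 0, 0]
```

With a column like this, the DataBlock above just swaps RegressionBlock for CategoryBlock and points get_y at "label" instead, and the crash happens in exactly the same place.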
And here is where it goes south:
dls_class.summary(df)
Setting-up type transforms pipelines
Collecting items from asin ... is_valid
0 011040047X ... False
1 0110400550 ... False
2 0110400550 ... True
3 0110400550 ... True
4 0110400550 ... False
... ... ... ...
921501 B00LMO532S ... False
921502 B00LMO532S ... True
921503 B00LO11UBC ... False
921504 B00LQ6ICJ8 ... False
921505 B00LSVC814 ... False
[921131 rows x 9 columns]
Found 921131 items
2 datasets of sizes 736910,184221
Setting up Pipeline: ColReader -- {'cols': 'reviewText', 'pref': '', 'suff': '', 'label_delim': None} -> Tokenizer -> Numericalize
At this point it eats up all the RAM and then crashes. I have the runtime set to give me a lot of RAM, but in effect it looks as if it were stuck in a bad recursion.
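One thing I can still do to narrow it down is run the same pipeline on a small sample first, to see whether the crash scales with the number of rows. A sketch (pandas only; the toy df below stands in for my real ~921k-row frame, and df_small would replace df in the dls_class.summary(...) call):

```python
import pandas as pd

# Toy stand-in for my real df (which has ~921k rows and more columns).
df = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(100)],
    "is_valid": [i % 5 == 0 for i in range(100)],
})

# Sample 10% per split so the train/valid proportions are preserved,
# then pass df_small to dls_class.summary(...) instead of the full df.
df_small = df.groupby("is_valid", group_keys=False).sample(frac=0.1, random_state=0)

print(len(df_small))  # 10
```

If summary runs fine on the sample and only dies on the full frame, that would at least confirm it’s a plain data-size/RAM problem in the Tokenizer/Numericalize step rather than a bug in my DataBlock.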