Hi!
I’m getting to know fast.ai, and so far I love it! Really great people, great course, and great API!
I’m trying to train an NLP regression model (or classification; the issue is the same) on Google Colab (paid edition), but no matter what I try, it crashes at the .dataloaders or .summary step because it runs out of working memory. I’ve been stuck on this issue for weeks now, read everything I could find, and tweaked as much as I could, but nothing helped. So, as a last resort, I’m asking for help here; maybe you can spot something I couldn’t! Here is my code, as well as the notebook itself:
the notebook
df.head()
is_valid squishedDifference reviewText
True 1.000000 The case pictured is a soft violet color, but the case cover I received was a dark purple. While I'm sure the quality of the product is fine, the color is very different.
True 1.000000 case fits perfectly and I always gets compliments on it its hasn't cracked when I dropped it. wonderful and protective.
False 1.000000 Best phone case ever . Everywhere I go I get a ton of compliments on it. It was in perfect condition as well.
False 0.333333 It may look cute. This case started off pretty good the first couple of weeks. Then it started to slide off. It slid off one day I was in a parking lot phone fell face down and my glass shattered. TERRIBLE CASE!!!!!
False 0.200000 ITEM NOT SENT from Blue Top Company in Hong Kong and it's been over two months! I will report this. DO NOT use this company. Not happy at all!
dls_nlp.one_batch()
(LMTensorText([[ 2, 8, 10, ..., 4226, 61, 4369],
[ 30, 66, 222, ..., 31, 16, 182],
[ 9, 8, 574, ..., 17, 35, 43],
...,
[ 24, 18, 21, ..., 20, 423, 20],
[ 296, 22, 10, ..., 789, 16, 127],
[1171, 15, 60, ..., 43, 110, 23]], device='cuda:0'),
TensorText([[ 8, 10, 246, ..., 61, 4369, 14],
[ 66, 222, 13, ..., 16, 182, 107],
[ 8, 574, 353, ..., 35, 43, 9],
...,
[ 18, 21, 17, ..., 423, 20, 13947],
[ 22, 10, 138, ..., 16, 127, 501],
[ 15, 60, 10, ..., 110, 23, 141]], device='cuda:0'))
dls_nlp.vocab
['xxunk',
'xxpad',
'xxbos',
'xxeos',
'xxfld',
'xxrep',
'xxwrep',
'xxup',
'xxmaj',
'.',
'the',
'i',
'it',
'and',
',',
'to',
'a',
'is',
'this',
'my'
...
The DataBlock:
dls_class = DataBlock(blocks=(TextBlock.from_df('reviewText', seq_len=72, vocab=dls_nlp.vocab), RegressionBlock),
get_x=ColReader('reviewText'),
get_y=ColReader('squishedDifference'),
splitter=ColSplitter())
I tried this with a CategoryBlock as well, creating a 1/0 column in the df out of squishedDifference; the issue is the same.
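For reference, this is roughly how that 1/0 column was derived (a toy sketch with made-up values; the 0.5 threshold and the column name "label" are just examples, not the exact ones from my notebook):

```python
import pandas as pd

# Toy stand-in for my df: only the squishedDifference column matters here.
df = pd.DataFrame({"squishedDifference": [1.0, 1.0, 0.333333, 0.2]})

# Binarize: values above the (example) threshold become 1, otherwise 0.
threshold = 0.5
df["label"] = (df["squishedDifference"] > threshold).astype(int)

print(df["label"].tolist())  # [1, 1, 0, 0]
```

With a column like this, the DataBlock above just swaps RegressionBlock for CategoryBlock and points get_y at "label" instead, and the crash happens in exactly the same place.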
And here is where it goes south:
dls_class.summary(df)
Setting-up type transforms pipelines
Collecting items from asin ... is_valid
0 011040047X ... False
1 0110400550 ... False
2 0110400550 ... True
3 0110400550 ... True
4 0110400550 ... False
... ... ... ...
921501 B00LMO532S ... False
921502 B00LMO532S ... True
921503 B00LO11UBC ... False
921504 B00LQ6ICJ8 ... False
921505 B00LSVC814 ... False
[921131 rows x 9 columns]
Found 921131 items
2 datasets of sizes 736910,184221
Setting up Pipeline: ColReader -- {'cols': 'reviewText', 'pref': '', 'suff': '', 'label_delim': None} -> Tokenizer -> Numericalize
At this point it eats up all the RAM and then crashes. I have the runtime set to give me a lot of RAM, but in effect it looks as if it were stuck in a bad recursion.
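One thing I can still do to narrow it down is run the same pipeline on a small sample first, to see whether the crash scales with the number of rows. A sketch (pandas only; the toy df below stands in for my real ~921k-row frame, and df_small would replace df in the dls_class.summary(...) call):

```python
import pandas as pd

# Toy stand-in for my real df (which has ~921k rows and more columns).
df = pd.DataFrame({
    "reviewText": [f"review {i}" for i in range(100)],
    "is_valid": [i % 5 == 0 for i in range(100)],
})

# Sample 10% per split so the train/valid proportions are preserved,
# then pass df_small to dls_class.summary(...) instead of the full df.
df_small = df.groupby("is_valid", group_keys=False).sample(frac=0.1, random_state=0)

print(len(df_small))  # 10
```

If summary runs fine on the sample and only dies on the full frame, that would at least confirm it’s a plain data-size/RAM problem in the Tokenizer/Numericalize step rather than a bug in my DataBlock.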