Lesson 3: Problems with IMDB_SAMPLE sentiment analysis

Hello friends,

from fastai.text import *   # fastai v1; needed for untar_data, TextDataBunch, etc.

path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
data_lm = TextDataBunch.from_csv(path, 'texts.csv')
data_lm.save()
data_lm = load_data(path)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.lr_find()

This fails with:

RuntimeError: CUDA out of memory. Tried to allocate 3.18 GiB (GPU 0; 14.73 GiB total capacity; 11.55 GiB already allocated; 2.36 GiB free; 55.92 MiB cached)

bs=4
path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
data_lm = TextDataBunch.from_csv(path, 'texts.csv', bs=bs)
data_lm.save()
data_lm = load_data(path, bs=bs)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.lr_find()

This fails with:

ValueError: Expected input batch_size (6816) to match target batch_size (4).
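One plausible reading of this traceback (a plain-Python sketch, no fastai; the sequence length below is hypothetical, chosen only to mirror the 6816-vs-4 numbers): a language-model head produces one prediction per token, so the loss sees batch_size × sequence_length rows, while a classification-style DataBunch supplies one target per document.

```python
# Toy shape arithmetic: LM loss input rows vs. classification target rows.
bs = 4            # documents per batch (what the DataBunch produced)
seq_len = 1704    # tokens after flattening -- hypothetical, picked so bs*seq_len = 6816

lm_output_rows = bs * seq_len   # rows the LM loss receives as input
clas_target_rows = bs           # rows a classification target provides

print(lm_output_rows, clas_target_rows)  # 6816 4
```

This is consistent with the eventual fix below: a language model needs LM-style targets (shifted token sequences), not per-document labels.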

  • I have tried several variants of the code with small tweaks, but none of them gets IMDB_SAMPLE to run correctly.
  • I searched other fastai forum posts and other sites but could not find any solution.

Any help is greatly appreciated. If anyone has successfully run IMDB_SAMPLE, a pointer to the exact working notebook would help (I am fairly sure the one on the fastai GitHub is not working, at least on Colab).

Hey, didn’t you mention text_cols and label_cols while creating the data bunch? I think that’s the problem. With a GPU capacity of 14 GB, the code should run smoothly with bs=64. Let me know if you are still facing problems.

Thanks for the quick reply.

I checked the fastai code, and it seems there are two classes, TextLMDataBunch and TextClasDataBunch. So I replaced TextDataBunch with TextLMDataBunch, and everything works now.

Without explicitly specifying the text and label columns, it picked up both correctly (probably because they already have the canonical names).

Caveat: the accuracy of the language model does not increase beyond 28%, but that seems to be in line with the lesson notebook.
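For intuition on why ~28% is not alarming: the accuracy metric here is next-token prediction, which is hard even for a good model. A toy baseline (plain Python, made-up corpus, not fastai's metric code) that always guesses the single most common next token already lands in a similar range:

```python
# Naive next-token baseline: always predict the globally most frequent "next word".
from collections import Counter

corpus = "the movie was good the movie was bad the plot was thin".split()
targets = corpus[1:]                                  # each word's true next token
most_common = Counter(targets).most_common(1)[0][0]   # the single guess

hits = sum(1 for t in targets if t == most_common)
accuracy = hits / len(targets)
print(most_common, round(accuracy, 2))  # was 0.27
```

So a real language model sitting around 28-40% next-token accuracy is plausible, not broken.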

I tried out another code snippet which seems to give much higher accuracy (70%+). But the contents of data_lm look suspicious (lots of numbers, xxunk, and other markers mixed together). There is probably more code to add before it works correctly and its accuracy can be trusted.

data_lm = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)
                .label_from_df(cols=0)
                .label_for_lm()
                .databunch())
data_lm.save()
data_lm = load_data(path)

Below is the content of data_lm:

TextLMDataBunch;

Train: LabelList (800 items)
x: LMTextList
xxbos [ 2 5 xxunk 25 ... 10 5 0 xxunk ],xxbos [ 2 5 xxunk xxunk ... xxunk xxunk xxunk 10 ],xxbos [ 2 5 xxunk xxunk ... xxunk 6 xxunk 10 ],xxbos [ 2 5 xxunk xxunk ... 14 9 xxunk 10 ],xxbos [ 2 5 xxunk xxunk ... xxunk 14 xxunk 10 ]
y: LMLabelList
,,,,
Path: /root/.fastai/data/imdb_sample;

Valid: LabelList (200 items)
x: LMTextList
xxbos [ 2 5 xxunk xxunk ... 15 5 0 xxunk ],xxbos [ 2 xxunk xxunk xxunk ... 9 0 10 0 ],xxbos [ 2 5 xxunk xxunk ... 12 9 xxunk 10 ],xxbos [ 2 5 18 5 ... 0 14 0 10 ],xxbos [ 2 22 5 18 ... 11 6 0 0 ]
y: LMLabelList
,,,,
Path: /root/.fastai/data/imdb_sample;

Test: None
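One hedged guess about the bracketed numbers mixed with xxunk in that dump: the texts being tokenized look like they were already integer ids rather than raw strings, so almost every "word" fell outside the vocabulary and collapsed to the unknown token. A toy numericalizer (plain Python, invented vocabulary, loosely mimicking fastai's behavior) shows the mechanism:

```python
# Toy numericalization: out-of-vocabulary tokens map to the special xxunk id (0).
vocab = ["xxunk", "xxbos", "the", "movie", "was"]
stoi = {w: i for i, w in enumerate(vocab)}

def numericalize(tokens):
    # any token missing from the vocab falls back to xxunk's index
    return [stoi.get(t, stoi["xxunk"]) for t in tokens]

ids = numericalize("xxbos the movie was terrific".split())
print(ids)                       # [1, 2, 3, 4, 0] -- 'terrific' is out of vocab
back = [vocab[i] for i in ids]   # round-trip makes the xxunk visible
print(back)
```

If the pipeline fed such id sequences back in as text, the dump above (digits interleaved with xxunk) is exactly what you would expect to see.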

An accuracy of 70% is suspicious for a language model. To increase the accuracy of the previous model, run lr_find and train with a proper learning rate; the final accuracy should be around 40%.
