Lesson 4 - imdb fit times

On AWS, it's taking about 25 min/epoch when running learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2) … and I just wanted to know if that sounds about right, or if something is off?

I don’t know how to gauge what is reasonable and what is not in terms of how long models take to fit an epoch’s worth of data, so any tips would be appreciated.

I believe you meant the word prediction fit not the sentiment one.

See if you can speed it up by removing 80% of the reviews from the “all” directory. I only used a few hundred text files but could still learn a pretty strong encoder.


@jeremy, could you share the weights from the model you showed in class?


Yah exactly.

Will try your suggestions. Thanks!

I would expect that simply doing fewer epochs on the whole dataset would be better. I don’t think sampling makes much sense with deep learning, unless you have too much data to run even a single epoch.


So you’d recommend reducing the # of epochs by 1/2 or as much as 80%?

@narvind2003, with your smaller training set, what kind of numbers were you seeing in terms of loss when training the language model? And accuracy when using it to train your sentiment model?

I would think that using fewer reviews to train the language model would have a negative impact when using it for sentiment tasks, as it would have learned about far fewer words, but maybe I’m wrong. Thoughts?

Yeah, and if you’re doing fewer epochs due to time constraints, it may also help to reduce the dropout, since you’re not as likely to overfit when doing fewer epochs.

You’re right. The accuracy is going to be low if you use a smaller dataset and learn to represent fewer words. As Jeremy says, it’s important to use all the data to increase accuracy. The only time you should sample is when you’re checking that your code is bug-free, that your GPU is working properly, etc. - sort of a preliminary run.
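For that kind of preliminary run, one way to carve out a small sample is to copy a random fraction of the review files into a scratch directory. This is just a sketch with a hypothetical `sample_files` helper (not from the lesson notebook), assuming the aclImdb layout of one review per .txt file:

```python
import random
import shutil
from pathlib import Path

def sample_files(src_dir, dst_dir, frac=0.2, seed=42):
    """Copy a random fraction of .txt files from src_dir to dst_dir.
    Intended only for a quick bug-hunting run, not final training."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(src.glob('*.txt'))
    random.Random(seed).shuffle(files)          # deterministic sample
    keep = files[:max(1, int(len(files) * frac))]
    for f in keep:
        shutil.copy(f, dst / f.name)
    return len(keep)
```

Once the code runs cleanly on the sample, point the paths back at the full “all” directories for the real training run.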

The other key insight is that the transfer works so well on the “imdb” sentiment task because we trained it for a harder problem (word prediction) on the same imdb dataset. It’s like you train for a full marathon but you’re participating in the half marathon.

If you learn word prediction on the imdb dataset, it might not do a great job on sentiment prediction for restaurants. It might do a decent job because it still learns a great deal about polarized sentences. Since you trained for a marathon, you will generally perform better at sports that require a lot of running, but it doesn’t automatically make you a star cricket player. You’ll need some specific cricket training for that. And obviously running skills will not make you any better at chess.

I mention these analogies so that you can plan to wisely use the right kind and amount of data, plus the appropriate models for the problem you’re trying to solve.


That is kind of where my intuition on this is leading me.

Just as training an image classifier on images like those in ImageNet takes less training when using a model pre-trained on ImageNet, so too will a sentiment classifier require less training when using a pre-trained language model built on text similar to that used in the sentiment analysis task.

This leads me to believe that language models are most helpful when they represent the kind of text you are looking at for whatever problem you are trying to solve. For example, I can imagine a language model built off of tweets would be better than one based on imdb reviews for classifying sentiment in social media posts.


You can try predicting Arxiv paper abstracts (from lesson 4) using the imdb trained encoder as a fun experiment.

Jeremy mentioned pre-trained vectors like GloVe, which are trained on large corpora like Wikipedia, billions of tweets, etc. They are widely used in NLP but suffer from the same issue - they struggle in narrow, domain-specific use cases where the language contains unseen tokens.
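A quick way to see that unseen-token problem is to measure how much of your domain vocabulary a general-purpose pretrained vocabulary actually covers. The word sets below are made up for illustration, not real GloVe data:

```python
def oov_rate(corpus_tokens, pretrained_vocab):
    """Fraction of distinct corpus tokens missing from the pretrained vocab."""
    vocab = set(corpus_tokens)
    missing = vocab - set(pretrained_vocab)
    return len(missing) / len(vocab)

# Toy illustration: domain jargon is unseen by a general-purpose vocab.
general_vocab = {'the', 'movie', 'was', 'great', 'protein'}
medical_tokens = ['the', 'p53', 'protein', 'was', 'phosphorylated']
print(oov_rate(medical_tokens, general_vocab))  # → 0.4
```

A high out-of-vocabulary rate is a sign that general pretrained vectors will carry less signal for your task.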

Sounds about right for an AWS p2 instance. Alternatively, check out this discussion before using a p3 instance. It will take 6-7 min/epoch.

But note these are pre-trained vectors, not pretrained models, so are far less powerful than what we used in the last lesson.


Good to know. Thanks!

@jeremy : Thanks! That’s a critical point to note. What you taught in lesson 4 is far superior to just a nice embedding layer.

Are there any good pre-trained encoders that we can springboard from? It seems like NLP could definitely benefit from these, as CV has with pre-trained VGG/ResNet?

So if I upgrade my p2 instance to p3 … is there anything else I need to do?

I think you are ready to go. To be on the safe side, I would put %time in front of learn.fit so you can see the run time per epoch. If it drops to 9-10 minutes/epoch for the first run, it is OK. The later epochs will run faster. If nothing improves, then redo the setup “From the Source”.

Warning! It will take a long time to train.

You’ll need to install pytorch from source to get good performance on p3


For some reason, this is how I see downloaded data. Is it correct?

My train, test folders are empty.


You need to create train/all and test/all directories.

Then copy train/pos, train/neg, and train/unsup into -> train/all

Then copy test/pos, test/neg into -> test/all

This is how I do it in a notebook:

import os

PATH = 'data/aclImdb'

os.makedirs(f'{PATH}/train/all', exist_ok=True)
os.makedirs(f'{PATH}/test/all', exist_ok=True)
os.makedirs(f'{PATH}/models', exist_ok=True)
os.makedirs(f'{PATH}/tmp', exist_ok=True)

TRN_PATH = 'train/all'
VAL_PATH = 'test/all'
TRN = f'{PATH}/{TRN_PATH}'
VAL = f'{PATH}/{VAL_PATH}'


Uncomment these lines to copy everything into the right directories:

# !cp -r {PATH}/train/pos/* {TRN}/
# !cp -r {PATH}/train/neg/* {TRN}/
# !cp -r {PATH}/train/unsup/* {TRN}/ # have to run this line in a terminal for it to work!

# !cp -r {PATH}/test/pos/* {VAL}/
# !cp -r {PATH}/test/neg/* {VAL}/

Thanks @wgpubs I was facing the same problem too.