Lesson4-imdb fit times

I would expect that simply doing fewer epochs on the whole dataset would be better. I don’t think sampling makes much sense with deep learning, unless you have too much data to even run a single epoch.

2 Likes

So you’d recommend reducing the # of epochs by 1/2 or as much as 80%?

@narvind2003, with your smaller training set, what kind of numbers were you seeing in terms of loss when training the language model? And accuracy when using it to train your sentiment model?

I would think that using fewer reviews to train the language model would have a negative impact on using it for sentiment tasks, as it would have learned far fewer words, but maybe I’m wrong. Thoughts?

Yeah, also if you’re doing fewer epochs due to time constraints, it may be helpful to reduce the dropout, since you’re not as likely to overfit with fewer epochs.
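Something like this, for example - a rough sketch only, assuming the fastai 0.7 setup from the lesson 4 notebook (with md, opt_fn, em_sz, nh and nl already defined there) and that md.get_model passes the AWD-LSTM dropout keyword arguments through:

# Rough sketch: roughly halve the notebook's dropout values if you also
# roughly halve the number of epochs (assumes the lesson 4 objects exist).
learner = md.get_model(opt_fn, em_sz, nh, nl,
                       dropouti=0.025, dropout=0.025, wdrop=0.05,
                       dropoute=0.01, dropouth=0.025)
learner.fit(3e-3, 2, wds=1e-6, cycle_len=1)  # fewer epochs than the full run

The exact values aren’t important - the point is just that less training usually means you can afford less regularization.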

You’re right. The accuracy is going to be low if you use a smaller dataset and learn to represent fewer words. As Jeremy says, it’s important to use all the data to increase accuracy. The only time you should sample is when you’re checking that your code is bug-free, that your GPU is working properly, etc. - sort of a preliminary run.

The other key insight is that the transfer works so well on the “imdb” sentiment task because we trained it on a harder problem (word prediction) on the same imdb dataset. It’s like training for a full marathon when you’re only running a half marathon.

If you learn word prediction on the imdb dataset, it might not do a great job on sentiment prediction for restaurant reviews. It might still do a decent job, because it learns a great deal about polarized sentences. Since you trained for a marathon, you will generally perform better at sports that require a lot of running, but it doesn’t automatically make you a star cricket player. You’ll need some specific cricket training for that. And obviously running skills will not make you any better at chess.

I mention these analogies so that you can plan to wisely use the right kind and amount of data, plus the appropriate models for the problem you’re trying to solve.

3 Likes

That is kind of where my intuition on this is leading me.

Just as training an image classifier on images like the ones in ImageNet takes less training when using a pre-trained model based on ImageNet, so too will a sentiment classifier require less training when using a pre-trained language model based on text similar to what we are using for the sentiment analysis task.

This leads me to believe that language models are most helpful when they represent the kind of text you are looking at for whatever problem you are trying to solve. For example, I can imagine a language model built off of tweets would be better than one based on imdb reviews for classifying sentiment in social media posts.

Yes.

You can try predicting Arxiv paper abstracts (from lesson 4) using the imdb trained encoder as a fun experiment.

Jeremy mentioned pre-trained vectors like GloVe, which are trained on large corpora like Wikipedia, billions of tweets, etc. They are widely used in NLP but suffer from the same issue - they struggle in narrow, domain-specific use cases where the language contains unseen tokens.

Sounds about right for an AWS p2 instance. Alternatively, check out this discussion before using a p3 instance. It will take 6-7 min/epoch.

But note these are pre-trained vectors, not pre-trained models, so they are far less powerful than what we used in the last lesson.
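For anyone who wants to poke at them, here is a minimal sketch of loading GloVe vectors with torchtext (assumes torchtext is installed; the vectors are downloaded and cached on first use):

from torchtext.vocab import GloVe

# 100-dimensional vectors trained on a 6B-token corpus.
glove = GloVe(name='6B', dim=100)

print(glove['movie'].shape)      # one fixed 100-d vector per token
print(glove['imdbesque'].sum())  # out-of-vocab tokens default to a zero vector

Each token gets a single fixed vector with no trained encoder on top - which is exactly why they are less powerful than the language model from the lesson.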

2 Likes

Good to know. Thanks!

@jeremy: Thanks! That’s a critical point to note. What you taught in lesson 4 is far superior to just a nice embedding layer.

Are there any good pre-trained encoders that we can springboard from? It seems like NLP could definitely benefit from these the way CV has with pre-trained VGG/ResNet.

So if I upgrade my p2 instance to p3 … is there anything else I need to do?

I think you are ready to go. To be safe, I would put %time in front of learn.fit so you can see the run time per epoch. If it drops to 9-10 minutes/epoch for the first run, it is OK. The later epochs will run faster. If nothing improves, then re-do the setup from source.
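For reference, wrapping the fit call looks something like this (a sketch, assuming learner is the language model learner from the lesson 4 notebook):

# %time reports the wall-clock time for the whole call; the per-epoch
# time also shows up in the progress output.
%time learner.fit(3e-3, 1, wds=1e-6, cycle_len=1)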

Warning! It will take a long time to train.

You’ll need to install pytorch from source to get good performance on p3
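The rough outline is something like the below - treat it as a sketch and follow the official PyTorch README for your CUDA version and the full dependency list (run in a terminal, not the notebook):

# assumes CUDA and the build dependencies from the PyTorch README are installed
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
python setup.py install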

2 Likes

For some reason, this is how I see the downloaded data. Is it correct?


My train, test folders are empty.

1 Like

You need to create train/all and test/all directories.

Then copy train/pos, train/neg, and train/unsup into train/all.

Then copy test/pos and test/neg into test/all.

This is how I do it in a notebook:

import os

PATH = 'data/aclImdb'

os.makedirs(f'{PATH}/train/all', exist_ok=True)
os.makedirs(f'{PATH}/test/all', exist_ok=True)
os.makedirs(f'{PATH}/models', exist_ok=True)
os.makedirs(f'{PATH}/tmp', exist_ok=True)

TRN_PATH = 'train/all'
VAL_PATH = 'test/all'

TRN = f'{PATH}/{TRN_PATH}'
VAL = f'{PATH}/{VAL_PATH}'

Uncomment these lines to copy everything into the right dirs:

# !!cp -r {PATH}/train/pos/* {TRN}/
# !!cp -r {PATH}/train/neg/* {TRN}/
# !!cp -r {PATH}/train/unsup/* {TRN}/ # have to run this line in terminal for it to work!

# !!cp -r {PATH}/test/pos/* {VAL}/
# !!cp -r {PATH}/test/neg/* {VAL}/
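
A quick sanity check afterwards (optional) - both directories should be non-empty once the copies finish:

!ls {TRN} | wc -l  # expect 75000 files (pos + neg + unsup)
!ls {VAL} | wc -l  # expect 25000 files (pos + neg)
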
3 Likes

Thanks @wgpubs I was facing the same problem too.

Note that the updated version of the IMDB notebook suggests downloading the data from here: http://files.fast.ai/data/aclImdb.tgz. This version already has the all folder created for you.
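If you want to grab that version from inside the notebook, something like this should work (assumes wget and tar are available and that you keep the data under data/):

# Download and extract the pre-arranged IMDB data into data/
!wget http://files.fast.ai/data/aclImdb.tgz -P data/
!tar -xzf data/aclImdb.tgz -C data/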

3 Likes

In the interest of not spawning too many threads, I’m posting a problem I’m currently facing here.

I’m currently trying out the lesson 4 notebook on my Windows 10 machine + PyTorch GPU (based on the latest update from Jeremy).

Came upon the error below. Any ideas?

Update: Ran the same code in AWS and didn’t get the error.

Adding to this, it would be great if @jeremy could share the trained model so we don’t have to run through the whole process. Thanks!

Hi, I am having a similar problem where each epoch takes around 20 minutes to run. I am working on a Paperspace GPU+; I am just wondering if this is normal/unavoidable or if there is something I can do to speed up the process.