Lesson 2 In-Class Discussion

That’s the definition of a cycle.

Hello @pete.condon, I followed these steps in my Crestle account:

  • Launch Jupyter Notebook
  • Open a new terminal
  • Using the cd command, go to the courses/dl1/fastai folder
  • Download the weights file with the command: wget http://files.fast.ai/models/weights.tgz
  • Extract the archive with the command: tar -xvzf weights.tgz (a “weights” folder will be created automatically, with the weights inside)

Thanks! One correction: that should be the courses/dl1/fastai/fastai folder. i.e. inside the fastai repo, there’s a fastai modules folder.

Great blog post from Rachel about choosing your validation set wisely!
http://www.fast.ai/2017/11/13/validation-sets/

The resnext notebook uses 0.01, but if I use the lr finder it suggests 0.001. This is where I get confused.

The rule of thumb is: pick the next lower order of magnitude before the minimum point. That’s 0.01 in your chart.
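
For reference, a minimal sketch of that rule of thumb, assuming learn is the pretrained ConvLearner from the lesson notebook (fastai 0.7); the specific numbers are just illustrative:

    learn.lr_find()       # sweep learning rates over (part of) an epoch
    learn.sched.plot()    # plot loss vs. learning rate (log scale)

    # If the loss bottoms out around lr = 1e-1, pick the next lower
    # order of magnitude and train with that:
    lr = 1e-2
    learn.fit(lr, 2)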

Something that’s not very intuitive to me yet is when should we/should we not use learn.unfreeze() during training?

I remember @jeremy mentioning in class that for, say, the dog breed competition on Kaggle, the dataset is similar to (or a subset of) the ImageNet dataset that the resnext101_64 model was trained on.

If I retrain my network using unfreeze() in such a case, would that lead to overfitting?

learn.unfreeze() is for when you want to retrain the earlier layers of the model. For the dog breed competition, choosing to re-train/fine-tune the earlier layers of the model won’t help the classification task, because this dataset is essentially a subset of ImageNet. Since the model you are using was already pre-trained on ImageNet, it doesn’t help to re-train those earlier layers of the network. In most cases the dataset you are working with will be at least somewhat different from ImageNet, so in those cases it would help to unfreeze.

The reason we do want to retrain the top layers is that our classification task here is to detect the breed of a dog (120 classes), whereas the ImageNet classification is over 1000 different classes, only some of which are dogs.

Yes, quite likely. Make the learning rates of the first two layer groups really low to avoid this.

(In practice, it’s rare to have datasets as similar to ImageNet as these.)
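
As a rough sketch of that advice, again assuming learn is the pretrained ConvLearner from the lesson notebook (fastai 0.7); the specific rates below are illustrative assumptions, not recommended values:

    import numpy as np

    learn.precompute = False              # needed before training earlier layers
    learn.unfreeze()                      # make the earlier layer groups trainable
    lrs = np.array([1e-5, 1e-4, 1e-2])    # very low rates for the first two layer groups
    learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)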

The name is the ‘suffix’ of the URL for the Kaggle competition’s page. For dog breeds, for example, the URL is:
https://www.kaggle.com/c/dog-breed-identification
so the key for the competition is just
dog-breed-identification

My issue was that I was using my username from kaggle instead of my email address.

Hello fellow humans - I found an excellent video on YouTube that I’d like to share with y’all: https://www.youtube.com/watch?v=nhqo0u1a6fw. Siraj describes the evolution of gradient descent: how it evolved from plain gradient descent to stochastic gradient descent, to mini-batch stochastic gradient descent, to adaptive gradient descent, and so on. This video helped me understand how these different descent techniques, called optimizers, perform and how they compare.

Note: towards the end of the video, he says that adaptive gradients and momentum work best when data is sparse, and that this is normally the case with real-world datasets. So I’m left with the question: what is “enough” data for mini-batch stochastic gradient descent to outperform the other optimizers?
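
For anyone who wants to experiment with these optimizers directly, here is a toy PyTorch sketch (the data and model are made up purely for illustration); swapping the optimizer line is all it takes to compare plain mini-batch SGD, SGD with momentum, and the adaptive Adam:

    import torch
    import torch.nn as nn

    # Toy regression problem, purely illustrative
    x = torch.randn(128, 10)
    y = torch.randn(128, 1)
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()

    # Pick one optimizer to train with:
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)                  # mini-batch SGD
    # opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # SGD + momentum
    # opt = torch.optim.Adam(model.parameters(), lr=1e-3)               # adaptive (Adam)

    for epoch in range(10):
        for i in range(0, len(x), 32):       # mini-batches of 32
            xb, yb = x[i:i+32], y[i:i+32]
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()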

I think it will be an upcoming topic in the course. But yes, I have the same questions as you 🙂

Hello @KevinB, did you get an answer to your question about sz?

I didn’t ever see any answer on this one

They are the sizes that ImageNet models are generally trained at. You get the best results if you use the same size as the original training size. Since people don’t tend to mention what size was used originally, you can try both with something like dogs v cats and see which works better. More recent models seem to generally use 299.
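
A quick sketch of trying both with the course’s fastai (0.7) library; the PATH and architecture here are assumptions standing in for your own setup:

    from fastai.conv_learner import *

    PATH = 'data/dogscats/'    # hypothetical path to the dogs v cats data
    arch = resnet34

    # Train the same pretrained model at both sizes and compare validation accuracy
    for sz in (224, 299):
        tfms = tfms_from_model(arch, sz)
        data = ImageClassifierData.from_paths(PATH, tfms=tfms)
        learn = ConvLearner.pretrained(arch, data, precompute=True)
        learn.fit(1e-2, 2)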

Hi @jeremy

As a follow-up to my posting the Video Timelines for Lesson 3, please find below a tentative draft for Lesson 2.
I’ll let you review it and make the necessary changes to add it to the Wiki of Lesson 2.
Best regards,

Eric.


Video timelines for Lesson 2

  • 00:01:01 Lesson 1 review, image classifier,
    PATH structure for training, learning rate,
    what are the four columns of numbers in “A Jupyter Widget”

  • 00:04:45 What is a Learning Rate (LR), LR Finder, mini-batch, ‘learn.sched.plot_lr()’ & ‘learn.sched.plot()’, ADAM optimizer intro

  • 00:15:00 How to improve your model with more data,
    avoid overfitting, use different data augmentation ‘aug_tfms=’

  • 00:18:30 More questions on using Learning Rate Finder

  • 00:24:10 Back to Data Augmentation (DA),
    ‘tfms=’ and ‘precompute=True’, visual examples of Layer detection and activation in pre-trained
    networks like ImageNet. Difference between your own computer or AWS, and Crestle.

  • 00:29:10 Why use ‘learn.precompute=False’ for Data Augmentation, impact on Accuracy / Train Loss / Validation Loss

  • 00:30:15 Why use ‘cycle_len=1’, learning rate annealing,
    cosine annealing, Stochastic Gradient Descent with Restarts (SGDR), Ensembles; “Jeremy’s superpower”

  • 00:40:35 Save your model weights with ‘learn.save()’ & ‘learn.load()’, the folders ‘tmp’ & ‘models’

  • 00:42:45 Question on training a model “from scratch”

  • 00:43:45 Fine-tuning and differential learning rate,
    ‘learn.unfreeze()’, ‘lr=np.array()’, ‘learn.fit(lr, 3, cycle_len=1, cycle_mult=2)’

  • 00:55:30 Advanced questions: “why do smoother surfaces correlate to more generalized networks?” and more.

  • 01:05:30 “Is the Fast.ai library used in this course, on top of PyTorch, open-source?”, and why Fast.ai switched from Keras+TensorFlow to PyTorch, creating a high-level library on top.

PAUSE

  • 01:11:45 Confusion matrix ‘plot_confusion_matrix()’

  • 01:13:45 Easy 8 steps to train a world-class image classifier

  • 01:16:30 New demo with Dog_Breeds_Identification competition on Kaggle, download/import data from Kaggle with ‘kaggle-cli’, using CSV files with Pandas. ‘pd.read_csv()’, ‘df.pivot_table()’, ‘val_idxs = get_cv_idxs()’

  • 01:29:15 Dog_Breeds initial model, image_size = 64,
    CUDA Out Of Memory (OOM) error

  • 01:32:45 Undocumented Pro-Tip from Jeremy: train on a small image size, then use ‘learn.set_data()’ with a larger-size dataset (like 299 instead of 224 pixels)

  • 01:36:15 Using Test Time Augmentation (‘learn.TTA()’) again

  • 01:48:10 How to improve a model/notebook on Dog_Breeds: increase the image size and use a better architecture.
    ResNeXt (with an X) compared to ResNet. Warning for GPU users: the X version can take 2-4 times more memory, so the batch size needs to be reduced to avoid an OOM error

  • 01:53:00 Quick test on Amazon Satellite imagery competition on Kaggle, with multi-labels

  • 01:56:30 Back to your hardware deep learning setup: Crestle vs Paperspace, and AWS, who gave approx. $200,000 of computing credits to Fast.ai Part1 V2.
    More tips on setting up your AWS system as a Fast.ai student: Amazon Machine Image (AMI), ‘p2.xlarge’,
    ‘aws key pair’, ‘ssh-keygen’, ‘id_rsa.pub’, ‘import key pair’, ‘git pull’, ‘conda env update’, and how to shut down your $0.90-an-hour instance with ‘Instance State => Stop’

Great @EricPB!

A batch means the whole dataset, whereas a mini-batch is a subset of the dataset, sized so that it can be loaded into memory. For example, if a dataset contains 128 images you can divide it into mini-batches of 32 images, which gives 4 mini-batches. There may be an end case where the last mini-batch does not contain the same number of images as the earlier ones. For example, if a dataset contains 150 images and each mini-batch contains 32 images, there would be 4 mini-batches with 32 images and the last mini-batch would have only 22 images.
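
A tiny Python sketch of that arithmetic (the numbers match the example above):

    def split_into_minibatches(n_items, batch_size):
        # number of full mini-batches, plus the size of a smaller final batch (if any)
        full, remainder = divmod(n_items, batch_size)
        return full, remainder

    print(split_into_minibatches(128, 32))   # (4, 0)  -> 4 full mini-batches
    print(split_into_minibatches(150, 32))   # (4, 22) -> 4 full mini-batches + a last one of 22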

There are two types of problems that a learning rate can cause. If the learning rate is very small, training might take forever to converge, or never reach the minimum at all. If it is very large, the updates will oscillate and can hop right out of the valley (remember the mountain Jeremy drew).
How I see it: the minimum is the point after which the loss worsens. A learning rate that is too large risks making gradient descent hop out of the valley entirely. On the other hand, if you carefully choose a learning rate at which the loss is still improving, the problem of gradient descent hopping out is avoided, and, since the loss is still improving, the problem of taking forever to converge (or not converging at all) is averted as well.
Finally, I’d infer that choosing a learning rate in this way is largely an experience-based decision.
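
A toy illustration of both failure modes, using plain gradient descent on f(x) = x² (the function and step sizes are made up purely to show the effect):

    def gradient_descent(lr, steps=20, x=5.0):
        # minimise f(x) = x**2, whose gradient is 2*x
        for _ in range(steps):
            x -= lr * 2 * x
        return x

    print(gradient_descent(lr=0.001))   # ~4.8   -> barely moves: far too slow to converge
    print(gradient_descent(lr=0.1))     # ~0.06  -> steadily approaches the minimum at 0
    print(gradient_descent(lr=1.1))     # ~192   -> each step overshoots; the loss blows up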

Hi Rachana, I am using Google Colab and I am getting the same error that you did. Just wanted to ask whether this solved your problem?