Lesson 2 In-Class Discussion

That’s the definition of a cycle.

Hello @pete.condon, I followed these steps in my Crestle account:

  • Launch Jupyter Notebook
  • Open a new terminal
  • Using the cd command, go to the courses/dl1/fastai folder
  • Download the weights file with the command: wget http://files.fast.ai/models/weights.tgz
  • Extract the archive with the command: tar -xvzf weights.tgz (a “weights” folder will be created automatically, with the weights inside)

Thanks! One correction: that should be the courses/dl1/fastai/fastai folder. i.e. inside the fastai repo, there’s a fastai modules folder.

Great blog post from Rachel about choosing your validation set wisely!
http://www.fast.ai/2017/11/13/validation-sets/

The resnext notebook uses 0.01, but if I use the lr finder it suggests 0.001. This is where I get confused.

The rule of thumb is: pick the next lower order of magnitude before the minimum point. That’s 0.01 in your chart.
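
For reference, a minimal sketch of that rule of thumb, assuming learn is the pretrained ConvLearner from the lesson notebook (fastai 0.7); the specific numbers are just illustrative:

    learn.lr_find()       # sweep learning rates over (part of) an epoch
    learn.sched.plot()    # plot loss vs. learning rate (log scale)

    # If the loss bottoms out around lr = 1e-1, pick the next lower
    # order of magnitude and train with that:
    lr = 1e-2
    learn.fit(lr, 2)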

Something that’s not very intuitive to me yet is when should we/should we not use learn.unfreeze() during training?

I remember @jeremy mentioning in class that for, say, the dog breed competition on Kaggle, the dataset is similar to (or a subset of) the ImageNet dataset that the resnext101_64 model was trained on.

If I retrain my network using unfreeze() in such a case, would that lead to overfitting?

learn.unfreeze() is for when you want to retrain the earlier layers of the model. For the dog breed competition, choosing to re-train/fine-tune the earlier layers of the model won’t help the classification task, because this dataset is essentially a subset of ImageNet. Since the model you are using was already pre-trained on ImageNet, it doesn’t help to re-train those earlier layers of the network. In most cases the dataset you are working with will be at least somewhat different from ImageNet, so in those cases it would help to unfreeze.

The reason we do want to retrain the top layers is that our classification task here is to detect the breed of a dog (120 classes), whereas the ImageNet classification is over 1000 different classes, only some of which are dogs.

Yes, quite likely. Make the learning rates of the first two layer groups really low to avoid this.

(In practice, it’s rare to have datasets as similar to ImageNet as these.)
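
As a rough sketch of that advice, again assuming learn is the pretrained ConvLearner from the lesson notebook (fastai 0.7); the specific rates below are illustrative assumptions, not recommended values:

    import numpy as np

    learn.precompute = False              # needed before training earlier layers
    learn.unfreeze()                      # make the earlier layer groups trainable
    lrs = np.array([1e-5, 1e-4, 1e-2])    # very low rates for the first two layer groups
    learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)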

The name is the ‘suffix’ of the URL for the Kaggle competition’s page. For dog breeds, for example, the URL is:
https://www.kaggle.com/c/dog-breed-identification
so the key for the competition is just
dog-breed-identification

My issue was that I was using my username from kaggle instead of my email address.

Hello fellow humans - I found an excellent video on YouTube that I’d like to share with y’all: https://www.youtube.com/watch?v=nhqo0u1a6fw. Siraj describes the evolution of gradient descent: how it evolved from plain gradient descent to stochastic gradient descent, to mini-batch stochastic gradient descent, to adaptive gradient descent, and so on. This video helped me understand how these different descent techniques, called optimizers, perform and how they compare.

Note: towards the end of the video, he says that adaptive gradients and momentum work best when data is sparse, and that this is normally the case with real-world datasets. So I’m left with the question: what is “enough” data for mini-batch stochastic gradient descent to outperform the other optimizers?
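
For anyone who wants to experiment with these optimizers directly, here is a toy PyTorch sketch (the data and model are made up purely for illustration); swapping the optimizer line is all it takes to compare plain mini-batch SGD, SGD with momentum, and the adaptive Adam:

    import torch
    import torch.nn as nn

    # Toy regression problem, purely illustrative
    x = torch.randn(128, 10)
    y = torch.randn(128, 1)
    model = nn.Linear(10, 1)
    loss_fn = nn.MSELoss()

    # Pick one optimizer to train with:
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)                  # mini-batch SGD
    # opt = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # SGD + momentum
    # opt = torch.optim.Adam(model.parameters(), lr=1e-3)               # adaptive (Adam)

    for epoch in range(10):
        for i in range(0, len(x), 32):       # mini-batches of 32
            xb, yb = x[i:i+32], y[i:i+32]
            opt.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            opt.step()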

I think it will be an upcoming topic in the course. But yes, I have the same questions as you 🙂

Hello @KevinB, did you get an answer to your question about sz?

I didn’t ever see any answer on this one

They are the sizes that ImageNet models are generally trained at. You get the best results if you use the same size as the original training size. Since people don’t tend to mention what size was used originally, you can try both with something like dogs v cats and see which works better. More recent models seem to generally use 299.
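
A quick sketch of trying both with the course’s fastai (0.7) library; the PATH and architecture here are assumptions standing in for your own setup:

    from fastai.conv_learner import *

    PATH = 'data/dogscats/'    # hypothetical path to the dogs v cats data
    arch = resnet34

    # Train the same pretrained model at both sizes and compare validation accuracy
    for sz in (224, 299):
        tfms = tfms_from_model(arch, sz)
        data = ImageClassifierData.from_paths(PATH, tfms=tfms)
        learn = ConvLearner.pretrained(arch, data, precompute=True)
        learn.fit(1e-2, 2)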

Hi @jeremy

As a follow-up to my posting the Video Timelines for Lesson 3, please find below a tentative draft for Lesson 2.
I’ll let you review it and make the necessary changes to add it to the Wiki of Lesson 2.
Best regards,

Eric.


Video timelines for Lesson 2

  • 00:01:01 Lesson 1 review, image classifier,
    PATH structure for training, learning rate,
    what are the four columns of numbers in “A Jupyter Widget”

  • 00:04:45 What is a Learning Rate (LR), LR Finder, mini-batch, ‘learn.sched.plot_lr()’ & ‘learn.sched.plot()’, ADAM optimizer intro

  • 00:15:00 How to improve your model with more data,
    avoid overfitting, use different data augmentation ‘aug_tfms=’

  • 00:18:30 More questions on using Learning Rate Finder

  • 00:24:10 Back to Data Augmentation (DA),
    ‘tfms=’ and ‘precompute=True’, visual examples of Layer detection and activation in pre-trained
    networks like ImageNet. Difference between your own computer or AWS, and Crestle.

  • 00:29:10 Why use ‘learn.precompute=False’ for Data Augmentation, impact on Accuracy / Train Loss / Validation Loss

  • 00:30:15 Why use ‘cycle_len=1’, learning rate annealing,
    cosine annealing, Stochastic Gradient Descent with Restarts (SGDR), Ensembles; “Jeremy’s superpower”

  • 00:40:35 Save your model weights with ‘learn.save()’ & ‘learn.load()’, the folders ‘tmp’ & ‘models’

  • 00:42:45 Question on training a model “from scratch”

  • 00:43:45 Fine-tuning and differential learning rate,
    ‘learn.unfreeze()’, ‘lr=np.array()’, ‘learn.fit(lr, 3, cycle_len=1, cycle_mult=2)’

  • 00:55:30 Advanced questions: “why do smoother surfaces correlate to more generalized networks?” and more.

  • 01:05:30 “Is the Fast.ai library used in this course, on top of PyTorch, open-source?”, and why Fast.ai switched from Keras+TensorFlow to PyTorch, creating a high-level library on top.

PAUSE

  • 01:11:45 Confusion matrix ‘plot_confusion_matrix()’

  • 01:13:45 Easy 8 steps to train a world-class image classifier

  • 01:16:30 New demo with Dog_Breeds_Identification competition on Kaggle, download/import data from Kaggle with ‘kaggle-cli’, using CSV files with Pandas. ‘pd.read_csv()’, ‘df.pivot_table()’, ‘val_idxs = get_cv_idxs()’

  • 01:29:15 Dog_Breeds initial model, image_size = 64,
    CUDA Out Of Memory (OOM) error

  • 01:32:45 Undocumented Pro-Tip from Jeremy: train on a small image size, then use ‘learn.set_data()’ with a larger-size dataset (like 299 instead of 224 pixels)

  • 01:36:15 Using Test Time Augmentation (‘learn.TTA()’) again

  • 01:48:10 How to improve a model/notebook on Dog_Breeds: increase the image size and use a better architecture.
    ResNeXt (with an X) compared to ResNet. Warning for GPU users: the X version can take 2-4 times more memory, so the batch size needs to be reduced to avoid an OOM error

  • 01:53:00 Quick test on Amazon Satellite imagery competition on Kaggle, with multi-labels

  • 01:56:30 Back to your hardware deep learning setup: Crestle vs Paperspace, and AWS, who gave approx. $200,000 of computing credits to Fast.ai Part1 V2.
    More tips on setting up your AWS system as a Fast.ai student: Amazon Machine Image (AMI), ‘p2.xlarge’,
    ‘aws key pair’, ‘ssh-keygen’, ‘id_rsa.pub’, ‘import key pair’, ‘git pull’, ‘conda env update’, and how to shut down your $0.90-an-hour instance with ‘Instance State => Stop’

Great @EricPB!

A batch means the whole dataset, whereas a mini-batch is a subset of the dataset, sized so that it can be loaded into memory. For example, if a dataset contains 128 images you can divide it into mini-batches of 32 images, which gives 4 mini-batches. There may be an end case where the last mini-batch does not contain the same number of images as the earlier ones. For example, if a dataset contains 150 images and each mini-batch contains 32 images, there would be 4 mini-batches with 32 images and the last mini-batch would have only 22 images.
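
A tiny Python sketch of that arithmetic (the numbers match the example above):

    def split_into_minibatches(n_items, batch_size):
        # number of full mini-batches, plus the size of a smaller final batch (if any)
        full, remainder = divmod(n_items, batch_size)
        return full, remainder

    print(split_into_minibatches(128, 32))   # (4, 0)  -> 4 full mini-batches
    print(split_into_minibatches(150, 32))   # (4, 22) -> 4 full mini-batches + a last one of 22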

There are two types of problems that a learning rate can cause. If the learning rate is very small, training might take forever to converge, or never reach the minimum at all. If it is very large, the updates will oscillate and can hop right out of the valley (remember the mountain Jeremy drew).
How I see it: the minimum is the point after which the loss worsens. A learning rate that is too large risks making gradient descent hop out of the valley entirely. On the other hand, if you carefully choose a learning rate at which the loss is still improving, the problem of gradient descent hopping out is avoided, and, since the loss is still improving, the problem of taking forever to converge (or not converging at all) is averted as well.
Finally, I’d infer that choosing a learning rate in this way is largely an experience-based decision.
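
A toy illustration of both failure modes, using plain gradient descent on f(x) = x² (the function and step sizes are made up purely to show the effect):

    def gradient_descent(lr, steps=20, x=5.0):
        # minimise f(x) = x**2, whose gradient is 2*x
        for _ in range(steps):
            x -= lr * 2 * x
        return x

    print(gradient_descent(lr=0.001))   # ~4.8   -> barely moves: far too slow to converge
    print(gradient_descent(lr=0.1))     # ~0.06  -> steadily approaches the minimum at 0
    print(gradient_descent(lr=1.1))     # ~192   -> each step overshoots; the loss blows up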

Hi Rachana, I am using Google Colab and I am getting the same error that you did. Just wanted to ask whether this solved your problem?