Live coding 14

This topic is for discussion of the 14th live coding session.


Links from the walk-thru

What was covered

  • Test Time Augmentation.
  • Unbalanced datasets.
  • Weighted DataLoader
  • Progressive resizing
  • Paper reference management
  • Documentation. Looking at “Docments” - call for contributors
  • (please contribute here)

Video timeline by @jmp and @fmussari

00:00 - Questions
00:05 - About the concept/capability of early stopping
04:00 - Different models, which one to use
05:25 - Gradient Boosting Machine with different model predictions
07:25 - AutoML tools
07:50 - Kaggle winners approaches, ensemble
09:00 - Test Time Augmentation (TTA): why does it improve the score?
11:00 - Training loss vs validation loss
12:30 - Averaging a few augmented versions
13:50 - Unbalanced dataset and augmentation
15:00 - On balancing datasets
15:40 - WeightedDL, Weighted DataLoader
17:55 - Weighted sampling on Diabetic Retinopathy competition
19:40 - Let’s try something…
21:40 - Setting an environment variable when having multiple GPUs
21:55 - Multi target model
23:00 - Debugging
27:04 - Revise transforms to 128x128 and 5 epochs. Fine tune base case.
28:00 - Progressive resizing
29:16 - Fine tuning again but on larger 160x160 images
34:30 - Oops, small bug, restart (without creating a new learner)
37:30 - Re-run second fine-tuning
40:00 - How did you come up with the idea of progressive resizing?
41:00 - Changing things during training
42:30 - On the paper Fixing the train-test resolution discrepancy
44:15 - Fine tuning again but on larger 192x192 images
46:11 - A detour about paper reference management
48:27 - Final fine-tuning 256x192
49:30 - Looking at WeightedDL, WeightedDataLoader
57:08 - Back to the results of fine-tuning 256x192
58:20 - Question leading to look at callbacks
59:18 - About SaveModelCallback
01:00:56 - Contributing, Documentation, and looking at “Docments”
01:03:50 - Final questions: lr_find()
01:04:50 - Final questions: Training for longer, decreasing validation loss, epochs, error rate
01:06:15 - Final questions: Progressive resizing and reinitialization
01:08:00 - Final questions: Resolution independent models, trick to make TIMM resolution independent by changing positional encodings


I did a quick experiment on the “Fixing the train-test resolution discrepancy” paper here; I’ll document my setup below for folks.

The aim of this was a quick validation that it was worth the effort of looking into further, and verifying that it does indeed provide better test results. The setup is extremely straightforward:

  • Dataset: URLs.IMAGEWOOF
  • Model: resnet18, not pretrained
  • Batch Size: 64
  • Metrics: accuracy

I then fit for 5 epochs and took the validation accuracy, which wound up being 18.09%. This serves as my baseline.

From there I changed the augmentation size of the final Resize transform (see snippet in the notebook for the caveat of doing so).

Jeremy, if we want to do something like this to make progressive resizing easier, perhaps Pipeline creation should make a deepcopy of each transform? (See the notebook for what I mean.)
Actually, this is the perfect way to increase the progressive resize shape/size, since it adjusts both train and val.

For my test I changed the final resolution to 448 rather than 320x320, as this is typically what I’d perform for progressive resizing (2x the image size).

This new upscale got 18.6%! Which is an improvement! Phenomenal.

But the final test was whether it was worth doing fit_one_cycle one last time for a single epoch. And the results might (or might not) surprise you.

This final accuracy at the upscaled size was 19.11%, roughly doubling the gain of the upscale-only-at-inference run when both are compared against the baseline (+1.02 vs. +0.51 percentage points).

So, is it worth doing? I’m not entirely sure. Where someone could pick this up is trying to see what happens if you mimic Kaggle, e.g. have a train/val/hidden test set, perform inference on that hidden test set at the very end, and compare the three options tried here again.

It could be worth it if you don’t have a few spare minutes on your GPU quota, but otherwise I wasn’t too impressed with the results :slight_smile:

This is also of course an n of 1, but the difference wasn’t large enough for me to think about trying across 5 runs and averaging, etc.

(I of course invite all criticism and please try this yourself too! It’s a fun little exercise :slight_smile: )


This seems to link to an older video and not today’s.

Oops! Uploading the latest now.


Just discovered a tiny bug in the fastkaggle library. I’m sure it is not a big deal. setup_comp raises an error if the variable holding the competition name is called anything other than comp. See the following error:

Fixing it requires changing the following line in the library file:

After that it works fine with custom variable names:


@zach the accuracy in your baseline (18%) is much lower than what I’ve got executing your code on colab (44%). Could you share what version of fastai you are using?

Here is what the notebook looks like after execution on colab:

The one that comes preinstalled in colab. Will try running it again this morning!

Maybe I missed something in previous videos but is there an explanation about error and loss functions here?

The version on my instance is 2.6.3. As you can see in the gist, changing to a 448px crop size at test time drops the accuracy from 45% to 39%. Fitting one cycle gives an accuracy of 30%. So the claim from the paper is not replicated in my environment.

But even if it were, we might have a few more issues:

  • TTA seems to fix the issue raised in the paper, since the train and test pipelines are then the same.

  • They seem to assume that the apparent pixel sizes of objects don’t vary much in the test set. This may be true for imagenet, but I doubt it is always the case. They make this assumption; otherwise, training a model to recognize objects in more pixel sizes should improve the generalization and test accuracy (as these sets do not have to have the same distributions).

  • They observe an improvement when scaling the test size to about 1.3x the train size (from 224px to 288px); at 2x scale (448px) the performance drops again. The notebook uses the 2x scale.

  • Training for the 5th epoch on a larger crop might make the model learn more no matter the crop, as the model did not plateau. So we can’t claim 1 more epoch is better without trying to control that.
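To make the scale factors in the list above concrete, here is a small arithmetic check. The sizes are the ones quoted in the post; nothing else is assumed.

```python
# Train-time crop size and the two test-time sizes mentioned in the post.
train_size = 224
ratios = {test_size: round(test_size / train_size, 2) for test_size in (288, 448)}
print(ratios)  # {288: 1.29, 448: 2.0}
```

So the "1.3x" setting is really about 1.29x, while the notebook's 448px test size is a full 2x.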

But the paper is quite interesting. Thank you for putting that together; I will try to play with it more.


No problem! Thanks for noting the performance drop, which aligns with the behavior you saw. There may have been some lag, or some weird bug with the states, that caused the notebook to get weird results (an issue on me, oops). I don’t have the time to test it today, but I will trust what you say and that my analysis is flawed!

Interested to hear more about what you find :slight_smile:

Following up on the issues above. Training for longer (20 epochs) gives us a better model where we can observe the improvement in performance when changing to 288. (Fitting one epoch degraded the performance as the LR was too large.)

After lowering the LR and fitting for 5 more epochs on 224px we get better performance than previously after changing to 288; switching that model to 288 again gives better performance.

TTA with 224 is 4x slower but gives almost the same performance as switching to 288px.
TTA with 288 gives the best performance I’ve seen.
Here is the colab notebook with results:


This was briefly touched on towards the end of the previous walk-thru (13): Walkthru 13 - YouTube (~65 min mark)

Not sure if you are looking for more information on them or the difference between the two (error & loss functions)? If so, it was covered in lesson 4: Practical Deep Learning for Coders Lesson 4 - YouTube (~59 min mark)


Sometimes it is not easy to understand that you need a custom error/loss function. At least for me. I need to be familiar with checking the source code. Thanks Ali.

This issue seems to have been fixed by a PR by @n-e-w. It may be because the PyPI version of fastkaggle is outdated. I am wrong, sorry

I’ve tested the paper’s assumption on pre-trained models; in most of the experiments it gives a 10% - 15% improvement. But when a model is pretty good already it can be detrimental, so my guess is that with a more capable model the trick won’t help. Here is a summary of the results in Excel. So I’m not sure it makes sense to implement the full paper.

Yellow shows testing on an image 64px larger.


Hey @kurianbenoy you’re #1 on the Paddy competition! amazing job!!


This should be now fixed, just pip install fastkaggle. Jeremy had released a new version today


Now Kaggle competition grandmaster Psi (ranked 4th globally) is also in the race.


Walkthru 14, a rough detailed note

00:05 Early stopping: why you should not use it, and what to do instead


docs for early stopping

Why does Jeremy not use early stopping, to the point of not even knowing whether this callback exists?

Does early stopping play nicely with one cycle training or fine-tuning?

Does the learning rate have a chance to go down when stopping early?

If the learning rate has not settled down, can early epochs give us better accuracy?
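For the learning-rate questions above, a minimal sketch of a one-cycle style schedule may help. The shape (cosine warm-up, then cosine anneal) and the values `div=25`, `div_final=1e5`, `pct_start=0.25` are my assumptions about how fastai's `fit_one_cycle` behaves by default, and `one_cycle_lr` is a hypothetical helper written for illustration, not a fastai function.

```python
import math

def one_cycle_lr(pct, lr_max=1e-2, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at a fraction `pct` (0..1) of the way through training."""
    def cos_interp(start, end, p):
        # Cosine interpolation: returns `start` at p=0 and `end` at p=1.
        return start + (1 + math.cos(math.pi * (1 - p))) * (end - start) / 2

    if pct < pct_start:
        # Warm-up phase: from lr_max/div up to lr_max.
        return cos_interp(lr_max / div, lr_max, pct / pct_start)
    # Anneal phase: from lr_max down to lr_max/div_final.
    return cos_interp(lr_max, lr_max / div_final, (pct - pct_start) / (1 - pct_start))

for pct in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(f"{pct:.2f} -> {one_cycle_lr(pct):.6f}")
```

Under these assumptions, stopping halfway through leaves the learning rate at about three quarters of its peak, so the weights have not had the low-learning-rate phase in which they settle down.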

What does it mean when you see better accuracy in early epochs during one cycle training? 01:53

What will Jeremy do when he sees better accuracy in early epochs during one cycle training?

How does it help?

02:23 What if the second-to-last epoch gives a better accuracy?

Why can’t we trust the model at the second-to-last epoch to have the best weights of the training run?

Or why should we not use early stopping to get the model from the second-to-last epoch?

hint: significant?

What can be the cause of the better accuracy at the second last epoch?

hint: not settled down

03:17 What’s Radek’s reason for why we should not read much into the better accuracy

hint: validation set, distribution

03:46 How much better does the accuracy need to be to get our attention? And what most likely causes it?

hint: 4-500% better? architecture

03:59 Why can ensembling a lot of neural net models easily overfit and not help much?

hint: neuralnet is flexible enough

Why does Jeremy not recommend ensembling totally different types of models, even though they have a better chance than similar types? 06:34

hint: mostly one type is much better than other types

When may trying these ensembles be justified?

hint: super tiny improvement to win a medal on Kaggle

07:25 Jeremy talked about AutoML tools in Lecture 6

I have not watched it yet

07:50 Why do Kaggle competition winners usually talk about complex ensembling methods?

hint: no more low hanging fruits

If ensembling is the top hanging fruit, then what are the low hanging fruits?

What can beat ensembling?

hint: a novel architecture

08:54 Why can Test Time Augmentation improve accuracy in practice?

hint: see it better when taking different angles

How does the final target get picked from 5 different predictions?

dig in

docs of tta


However, Jeremy did explain how tta works in the previous walkthru 9; check the notes here

10:53 Why can averaging all the predictions from test-time augmentation improve the accuracy?

hint: manage the overconfidence of the model to do no/less harm
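A minimal sketch of the averaging idea, in plain Python. The probabilities are made up for illustration and `tta_predict` is a hypothetical helper; fastai's own `tta` may weight the un-augmented prediction differently from a plain mean, so treat this as the concept rather than the library's exact behaviour.

```python
def tta_predict(prob_sets):
    """Average class probabilities across views and return (best class, averages)."""
    n_views = len(prob_sets)
    n_classes = len(prob_sets[0])
    avg = [sum(view[c] for view in prob_sets) / n_views for c in range(n_classes)]
    best = max(range(n_classes), key=lambda c: avg[c])
    return best, avg

# Five hypothetical predictions for one image: the original view plus four augmented copies.
views = [
    [0.05, 0.90, 0.05],  # one view is very confident about class 1
    [0.60, 0.30, 0.10],
    [0.70, 0.20, 0.10],
    [0.55, 0.35, 0.10],
    [0.65, 0.25, 0.10],
]
label, avg = tta_predict(views)
print(label, [round(p, 2) for p in avg])
```

Here a single overconfident view is outweighed by the other four, so the averaged prediction lands on class 0 with a much more moderate probability.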

How do you know our model is getting overconfident?

hint: training accuracy vs validation accuracy

Is overconfidence acceptable as long as it is not overfitting?

When do we need to manage overconfidence, during training vs during testing?

13:50 When a dataset is unbalanced, should you ever throw away some data from the highly represented classes?

hint: never

Should we over-sample the less represented classes? How do we do it? 15:20

docs on WeightedDL

Why does Jeremy think this over-sampling technique won’t help? 16:19

hint: training vs test data distributions, how highly unbalanced the datasets are
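To illustrate what weighted sampling does, here is a sketch using only the standard library. It is not fastai's WeightedDL; the inverse-frequency weighting and the 90/10 toy dataset are assumptions chosen for the example.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """One weight per item: items from rarer classes get proportionally larger weights."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Toy unbalanced dataset: 90 items of one class, 10 of another.
labels = ["common"] * 90 + ["rare"] * 10
weights = inverse_frequency_weights(labels)

rng = random.Random(0)  # seeded so the run is repeatable
sampled = rng.choices(labels, weights=weights, k=10_000)  # sampling with replacement
share = Counter(sampled)["rare"] / len(sampled)
print(f"rare share after weighted sampling: {share:.2f}")
```

With these weights each class carries the same total weight, so the rare class ends up in roughly half of the draws. Note that nothing is thrown away: the common class is still sampled from in full, and the rare items are simply seen more often. Whether this helps depends on the point in the hint above, i.e. whether the test data is equally unbalanced.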

Tanishq shared his experience of using over-sampling on an unbalanced dataset to improve performance 17:57

20:54 Getting started with Jeremy’s multitask notebook

How to use just one of your local GPUs when you have several?
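A common way to do this is the `CUDA_VISIBLE_DEVICES` environment variable. I am assuming this is the variable used in the session, and the index `"1"` is just an example.

```python
import os

# Set this before CUDA is initialised; the simplest way is to do it
# before importing torch or fastai in the notebook.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only the GPU with index 1 (example value)

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Inside the process, the selected GPU then shows up as device 0, so a second notebook can be pointed at a different index and run alongside the first.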



Get started 22:49


debugging: why the notebook is not working this time 23:36


Reminder: what we have built previously into the multitask model 24:43

25:11 Get our base multitask model running to compare with the base model in kaggle notebook

What does Jeremy’s kaggle base model look like?

27:54 Progressive resizing

When will we usually move on from models with smaller images to ones with larger images? 28:58

33:52 How does Jeremy compare the multi-task model with the base model? Why should the multi-task model be better with 20 epochs than with just 5?

hint: multi-task models design (give more signals) enable more epochs without overfitting

Another interesting approach: Why use the same model and continue to train with larger images? 29:09

hint: 1. fast even with larger paddy images; 2. kind of added data aug as different sized images

30:36 How did Jeremy build a progressive resizing model based on padding and larger images (160 vs. 128 in the previous model)?

continued from 34:30

How to build such progressive resizing model properly


Why does Jeremy want to freeze the pretrained model and train the last layer first?

Does fine_tune call freeze first?


Does progressive mean keep changing image sizes? 36:56

hint: without changing the learner, keep changing image sizes
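The control flow can be sketched without any deep learning library. `ToyLearner` and `get_dls` are hypothetical stand-ins that only record what happens; a real version would rebuild the DataLoaders and train actual weights. The sizes come from the session timeline, and using 5 epochs at every stage is an assumption.

```python
# Stages from the timeline: 128x128, then 160x160, 192x192 and finally 256x192.
SIZES = [(128, 128), (160, 160), (192, 192), (256, 192)]

class ToyLearner:
    """Stand-in for a real learner: records each stage instead of training."""
    def __init__(self):
        self.dls = None
        self.history = []  # persists across stages, like the model weights would

    def fine_tune(self, epochs):
        self.history.append((self.dls["size"], epochs))

def get_dls(size):
    """Stand-in for a helper that builds DataLoaders at the given image size."""
    return {"size": size}

learn = ToyLearner()  # created once, never re-created
for size in SIZES:
    learn.dls = get_dls(size)   # swap in data at the new size
    learn.fine_tune(epochs=5)   # keep training the same learner

base_pixels = SIZES[0][0] * SIZES[0][1]
for (a, b), epochs in learn.history:
    print(f"{a}x{b}: {epochs} epochs, {a * b / base_pixels:.2f}x the pixels of the first stage")
```

The pixel ratios show why the early stages are cheap: the final 256x192 stage processes three times as many pixels per image as the 128x128 stage.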

37:27 Why the accuracy of the first epoch of the first progressive resizing model is not as good as its previous model?

The first epoch accuracy of this progressive resizing model is better than its previous model after its first epoch. Can you guess why?

but it is worse than the previous model after its 5 epochs. Can you guess why?

38:01 A very very brief story of progressive resizing

Invented by Jeremy in a Kaggle competition, and taken a step further by Google in a paper

39:09 Did our first progressive resizing model beat its predecessor?

a lot better: 0.029 vs 0.041


40:04 How did Jeremy invent progressive resizing?

hint: 1. why use large images to train weights when small images can do; 2. inspiration from changing the learning rate during training, why not image size

42:34 Potential comparison experiment: which one is better, models training on 224 and predicting on 360 vs models training first on 360, then on 224, and finally predicting on 360

On the paper Fixing the train-test resolution discrepancy

44:05 Build a second progressive resizing (pr) model with even larger images without the item_tfms of first pr model above


46:10 pinboard and other app to manage papers and ‘oh, shit’ on a finding on long covid

48:25 Build and train on larger and rectangular images with padding transforms


49:34 How to use your local second GPU to work on a second notebook?


50:29 How does WeightedDataLoader work?

doc WeightedDataLoader

How to use WeightedDataLoaders according to its docs and source? 51:00


59:13 How to save the model during training? Why should we not do that either?

About SaveModelCallback

1:00:56 How to document fastai lib with just comments, and why should you do it?

fastcore’s document
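As I understand the docments convention, each parameter gets a trailing comment and the return annotation gets one too. The function below is a made-up example to show the style; it runs as ordinary Python, and fastcore's `docments` tooling can read such comments from the source.

```python
def resize_schedule(
    start:int=128, # side length of the first, smallest training stage
    step:int=32,   # how much the side length grows per stage
    stages:int=3,  # number of training stages
)->list: # image side lengths, one per stage
    "Return the image side length to use at each stage."
    return [start + i * step for i in range(stages)]

print(resize_schedule())  # [128, 160, 192]
```

The comments sit next to the thing they describe, so the documentation is harder to let drift out of date than a separate parameter list in a docstring.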

1:03:47 Why doesn’t Jeremy use lr_find any more?

1:04:49 What would happen and what to do if you want to keep training after 12 epochs?

If the error-rate gets better but validation loss worse, is it overfitting?

no, it is just overconfident
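A small worked example of how this can happen. The numbers are invented, and the setup is simplified to the probability the model assigns to the correct class, counting a prediction as right when that probability is above 0.5.

```python
import math

def accuracy(p_correct):
    """Fraction of items where the correct class gets more than half the probability."""
    return sum(p > 0.5 for p in p_correct) / len(p_correct)

def mean_cross_entropy(p_correct):
    """Average negative log of the probability given to the correct class."""
    return sum(-math.log(p) for p in p_correct) / len(p_correct)

# Probability assigned to the correct class for five validation items.
earlier = [0.6, 0.6, 0.6, 0.4, 0.4]        # cautious model: 3 of 5 right
later = [0.99, 0.99, 0.99, 0.99, 0.001]    # confident model: 4 of 5 right, 1 badly wrong

for name, probs in (("earlier", earlier), ("later", later)):
    print(f"{name}: accuracy={accuracy(probs):.2f} loss={mean_cross_entropy(probs):.3f}")
```

The later model has the better error rate, but its one very confident mistake dominates the average loss, so validation loss goes up while accuracy improves.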

What would happen to your learning rate if you continue to train with fit_one_cycle or fine_tune?

hint: learning rate up and down in cycles, so reduce learning rate by 4-5 times to keep training

What would Jeremy do to the learning rate when he re-runs the code above?

hint: halve lr each time/model

1:06:16 Does any layer of the model get reinitialized when doing progressive resizing due to different image sizes? No

Is the convnext model resolution independent? What does that mean?

it means the model works for any image input resolution

Why can the convnext model be resolution independent? 1:07:21

hint: it’s just a matter of more or less matrix patches
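A rough sketch of the idea in plain Python. I am assuming a total downsampling stride of 32 and a global average pool before the classifier head, which is typical for this kind of convolutional model; the helper names and the 4-channel toy feature maps are made up.

```python
def feature_map_size(height, width, total_stride=32):
    """Spatial size of the final feature map for a given input size."""
    return height // total_stride, width // total_stride

def fake_features(height, width, channels=4):
    """Toy feature map: `channels` grids whose size depends on the input resolution."""
    h, w = feature_map_size(height, width)
    return [[[float(c)] * w for _ in range(h)] for c in range(channels)]

def global_avg_pool(feature_map):
    """Collapse each channel's grid to one number, whatever the grid size."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]

for size in (224, 288):
    grid = feature_map_size(size, size)
    pooled = global_avg_pool(fake_features(size, size))
    print(size, grid, len(pooled))
```

A larger input only means a larger grid of positions; after pooling, the head always receives one value per channel, so no layer has to change shape.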

1:07:49 Is there a trick to turn resolution-dependent models into resolution-independent ones with timm? It would be interesting to see whether this trick also works with progressive resizing

1:08:58 Why does Jeremy not train models on the entire training set (without splitting off a validation set)? And what does Jeremy do about it?

hint: ensembling, all dataset is seen, validating


Found this handy article on handling imbalanced datasets: Demystifying PyTorch’s WeightedRandomSampler by example | by Chris Hughes | Towards Data Science
