This post is for topics related to lesson 7 of the course. This lesson is based partly on
chapter 8 of the book.
This is a wiki post - feel free to edit to add links from the lesson or other useful info.
<<< Lesson 6 | Lesson 8 >>>
Links from the lesson
Does that mean that lr_find is based on the batch size set in the DataBlock?
Why do we need gradient accumulation rather than just using a smaller batch size? How do we pick a good batch size?
How should the Learning Rate be changed when using Gradient Accumulation? I saw this on the forum a while back:
lr = lr/(g_acc/bs)
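As a rough sketch of what gradient accumulation actually does (plain PyTorch, with all tensors, sizes, and the `accum_steps` value made up for illustration): the gradients of several small batches are summed before each optimizer step, so one update approximates a single large batch. That is also why the learning-rate scaling question above comes up, since the effective batch size grows by the accumulation factor.

```python
import torch

# Illustrative model, optimizer, and data; names and sizes are arbitrary.
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
accum_steps = 4  # effective batch size = accum_steps * per-batch size

batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

opt.zero_grad()
for i, (xb, yb) in enumerate(batches):
    # Scale the loss so the summed gradients match one big batch.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # gradients accumulate in .grad across iterations
    if (i + 1) % accum_steps == 0:
        opt.step()       # one update per accum_steps small batches
        opt.zero_grad()
```

In fastai the same effect is available as the `GradientAccumulation` callback, so you don't have to write this loop yourself.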
Do you have any recommendations for a GPU that is good value for money at the moment? It seems like the 24 GB of memory on an RTX 3090 Ti isn't really needed if you use techniques like gradient accumulation?
If you can wait a few more months, I’d wait for the new RTX 4000s to come out (hopefully sometime in September or October).
Will GradientAccumulation have any drawbacks?
Another question (it may be a stupid one): I saw something called unified memory. Can we also use that when the batch size is just too large?
Would it make sense to use k-fold cross-validation with the same architecture as a way to ensemble models?
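To make the idea concrete, here is a minimal sketch of k-fold ensembling (the `kfold_indices` helper and the stand-in "models" are hypothetical, not from the lesson): split the indices into k folds, train one model per fold, then average the models' predictions at inference time.

```python
from statistics import mean

def kfold_indices(n, k):
    """Yield (train_idx, valid_idx) for k roughly equal folds."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        # Last fold takes any remainder so every index is covered.
        valid = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        valid_set = set(valid)
        train = [j for j in idx if j not in valid_set]
        yield train, valid

# Stand-ins for the k trained models: each maps an input to a score.
models = [lambda x, b=b: x * 0.1 + b for b in (0.0, 0.1, 0.2)]
ensemble_pred = mean(m(5.0) for m in models)  # average the fold models
```

Each fold's validation set is disjoint from its training set, so every model sees slightly different data, which is where the ensembling benefit comes from.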
If you have a well-functioning but large model, do you think it can make sense to train a smaller model to produce the same final activations as the large model?
It would be great to cover (to some extent) distillation/quantization techniques in Part 2.
Yeah you want to look into knowledge distillation. It’s a very useful and common technique for making models smaller and faster, especially for practical inference applications.
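For reference, the standard distillation loss (following Hinton et al.) blends a soft term, the KL divergence between temperature-softened teacher and student outputs, with the usual cross-entropy on the true labels. The sketch below uses made-up logits and hyperparameters; the function name and the `T`/`alpha` values are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft term: KL between softened student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T  # rescale so gradient magnitude is comparable across T
    # Hard term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 10, requires_grad=True)  # student logits
teacher = torch.randn(8, 10)                      # frozen teacher logits
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
```

The temperature `T` smooths both distributions so the student also learns the teacher's relative rankings of the wrong classes, not just its top prediction.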
Here are some relevant links I have found:
Large-scale machine learning and deep learning models are increasingly common. For instance, GPT-3 is trained on 570 GB of text and consists of 175 billion parameters. However, whilst training large models helps improve state-of-the-art performance,...
In recent years, deep neural networks have been successful in both industry
and academia, especially for computer vision tasks. The great success of deep
learning is mainly due to its scalability to encode large-scale data and to
maneuver billions of...
Instead of splitting the data randomly, would it be better to split it by disease? How would we do something like that?
Do you mean stratified sampling instead of random?
Hmmm - yeah, stratified, but ensuring that the diseases in the training set are not in the test set.
With knowledge distillation, you can make the larger model better as well. Just use an ensemble or a different model as your targets.
Indeed! Or even the same model used as a target for distilling into the same model class works well, known as self-distillation:
More discussion here:
Another useful approach is noisy student training, which combines distillation with pseudo-labeling for even better performance:
But these are all advanced topics.
ah I see what you are saying.
I don’t think this would be wise.
You want the model to get to know all the diseases you care about.
On another note, if we had a plant_id, i.e. some sort of identifier for a single plant, then it would be very wise to split on that, i.e. NOT having the same plants across training and validation.
That would ensure your model doesn't learn what a specific plant looks like rather than what a disease looks like.
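A group-aware split like that is easy to do by hand. Here's a minimal sketch assuming each record carries a `plant_id` field (the records and field names are made up for illustration): pick a subset of plant ids for validation and assign every image of those plants to the validation set.

```python
import random

# Hypothetical records: 20 images of 5 plants with 3 disease labels.
records = [
    {"file": f"img_{i}.jpg", "plant_id": i % 5, "disease": i % 3}
    for i in range(20)
]

random.seed(42)
plant_ids = sorted({r["plant_id"] for r in records})
random.shuffle(plant_ids)
n_valid = max(1, int(0.2 * len(plant_ids)))  # hold out ~20% of plants
valid_ids = set(plant_ids[:n_valid])

# Split by plant, not by image: no plant appears in both sets.
train = [r for r in records if r["plant_id"] not in valid_ids]
valid = [r for r in records if r["plant_id"] in valid_ids]
```

The same idea is what sklearn's `GroupShuffleSplit`/`GroupKFold` implement, with `plant_id` as the group key.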
Regarding F: you'll often see the following import statement at the beginning of PyTorch code:
import torch.nn.functional as F
But of course fastai imports that under the hood, which is why Jeremy can just use F like that…
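For instance, F exposes stateless, functional versions of common operations, so you can call them directly without creating a module object (the tensor here is just an illustrative example):

```python
import torch
import torch.nn.functional as F  # the conventional alias

x = torch.randn(4, 10)           # e.g. a batch of 4 logit vectors
probs = F.softmax(x, dim=-1)     # functional API: no nn.Softmax module needed
```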