Lesson 7 official topic

This post is for topics related to lesson 7 of the course. This lesson is based partly on chapter 8 of the book.

This is a wiki post - feel free to edit to add links from the lesson or other useful info.

<<< Lesson 6 | Lesson 8 >>>

Lesson resources

Links from the lesson

8 Likes

Shouldn’t it be count>=64 if bs=64?

2 Likes

Does that mean that lr_find is based on the batch size set in the DataBlock?

2 Likes

Why do we need gradient accumulation rather than just using a smaller batch size? How do we pick a good batch size?

3 Likes

How should the learning rate be changed when using gradient accumulation? I saw this on the forum a while back:

lr = lr/(g_acc/bs)

2 Likes
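For context, here is a minimal sketch of how gradient accumulation is used in fastai. It uses the Pets dataset and illustrative numbers rather than the lesson's paddy data: with bs=16 and GradientAccumulation(64), the optimizer only steps after 64 samples, so memory use stays at a batch of 16 while the weight updates behave like a batch of 64.

```python
# Minimal sketch, not the lesson's notebook: dataset and numbers are illustrative.
from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=lambda f: f[0].isupper(),   # cat vs dog from the filename
    item_tfms=Resize(224),
    bs=16)                                 # small per-step batch that fits in GPU memory

# GradientAccumulation(64) delays the optimizer step until 64 samples have been seen,
# so bs=16 here behaves like an effective batch size of 64 for the weight updates.
learn = vision_learner(dls, resnet34, metrics=error_rate,
                       cbs=GradientAccumulation(64))
learn.fine_tune(1)
```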

Do you have any recommendations for a GPU that is good value for money at the moment? A 24GB card like the RTX 3090 Ti isn't really needed if you use techniques like gradient accumulation, is it?

(originally posted by @miwojc)

1 Like

If you can wait a few more months, I’d wait for the new RTX 4000s to come out (hopefully sometime in September or October).

3 Likes

Will GradientAccumulation have any drawbacks?

Another question (maybe a silly one): I saw something called unified memory. Can we also use that when the batch size is just too large?

Thanks!

Would trying out k-fold cross-validation with the same architecture make sense for ensembling models?

2 Likes

If you have a well-functioning but large model, do you think it can make sense to train a smaller model to produce the same final activations as the large model?

3 Likes

It would be great to cover (to some extent) distillation/quantization techniques in Part 2.

2 Likes

Yeah you want to look into knowledge distillation. It’s a very useful and common technique for making models smaller and faster, especially for practical inference applications.

Here are some relevant links I have found:

11 Likes
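For reference, here is a minimal sketch of the standard knowledge-distillation loss in plain PyTorch; the temperature T and the weight alpha are illustrative defaults, not values from the lesson or the links above.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.7):
    # Soft part: match the teacher's softened class distribution (scaled by T^2,
    # the usual correction for the temperature's effect on gradient magnitude).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean') * (T * T)
    # Hard part: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# Example shapes: a batch of 8 items over 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels  = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```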

Question:
Instead of splitting the data randomly, would it be better to split it by disease? How would we do something like that?

1 Like

Do you mean stratified sampling instead of random?

1 Like

Hmmm, yeah, stratified, but ensuring that the diseases in the training set are not in the test set.

With knowledge distillation, you can make the larger model better as well. Just use an ensemble or a different model as your targets.

2 Likes

Indeed! Even using the same model as the target for distilling into the same model class works well; this is known as self-distillation:

More discussion here:
https://www.microsoft.com/en-us/research/blog/three-mysteries-in-deep-learning-ensemble-knowledge-distillation-and-self-distillation/

Another useful approach is noisy student training, which combines distillation with pseudo-labeling for even better performance:

But these are all advanced topics :smile:

7 Likes

Ah, I see what you are saying.
I don't think this would be wise.
You want the model to get to know all the diseases you care about.
On another note, if we had a plant_id, i.e. some sort of identifier for a single plant, then it would be very wise to split on that, i.e. NOT having the same plants across training and validation.
That would ensure your model learns the disease rather than what a specific plant looks like.

2 Likes
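A minimal sketch of what splitting on such an identifier could look like, assuming a hypothetical plant_id column (the competition data doesn't actually provide one) and using scikit-learn's GroupShuffleSplit; the resulting index lists could then be fed to a fastai IndexSplitter in a DataBlock.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical dataframe: plant_id is an assumed column, not in the real data.
df = pd.DataFrame({
    'image':    ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg', 'f.jpg'],
    'label':    ['healthy', 'blast', 'healthy', 'blast', 'healthy', 'blast'],
    'plant_id': [1, 1, 2, 2, 3, 4],
})

# Hold out ~20% of the *plants* (not of the images) for validation,
# so no plant appears in both the training and the validation set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, valid_idx = next(splitter.split(df, groups=df['plant_id']))

print(df.iloc[train_idx])   # training rows
print(df.iloc[valid_idx])   # validation rows, disjoint plant_ids
# valid_idx could then be passed to fastai's IndexSplitter inside a DataBlock.
```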

Regarding F, you’ll often see in PyTorch code the following import statement at the beginning:

import torch.nn.functional as F

But of course fastai imports that under the hood, which is why Jeremy can just use F like that…

4 Likes
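As a small illustration (made-up tensors), F holds the stateless, functional counterparts of the nn layers:

```python
import torch
import torch.nn.functional as F   # the same F that fastai's wildcard import provides

x = torch.randn(4, 10)                                   # raw activations for 4 items, 10 classes
probs = F.softmax(x, dim=1)                              # functional version of nn.Softmax
loss  = F.cross_entropy(x, torch.tensor([0, 3, 1, 7]))   # functional version of nn.CrossEntropyLoss
print(probs.shape, loss)
```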