This post is for topics related to lesson 7 of the course. This lesson is based partly on
chapter 8 of the book.
This is a wiki post - feel free to edit to add links from the lesson or other useful info.
<<< Lesson 6 | Lesson 8 >>>
Links from the lesson
Does that mean that lr_find is based on the batch size set in the DataBlock?
Why do we need gradient accumulation rather than just using a smaller batch size? How do we pick a good batch size?
How should the Learning Rate be changed when using Gradient Accumulation? I saw this on the forum a while back:
lr = lr/(g_acc/bs)
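As a rough sketch of what gradient accumulation actually does (plain PyTorch, with all tensors, sizes, and the `accum_steps` value made up for illustration): the gradients of several small batches are summed before each optimizer step, so one update approximates a single large batch. That is also why the learning-rate scaling question above comes up, since the effective batch size grows by the accumulation factor.

```python
import torch

# Illustrative model, optimizer, and data; names and sizes are arbitrary.
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
accum_steps = 4  # effective batch size = accum_steps * per-batch size

batches = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

opt.zero_grad()
for i, (xb, yb) in enumerate(batches):
    # Scale the loss so the summed gradients match one big batch.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()  # gradients accumulate in .grad across iterations
    if (i + 1) % accum_steps == 0:
        opt.step()       # one update per accum_steps small batches
        opt.zero_grad()
```

In fastai the same effect is available as the `GradientAccumulation` callback, so you don't have to write this loop yourself.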
Do you have any recommendations for a GPU that is good value for money at the moment? It seems like the 24 GB of memory on an RTX 3090 Ti isn't really needed if you use techniques like gradient accumulation?
If you can wait a few more months, I’d wait for the new RTX 4000s to come out (hopefully sometime in September or October).
Will GradientAccumulation have any drawbacks?
Another question (it may be a stupid one): I saw something called unified memory. Can we also use that when the batch size is just too large?
Would it make sense to use k-fold cross-validation with the same architecture as a way to ensemble models?
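To make the idea concrete, here is a minimal sketch of k-fold ensembling (the `kfold_indices` helper and the stand-in "models" are hypothetical, not from the lesson): split the indices into k folds, train one model per fold, then average the models' predictions at inference time.

```python
from statistics import mean

def kfold_indices(n, k):
    """Yield (train_idx, valid_idx) for k roughly equal folds."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        # Last fold takes any remainder so every index is covered.
        valid = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        valid_set = set(valid)
        train = [j for j in idx if j not in valid_set]
        yield train, valid

# Stand-ins for the k trained models: each maps an input to a score.
models = [lambda x, b=b: x * 0.1 + b for b in (0.0, 0.1, 0.2)]
ensemble_pred = mean(m(5.0) for m in models)  # average the fold models
```

Each fold's validation set is disjoint from its training set, so every model sees slightly different data, which is where the ensembling benefit comes from.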
If you have a well-functioning but large model, do you think it can make sense to train a smaller model to produce the same final activations as the large model?
It would be great to cover (to some extent) distillation/quantization techniques in Part 2.
Yeah you want to look into knowledge distillation. It’s a very useful and common technique for making models smaller and faster, especially for practical inference applications.
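For reference, the standard distillation loss (following Hinton et al.) blends a soft term, the KL divergence between temperature-softened teacher and student outputs, with the usual cross-entropy on the true labels. The sketch below uses made-up logits and hyperparameters; the function name and the `T`/`alpha` values are illustrative, not from any particular library.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft term: KL between softened student and teacher distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T  # rescale so gradient magnitude is comparable across T
    # Hard term: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 10, requires_grad=True)  # student logits
teacher = torch.randn(8, 10)                      # frozen teacher logits
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
```

The temperature `T` smooths both distributions so the student also learns the teacher's relative rankings of the wrong classes, not just its top prediction.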
Here are some relevant links I have found:
Large-scale machine learning and deep learning models are increasingly common. For instance, GPT-3 is trained on 570 GB of text and consists of 175 billion parameters. However, whilst training large models helps improve state-of-the-art performance,...
In recent years, deep neural networks have been successful in both industry
and academia, especially for computer vision tasks. The great success of deep
learning is mainly due to its scalability to encode large-scale data and to
maneuver billions of...
Instead of splitting the data randomly, would it be better to split it by disease? How would we do something like that?
Do you mean stratified sampling instead of random?
Hmmm - yeah, stratified, but ensuring that the diseases in the training set are not in the test set.
With knowledge distillation, you can make the larger model better as well. Just use an ensemble or a different model as your targets.
Indeed! Or even the same model used as a target for distilling into the same model class works well, known as self-distillation:
More discussion here:
Another useful approach is noisy student training, which combines distillation with pseudo-labeling for even better performance:
But these are all advanced topics.
ah I see what you are saying.
I don’t think this would be wise.
You want the model to get to know all the diseases you care about.
On another note, if we had a plant_id, i.e. some sort of identifier for a single plant, then it would be very wise to split on that, i.e. NOT having the same plants across training and validation.
That would ensure your model doesn't learn what a specific plant looks like rather than what a disease looks like.
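A group-aware split like that is easy to do by hand. Here's a minimal sketch assuming each record carries a `plant_id` field (the records and field names are made up for illustration): pick a subset of plant ids for validation and assign every image of those plants to the validation set.

```python
import random

# Hypothetical records: 20 images of 5 plants with 3 disease labels.
records = [
    {"file": f"img_{i}.jpg", "plant_id": i % 5, "disease": i % 3}
    for i in range(20)
]

random.seed(42)
plant_ids = sorted({r["plant_id"] for r in records})
random.shuffle(plant_ids)
n_valid = max(1, int(0.2 * len(plant_ids)))  # hold out ~20% of plants
valid_ids = set(plant_ids[:n_valid])

# Split by plant, not by image: no plant appears in both sets.
train = [r for r in records if r["plant_id"] not in valid_ids]
valid = [r for r in records if r["plant_id"] in valid_ids]
```

The same idea is what sklearn's `GroupShuffleSplit`/`GroupKFold` implement, with `plant_id` as the group key.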
Regarding F: you'll often see the following import statement at the beginning of PyTorch code:
import torch.nn.functional as F
But of course fastai imports that under the hood, which is why Jeremy can just use F like that…
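For instance, F exposes stateless, functional versions of common operations, so you can call them directly without creating a module object (the tensor here is just an illustrative example):

```python
import torch
import torch.nn.functional as F  # the conventional alias

x = torch.randn(4, 10)           # e.g. a batch of 4 logit vectors
probs = F.softmax(x, dim=-1)     # functional API: no nn.Softmax module needed
```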