Ah yes, true.
Can someone explain test-time augmentation to me (Learner.tta())?
Is the purpose to effectively increase the size of the validation set by using augmentation, something vaguely akin to k-fold cross validation (but keeping test and validation data separate)?
It outputs a 2-tuple of ([list of tuples of probs], [list of classes]). When it calculates the weighted average, is it of the probabilities, then using the highest mean prob as the prediction? And is the list of classes the labels (as opposed to the predicted class)?
I’m trying to figure out how it feeds into error_rate().
Start by reading about it in the book, then do some experiments in a notebook, and tell us what you find out – if you have any questions along the way, let us know!
(Yes, I could just tell you directly, but you’ll learn way more if you experiment yourself… )
This walkthrough was so useful. To make sure I understood it, I re-created it, but because kernels kept dying in my WSL setup I ended up running it on my Apple M1 (CPU). So I used tiny and just 3 epochs, with 3 runs at image sizes 32, 64 and 128, then averaged them (weighting the larger images more) and ended up above 94%, which surprised me. That would have been about 120 on the leaderboard :). My best is at 50 right now, so it's time to try Paperspace again (then I might take a look at the Metal options with PyTorch if I get brave).
Perhaps I’m mistaken, but in walkthru 13, Jeremy mentions that we can indicate the number of independent inputs via the ImageBlock function. I will try this out, but I assume we could add the variety as input and change the n_inp to 2. Thanks for running the code without explicitly stating n_inp = 1 so we could gain a deeper understanding of the DataBlock function.
In case others run into this: I was getting an error on Paperspace, not a CUDA memory issue but “Could not do one pass in your dataloader, there is something wrong in it. Please see the stack trace below”, and the bottom of the trace was “cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf( handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`”. I found that reducing the image size in the resize property (not the size) avoided this. Despite dropping down from 480 to 240 I still got a good result on a large swinv2.
Finally solved the problem. It turns out it is not about the walk-thru code/data or anything else. It is all about a PyTorch bug that appears on certain Nvidia drivers.
Had the same problem, intermittently… Reducing from 480 to 360 also addressed it.
Although I just found that increasing could fix it too. I think just different values hit the PyTorch bug - so you have to see what works.
Walkthru 10 detailed note in the form of questions
The best vision models for fine-tuning notebook
00:00 - Questions on tabular data and the fastbook has the answer
Why paddy dataset is interesting
The paddy dataset is similar to ImageNet in terms of image shape and size, but ImageNet has no paddy labels
What kind of dataset can do well on fine-tuning a pre-trained model?
Is the dataset (e.g., PETS dataset) very similar to the pre-trained model’s dataset (e.g., ImageNet)?
The more similar it is, the better fine-tuning works, because more of the pretrained weights can be reused
How large is the dataset, especially when it is not similar to ImageNet (e.g., the Planet dataset)?
When the datasets are very different, most of the weights from the pretrained model will be useless, so the larger the dataset, the more weights can be retrained and the better the model can learn
Experiment to find the best model for fine-tuning on a similar, large dataset vs a dissimilar, small dataset
If we can find the best models for the PETS dataset and the Planet dataset, the findings may apply to other similar scenarios
Jeremy walks us through how he and Thomas Capelle designed their experiments
Explore fine_tune.py from the fastai_timm repo
Explore sweep_planets_lr.yaml from the repo
The Weights & Biases API lets us see our experiment results inside a Jupyter notebook
What does Jeremy use gist for?
How Jeremy uses the WandB API to bring the experiment results into a Jupyter notebook
How to turn a dataframe into a string
StringIO is the key: passing one to df.to_csv saves the dataframe into a string rather than a file
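A minimal sketch of the StringIO trick, assuming a small made-up results table (the column names and numbers are placeholders, not the sweep results). The buffer stands in for a file, so to_csv writes CSV text into memory and getvalue() returns it as a string:

```python
from io import StringIO

import pandas as pd

# Made-up results table, for illustration only.
df = pd.DataFrame({"model": ["a", "b"], "error_rate": [0.05, 0.03]})

buf = StringIO()               # in-memory text buffer that behaves like a file
df.to_csv(buf, index=False)    # write CSV text into the buffer, not to disk
content_as_string = buf.getvalue()

# Round trip: parse the string back into a dataframe.
df_back = pd.read_csv(StringIO(content_as_string))
```

The same pattern works in reverse for text fetched from a URL: wrap it in StringIO and hand it to pd.read_csv.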
How does Jeremy create a gist?
import ghapi.core as gh
g = gh.GhApi()
gist = g.create_gist('description of the gist', content_as_string, filename='', public=True)
gist.html_url
What does Jeremy use gist for here and generally?
How to score models with data from the gist URL
How to calculate the score for all models based on their error_rate, fit_time, and GPU_mem?
How does Jeremy come up with the score design?
How to sort all the models based on their score and display the top 15 models?
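These notes do not record the actual formula, so the score below is only an assumed example of the idea: lower is better, and the error rate is multiplied by fit time plus a hypothetical offset so that very fast but inaccurate models do not win on speed alone. The data is made up. It also sketches the "better than average GPU memory and fit time" filter mentioned further down.

```python
import pandas as pd

# Made-up numbers for illustration only; not results from the real sweep.
df = pd.DataFrame({
    "model_name": ["small_a", "small_b", "large_a", "large_b"],
    "error_rate": [0.060, 0.050, 0.040, 0.045],
    "fit_time":   [40.0, 60.0, 200.0, 120.0],   # seconds
    "GPU_mem":    [2.0, 3.0, 10.0, 6.0],        # GB
})

# Hypothetical score design (an assumption): lower is better.
TIME_OFFSET = 80
df["score"] = df["error_rate"] * (df["fit_time"] + TIME_OFFSET)

# Sort by score and keep the 15 best rows (here there are only 4).
top = df.sort_values("score").head(15)

# Models that beat the average on both GPU memory and fit time,
# ranked by error rate.
cheap = df[(df["GPU_mem"] < df["GPU_mem"].mean())
           & (df["fit_time"] < df["fit_time"].mean())]
best_cheap = cheap.sort_values("error_rate")
```

How much weight to give fit_time and GPU_mem depends on the use case, which is why the offset is a tunable constant here.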
#question How much do fit_time and GPU_mem matter, and when do they matter more?
How to compare models (on error_rate and fit_time) by families
How to find the models with the best error_rate among those with better-than-average GPU memory use and fit_time
What is gpu mem and when does it matter?
Which model family is very good at fine-tuning on the Planet dataset?
Why don’t the best model families improve in accuracy as model size gets larger?
Because small datasets won’t help large models to learn much.
Which models/model families are best to fine-tune on non-ImageNet-like datasets such as the Planet dataset?
What is the fastai way of doing parameter sweeps vs the Google way, for finding general insights or rules?
Can we apply the findings (models, model families, good parameters) to all computer vision classifications? Yes
How many GPUs and for how long does Jeremy run the experiment? 3 GPUs for 12 hours
Why don’t we need to try every possibility at every level?
How did Jeremy pick the range of values of parameters for experiments?
How to find pre-trained models for other datasets?
Search Google
Model zoos
Papers with Code
Hugging Face
Why Jeremy prefers not to publish in academic journals
Jeremy wants to share knowledge more freely and openly, whereas academic journals generally make that difficult.
How Jeremy tries out small models based on the sweep experiment findings
How to set up the code to build and compare different models efficiently?
convnext small in22k
How do the first two models differ in the comparison? With or without squish, but they both use square images for augmentation
`Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)`
What does Resize((480, 640)) do? It reverses only the 3-4 images that have the opposite aspect ratio, and does nothing to the rest of the images
Why did Jeremy try rectangular sizes, (224, 288) for model 3 and (240, 320) for model 4, when augmenting images? And why is model 3 expected to perform better than model 4?
How to find out whether the original image aspect ratio is (480, 640) or (640, 480)?
vit_small_patch16_224 model
Why won’t rectangular approaches be possible for this vit_small model?
How do the 5th and 6th models differ? With or without squish; as Jeremy said, the squish version generally works better
`Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)`
#question Why still use Resize(480) rather than Resize(640)?
How is the 7th model built based on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)?
What is the logic behind it?
How to build models on swinv2_base_window12_192_22k?
The first time the error rate gets down to <2%
All models must use augmentation image size 192 and Resize(480)
Build two models, with or without squish, using `Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)`
Build a third model on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)
Jeremy found it very interesting that this large, slow swin model family works better than the previous small, fast model families, even on smaller resized images.
How to build models on swin_small_patch4_window7_224?
The first two are with or without squish: Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)
The third one is on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)
Build models on more accurate but slow large pre-trained models
convnext_large_in22k
Why use a different seed number, or a different set of batches, for the experiments in this group of models?
How to avoid running out of GPU memory when training large models?
How does gradient accumulation prevent out-of-memory problems?
How does batch size work behind the scenes? Why is it necessary?
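A rough sketch of why gradient accumulation helps, using plain NumPy and a toy linear model rather than the walkthrough's code (all names and data here are made up). Memory use grows with the number of samples processed at once, so the idea is to process several small batches, add up their gradients, and only then take one optimizer step. For a mean loss and equal-sized chunks, the accumulated gradient matches the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))   # one "full" batch of 64 samples
y = rng.normal(size=64)
w = rng.normal(size=3)         # current weights of a toy linear model

def mse_grad(Xb, yb, w):
    """Gradient of the mean squared error over the rows of Xb."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

# Gradient computed on all 64 samples at once (needs the most memory).
full_grad = mse_grad(X, y, w)

# Same gradient built from 4 micro-batches of 16 samples each.
accum = 4
grad_sum = np.zeros_like(w)
for Xb, yb in zip(np.split(X, accum), np.split(y, accum)):
    grad_sum += mse_grad(Xb, yb, w)   # only 16 samples processed at a time
accum_grad = grad_sum / accum         # average, then do a single update
```

fastai provides a GradientAccumulation callback for this; the sketch above does not use it and ignores layers such as batch norm, whose statistics do depend on the actual batch size.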
Why should we average probabilities rather than apply a majority vote?
How does Jeremy set up the ensemble using the models above?
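A small illustration of the difference, with hypothetical probabilities from three models for a single image (not outputs from the models above). A majority vote throws away how confident each model is, while averaging keeps that information:

```python
import numpy as np

# Hypothetical class probabilities from three models for one image
# (classes 0, 1, 2). Made up for illustration.
probs = np.array([
    [0.48, 0.52, 0.00],   # model A: barely prefers class 1
    [0.45, 0.55, 0.00],   # model B: barely prefers class 1
    [0.95, 0.05, 0.00],   # model C: very confident in class 0
])

# Majority vote: each model casts one hard vote.
votes = probs.argmax(axis=1)
majority = int(np.bincount(votes, minlength=3).argmax())

# Averaging: combine the probabilities first, then pick the best class.
mean_probs = probs.mean(axis=0)
averaged = int(mean_probs.argmax())
```

Here the vote picks class 1 because two lukewarm models outvote one confident model, whereas the average picks class 0.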
How much time did Jeremy spend on all this work?
53:56
Just to make it more confusing, I found that if I rerun the cell a second time, it works fine without dropping the size.
How do you resume training on a saved model after the kernel is shut down?
I found the following link, but I think start_epoch is deprecated.
I loaded my model using load_learner. I created a DataLoaders in the usual way and attached it to the learner using learner.dls = dls.
When I call fit_one_cycle it works, but if I train for one epoch, the accuracy is much worse than that of the saved model. (I confirmed that inference on the saved model before additional training is fine.)
You’ll need to use a much lower LR when you continue training a fine-tuned model.
Would lr_finder be the method to use for that?
Possibly. Although I normally just divide my previous LR by 5 and it works OK.
Yes, I have seen the same. Looking at the bug mentioned in the other thread, it looks maybe like a rounding/precision error - so if things get loaded into memory differently maybe things go ok.
For anyone who finds this thread with a search:
One other thing that took me some time to work out is that if you are loading a model to resume training (or even do inference on large amounts of data), the load_learner function loads to the CPU by default, so everything is very slow. If you plan to do additional training, set a flag in load_learner like this:
load_learner('path/to/file', cpu=False)
This puts it on the GPU. Took me some time to figure out what was going on and how to fix it.
tags for searches: learner is slow, put learner on GPU
FYI, I tried it both ways and your “divide original LR by 5” way worked better.
So I did my homework and I think I understand tta and its rationale
I presume that if you use tta on your test set as the validation of your model, then you would also use tta for inference if the model is put into production?
Yeah otherwise you wouldn’t get similar results in production.
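For anyone landing here from the original question, a minimal NumPy sketch of the idea as I understand it, not fastai's actual code: the probabilities from several augmented passes are averaged, optionally blended with the un-augmented prediction, and the class with the highest averaged probability is the prediction. The second element returned by tta is the targets (labels), so an error rate can be computed from the pair. The function names, the beta weight, and the numbers below are assumptions for illustration.

```python
import numpy as np

def tta_average(plain_probs, aug_probs, beta=0.25):
    """Blend un-augmented and augmented predictions.

    plain_probs: (n_items, n_classes) from the un-augmented images
    aug_probs:   (n_aug, n_items, n_classes), one slice per augmented pass
    beta:        assumed weight given to the un-augmented prediction
    """
    aug_mean = aug_probs.mean(axis=0)            # average over augmented passes
    return beta * plain_probs + (1 - beta) * aug_mean

def error_rate(probs, targets):
    """Fraction of items whose highest-probability class is wrong."""
    return float((probs.argmax(axis=1) != targets).mean())

# Made-up predictions for 2 items and 2 classes.
plain = np.array([[0.60, 0.40], [0.55, 0.45]])
aug = np.array([
    [[0.70, 0.30], [0.20, 0.80]],   # augmented pass 1
    [[0.50, 0.50], [0.30, 0.70]],   # augmented pass 2
])
targets = np.array([0, 1])

blended = tta_average(plain, aug)
plain_err = error_rate(plain, targets)
tta_err = error_rate(blended, targets)
```

In this toy case the second item is misclassified without augmentation but correct after averaging, which is the kind of improvement TTA is meant to give.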