Live coding 10

Start by reading about it in the book, then do some experiments in a notebook, and tell us what you find out – if you have any questions along the way, let us know!

(Yes, I could just tell you directly, but you’ll learn way more if you experiment yourself… :smiley: )

5 Likes

This walkthrough was so useful. To make sure I was understanding it I re-created it, but due to issues with kernels dying in my WSL I ended up running this on my Apple M1 (CPU). So I used tiny and just 3 epochs, and 3 runs across image sizes 32, 64 and 128 - then averaged them (weighting the larger images) and ended up above 94% - which surprised me. That would have been about 120 on the leaderboard :). My best is at 50 right now - time to try Paperspace again (then I might take a look at the Metal options with PyTorch if I get brave).

1 Like

Perhaps I’m mistaken, but in walkthru 13, Jeremy mentions that we can indicate the number of independent inputs via the ImageBlock function. I will try this out, but I assume we could add the variety as an input and change n_inp to 2. Thanks for running the code without explicitly stating n_inp = 1 so we could gain a deeper understanding of the DataBlock function.

In case others run into this - I was getting an error on Paperspace - not a CUDA memory issue but “Could not do one pass in your dataloader, there is something wrong in it. Please see the stack trace below” and the bottom of the trace was “cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf( handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`”. I found that reducing the image size in the resize property (not the size) avoided this. Despite dropping down from 480 to 240 I still got a good result on a large swinv2.

Finally solved the problem. It turns out it is not about the walk-thru code/data or anything else. It is all about a PyTorch bug that appears on certain Nvidia drivers.

2 Likes

Had the same problem, intermittently… Reducing from 480 to 360 also addressed it.

2 Likes

Although I just found that increasing it could fix it too. I think different values just happen to hit the PyTorch bug - so you have to see what works.

1 Like

Walkthru 10 detailed notes in the form of questions

The best vision models for fine-tuning notebook

00:00 - Questions on tabular data, and the fastbook has the answers

Why the paddy dataset is interesting

07:06

The paddy dataset is similar to ImageNet in terms of image shape and size, but ImageNet has no paddy labels

What kind of dataset does well when fine-tuning a pre-trained model?

08:37

Is the dataset (e.g., the PETS dataset) very similar to the pre-trained model’s dataset (e.g., ImageNet)?
The more similar they are, the better fine-tuning works, because the model can make use of more of the pretrained weights

How large is the dataset, especially when it is not similar to the pre-training data (e.g., the planet dataset vs. ImageNet)?
When the datasets are very different, most of the pretrained weights will be of little use, so the larger the dataset, the more weights can be trained and the better the model can learn

Experiment to find the best models for fine-tuning, using a similar and large dataset vs. a dissimilar and small dataset

10:44

If we can find the best models for the PETS dataset and the Planet dataset, then they may be applied to other similar scenarios

Jeremy walks us through how he and Thomas Capelle designed their experiments

11:55

Explore fine_tune.py from the fastai_timm repo

Explore sweep_planets_lr.yaml from the repo

The Weights & Biases API enables us to see our experiment results inside a Jupyter notebook

What does Jeremy use gist for?

14:10

How Jeremy uses the W&B API to work with their experiment results inside a Jupyter notebook

15:00
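
A minimal sketch, assuming the W&B public API and a placeholder entity/project path (not Jeremy's exact code), of pulling the sweep runs into a pandas dataframe:

import wandb
import pandas as pd

api = wandb.Api()
runs = api.runs('your-entity/your-project')  # hypothetical project path

# one row per run, combining logged summary metrics with the run's config
df = pd.DataFrame([{**r.summary._json_dict, **r.config, 'name': r.name} for r in runs])
df.head()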

How to turn a dataframe into a string

17:04

StringIO is the key to making pd.to_csv save the dataframe into a string rather than a file
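
A small sketch of the idea (toy data, not the walkthrough's dataframe):

from io import StringIO
import pandas as pd

df = pd.DataFrame({'model': ['convnext_tiny'], 'error_rate': [0.02]})  # toy data

buf = StringIO()
df.to_csv(buf, index=False)  # write the CSV into the in-memory buffer
csv_str = buf.getvalue()     # the dataframe as a plain CSV string

(Calling df.to_csv() with no path argument also returns the CSV as a string directly.)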

How does Jeremy create a gist?

17:50

import ghapi.core as gh
g = gh.GhApi()  # uses the GITHUB_TOKEN environment variable for authentication
gist = g.create_gist('description of the gist', content_as_string, filename='', public=True)
gist.html_url  # shareable URL of the newly created gist

What does Jeremy use gist for here and generally?

How to score models with data from the gist URL

19:45

How to calculate the score for all models based on their error_rate, fit_time, and GPU_mem?

How does Jeremy come up with the score design?

How to sort all the models based on their score and display the top 15 models?

#question How much do fit_time and GPU_mem matter, and when do they matter more?
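
A sketch of the kind of combined score being discussed; the column names and the weighting constants are assumptions for illustration, not necessarily the notebook's exact formula:

# df is the dataframe of results loaded from the gist CSV
df['score'] = df.error_rate * (df.fit_time + 80) * (df.GPU_mem + 5)  # hypothetical weights
top15 = df.sort_values('score').head(15)                             # lowest (best) scores first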

How to compare models (on error_rate and fit_time) by family

23:02

How to find the best error_rate models that have better-than-average GPU memory use and fit_time

24:13

What is GPU memory use and when does it matter?

Which model family is very good at fine-tuning on the planet dataset?

25:44

Why don’t the best model families improve in accuracy when the model size gets larger?

27:08

Because small datasets won’t help large models to learn much.

Which models/model families are best to fine-tune on non-ImageNet-like datasets such as the planets dataset?

27:37

What is the fastai way of doing parameter sweeping vs. the Google way, for finding out general insights or rules?

28:39

Can we apply the findings (models, model families, good parameters) to all computer vision classification tasks? Yes

How many GPUs did Jeremy use and how long did the experiment run? 3 GPUs for 12 hours

Why don’t we need to try every possibility at every level?

How did Jeremy pick the ranges of parameter values for the experiments?

How to find pre-trained models for other datasets?

32:35

Google
model zoos
Papers with Code
Hugging Face

Why Jeremy prefers not to publish in academic journals

34:34

Jeremy wants to share knowledge more freely and openly whereas academic journals generally make it difficult.

How Jeremy tries out small models based on the sweep experiment findings

37:26

How to set up the code to efficiently build and compare different models?

39:43

convnext_small_in22k

How do the first two models differ in the comparison? With or without squish, but they both use square images for augmentation

`Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)`
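
A minimal sketch of how these two transforms might be wired into a training run; the folder path, learning rate and epoch count are placeholders, not the walkthrough's exact code:

from fastai.vision.all import *

path = Path('train_images')  # hypothetical folder of labelled images
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42,
    item_tfms=Resize(480, method='squish'),               # per-item resize on the CPU
    batch_tfms=aug_transforms(size=224, min_scale=0.75),  # per-batch augmentation on the GPU
)
learn = vision_learner(dls, 'convnext_small_in22k', metrics=error_rate).to_fp16()
learn.fine_tune(12, 0.01)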

What does Resize((480, 640)) do? It reshapes only the 3-4 images which have the opposite aspect ratio, and does nothing to the rest of the images

40:39

Why did Jeremy try a rectangular size of (224, 288) for model 3 and (240, 320) for model 4 when augmenting the images? And why is model 3 expected to perform better than model 4?

41:46

How to find out whether the original image aspect ratio is (480, 640) or (640, 480)?

44:07

vit_small_patch16_224 model

Why won’t rectangular approaches be possible for this vit_small model?

How do the 5th and 6th models differ? With or without squish, and as Jeremy said, the squish version generally works better

`Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)`

#question why still use Resize(480) rather than Resize(640)?

How is the 7th model built based on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)?

What is the logic behind it?

How to build models on swinv2_base_window12_192_22k?

44:57

The first time the error rate drops below 2%

All models must use an augmentation image size of 192 and Resize(480)

Build two models, with and without squish, using `Resize(480, method='squish'), batch=aug_transforms(size=192, min_scale=0.75)`

Build a third model on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)

Jeremy found it very interesting that this large and slow swin model family works better than the previous small and fast model families, even on smaller resized images.

How to build models on swin_small_patch4_window7_224?

45:50

The first two are with or without squish
Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)

The third one is on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)

Build models on more accurate but slower, larger pre-trained models

46:06

convnext_large_in22k

Why use a different seed number, or a different set of batches, for the experiments in this group of models?

How to avoid out-of-GPU-memory problems when running large models

47:03

How does gradient accumulation prevent out-of-memory problems?
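
A sketch using fastai's GradientAccumulation callback (paths and numbers are illustrative, not the walkthrough's exact code): gradients are summed over several small physical batches, and the optimiser only steps once the target number of samples has been seen, so the update behaves like a large batch while far fewer images sit in GPU memory at once.

from fastai.vision.all import *

path = Path('train_images')  # hypothetical folder of labelled images
accum = 4                    # split each logical batch of 64 into 4 physical batches of 16
dls = ImageDataLoaders.from_folder(
    path, valid_pct=0.2, seed=42, bs=64 // accum,
    item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=224, min_scale=0.75),
)
learn = vision_learner(dls, 'convnext_large_in22k', metrics=error_rate,
                       cbs=GradientAccumulation(64)).to_fp16()  # step the optimiser every 64 samples
learn.fine_tune(12, 0.01)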

How does batch size work behind the scenes? Why is it necessary?

49:54

Why should we not apply majority voting, but average the probabilities instead?

52:50

How does Jeremy set up an ensemble using the models above?

53:15
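
A sketch of ensembling by averaging probabilities; `learners` (the trained Learners above) and `tst_dl` (a test DataLoader over the same items) are assumed to already exist:

import torch

all_preds = [learn.tta(dl=tst_dl)[0] for learn in learners]  # per-model predicted probabilities
avg_preds = torch.stack(all_preds).mean(0)                   # average the probabilities across models
final_idxs = avg_preds.argmax(dim=1)                         # one predicted class per test image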

How much time did Jeremy spend on all this work?

53:56

4 Likes

Just to make it more confusing, I found that if I rerun the cell a second time, it works fine without dropping the size.

1 Like

How do you resume training on a saved model after the kernel is shut down?

I found the following link but I think start_epoch is deprecated.

I loaded my model using load_learner. I created a DataLoaders in the usual way, and attached it to the learner using learner.dls = dls.

When I call fit_one_cycle it works, but if I train for one epoch, the accuracy rate is much worse than on the saved model. (I confirmed that inference on the saved model before additional training is fine).
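
For reference, a minimal sketch of the resume procedure described here; the export path, image folder and learning rate are placeholders:

from fastai.vision.all import *

learn = load_learner('export.pkl', cpu=False)  # load the saved Learner onto the GPU
dls = ImageDataLoaders.from_folder(
    Path('train_images'), valid_pct=0.2, seed=42,
    item_tfms=Resize(480, method='squish'),
    batch_tfms=aug_transforms(size=224, min_scale=0.75),
)
learn.dls = dls               # re-attach DataLoaders, which are not saved in the export
learn.fit_one_cycle(1, 1e-4)  # continue training, typically with a much lower LR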

You’ll need to use a much lower LR when you continue training a fine-tuned model.

1 Like

Would lr_finder be the method to use for that?

Possibly. Although I normally just divide my previous LR by 5 and it works OK.

4 Likes

Yes, I have seen the same. Looking at the bug mentioned in the other thread, it looks like maybe a rounding/precision error - so if things get loaded into memory differently, maybe things go OK.

1 Like

For anyone who finds this thread with a search:

One other thing that took me some time to work out is that if you are loading a model to resume training (or even do inference on large amounts of data), the load_learner function loads to the CPU by default, so everything is very slow. If you plan to do additional training, set a flag in the load_learner like this:

load_learner('path/to/file', cpu=False)

This puts it on the GPU. Took me some time to figure out what was going on and how to fix it.

tags for searches: learner is slow, put learner on GPU

2 Likes

FYI, I tried it both ways and your “divide original LR by 5” way worked better.

2 Likes

So I did my homework and I think I understand tta and its rationale

I presume that if you use tta on your test set as the validation of your model, then you would also use tta for inference if the model is put into production?
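
A minimal sketch of TTA at inference, assuming a trained `learn` and a DataLoader `tst_dl` built from the production/test items:

preds, _ = learn.tta(dl=tst_dl)  # average predictions over several augmented views of each item
classes = preds.argmax(dim=1)    # final predicted class per item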

3 Likes

Yeah otherwise you wouldn’t get similar results in production.

3 Likes

I copied and ran Jeremy’s code twice using the convnext_small_in22k model and factory method from_folder. The overall result on the validation set using tta was the same, but each epoch (and the final accuracy and validation loss) were slightly different from their corresponding epochs in the other experiment. I used the same seed for each experiment. (And obviously the same hyperparameters.)

There must be some randomness, but I thought that using the same seed was supposed to eliminate this. What is the source of randomness? Does this have any implications for creating models (e.g., you may get slightly different results if you rerun, so run more than once)?

(I also ran using the DataBlock API and got a slightly different answer.)
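
One possibly relevant detail, as a hedged note rather than a definitive answer: fastai's set_seed seeds Python, NumPy and PyTorch, and reproducible=True also requests deterministic cuDNN kernels, but some GPU operations and DataLoader worker behaviour can still vary between runs and between APIs.

from fastai.vision.all import *

set_seed(42, reproducible=True)  # seed the RNGs and request deterministic cuDNN kernels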

I encountered a problem running this notebook:
course22/10-scaling-up-road-to-the-top-part-3.ipynb at master · fastai/course22 · GitHub

To fix this, I modified the following, which is in a pull request right now. Please take a look at the pull request. Thanks!

2 Likes