Live coding 10

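Not sure how you got into this state, but when running programs in the background from a terminal you can do this:

program_to_run > s.txt 2>&1 &

This sends the standard output (1 = stdout) to a text file and then points the error output (2 = stderr) at the same place, so both can be viewed afterward in the usual ways. The order matters: the redirect to s.txt has to come before 2>&1, and the & that puts the program in the background goes last.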
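If it helps to see the same idea outside the shell, here is a minimal Python sketch of sending stderr to the same file as stdout. The child program and the temporary log location are made up for illustration; only the s.txt file name comes from the post above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical child program: writes one line to stdout and one to stderr.
child_code = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"

# Temporary directory so the sketch does not clobber a real s.txt.
log_path = Path(tempfile.mkdtemp()) / "s.txt"

with open(log_path, "w") as log:
    # stdout goes to the file; stderr=subprocess.STDOUT sends stderr to
    # wherever stdout is going, so both streams end up in the same file.
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code],
        stdout=log,
        stderr=subprocess.STDOUT,
    )
    # Popen returns immediately, like a background job; wait() is only here
    # so the log can be read straight away.
    proc.wait()

captured = log_path.read_text()
print(captured)
```

Both lines land in the file, though not necessarily in the order they were written, because the two streams are buffered differently.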

I don’t think that fixes Radek’s issue, since his messages are being generated by ssh trying to create a tunnel AFAICT.


Yes, yes, yes, I think that is what is happening :slight_smile: I wonder if others also suffer from this? I bet they do!

Thank you for your answer @RogerS49 nonetheless, that stderr redirection is something I know exists, but never got the finer points on how it works! Appreciate you sharing your thoughts!

In the first 6 minutes of this walk thru video a question was posed regarding conditional probabilities.
I would like to suggest to Daniel a Python probabilistic programming language (PPL) package named pyro-ppl.
The package originated at Uber AI, to determine routes etc. in setting up their business.
It has since been open sourced and taken on as a Linux Foundation project, and it is still being updated.
The package extends and builds on Python and torch.distributions and is also influenced by the Edward Python package. With this package you can build other distributions based on your priors and likelihoods.
It works very similarly to TensorFlow Probability but is possibly easier to work with.
Hope this information is of some use.

Pyro Documentation

If we use batchnorm I don’t think gradient accumulation will be mathematically identical, though it still works fairly well, so it's not too much of a problem. I ran into this previously when testing that fp16 + grad accum was working correctly. It is mathematically equivalent with layernorm/instance norm.
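A tiny illustration of the batchnorm point, as a sketch in plain Python rather than PyTorch. The numbers and the batch_norm helper are made up. It only shows that normalizing micro-batches separately gives different activations than normalizing the full batch, which is why the accumulated gradients stop matching.

```python
from statistics import fmean, pvariance

def batch_norm(xs, eps=1e-5):
    """Normalize a batch of scalars with that batch's own mean and variance."""
    mu = fmean(xs)
    var = pvariance(xs, mu)
    return [(x - mu) / (var + eps) ** 0.5 for x in xs]

batch = [1.0, 2.0, 3.0, 10.0]  # hypothetical activations for 4 samples

# One pass over the full batch: statistics come from all 4 samples.
full = batch_norm(batch)

# Two micro-batches of 2, as with gradient accumulation: each micro-batch
# is normalized with its own statistics.
micro = batch_norm(batch[:2]) + batch_norm(batch[2:])

for f, m in zip(full, micro):
    print(f"{f:+.4f}  {m:+.4f}")
```

Layernorm and instance norm compute their statistics per sample, so splitting the batch does not change their output.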

Ah yes, true.

Can someone explain test-time augmentation to me (Learner.tta())?

Is the purpose to effectively increase the size of the validation set by using augmentation, something vaguely akin to k-fold cross validation (but keeping test and validation data separate)?

It outputs a 2-tuple of ([list of tuples of probs], [list of classes]). When it calculates the weighted average, is it of the probabilities, then using the highest mean prob as the prediction? And is the list of classes the labels (as opposed to the predicted class)?

I’m trying to figure out how it feeds into error_rate().

Start by reading about it in the book, then do some experiments in a notebook, and tell us what you find out – if you have any questions along the way, let us know!

(Yes, I could just tell you directly, but you’ll learn way more if you experiment yourself… :smiley: )


This walkthrough was so useful. To make sure I was understanding it I re-created it, but due to issues with kernels dying in my WSL I ended up running this on my Apple M1 (CPU). So I used tiny and just 3 epochs, and 3 runs across image sizes 32, 64 and 128 - then averaged them (weighting the larger images) and ended up above 94% - which surprised me. That would have been about 120 on the leaderboard :). My best is at 50 right now - time to try Paperspace again (then I might take a look at the Metal options with PyTorch if I get brave).

Perhaps I’m mistaken, but in walkthru 13, Jeremy mentions that we can indicate the number of independent inputs via the ImageBlock function. I will try this out, but I assume we could add the variety as an input and change n_inp to 2. Thanks for running the code without explicitly stating n_inp = 1 so we could gain a deeper understanding of the DataBlock function.

In case others run into this - I was getting an error on Paperspace - not a CUDA memory issue but “Could not do one pass in your dataloader, there is something wrong in it. Please see the stack trace below” and the bottom of the trace was “cusolver error: CUSOLVER_STATUS_EXECUTION_FAILED, when calling `cusolverDnSgetrf( handle, m, n, dA, ldda, static_cast<float*>(dataPtr.get()), ipiv, info)`”. I found that reducing the image size in the resize property (not the size) avoided this. Despite dropping down from 480 to 240 I still got a good result on a large swinv2.

Finally solved the problem. It turns out it is not about the walk-thru code/data or anything else. It is all about a PyTorch bug that appears on certain Nvidia drivers.


Had the same problem, intermittently… Reducing from 480 to 360 also addressed it.


Although I just found that increasing could fix it too. I think just different values hit the PyTorch bug - so you have to see what works.

Walkthru 10 detailed note in the form of questions

The best vision models for fine-tuning notebook

00:00 - Questions on tabular data and the fastbook has the answer

Why paddy dataset is interesting


The paddy dataset is similar to ImageNet in terms of shape and size, but ImageNet has no paddy labels

What kind of dataset can do well on fine-tuning a pre-trained model?


Is the dataset (e.g., PETS dataset) very similar to the pre-trained model’s dataset (e.g., ImageNet)?
The more similar it is, the better the dataset can fine-tune the model, because it makes use of much of the pretrained weights

How large is the dataset, especially when it is not similar to ImageNet (e.g., the planet dataset)?
When the datasets are very different, most of the weights from the pretrained model will be useless, so the larger the dataset, the more weights can be trained and the better the model can learn

Experiment to find out the best model for fine-tuning using a similar and large dataset vs a dissimilar and small dataset


If we can find the best model for the PETS dataset and the Planet dataset, then it may be applied to other similar scenarios

Jeremy walks us through how he and Thomas Capelle designed their experiments


Explore the fastai_timm repo

Explore the sweep_planets_lr.yaml from the repo

The Weights & Biases API lets us see our experiment results inside a Jupyter notebook

What does Jeremy use gist for?


How Jeremy uses the WandB API to pull the experiment results into a Jupyter notebook


How to turn a dataframe into a string


StringIO is the key: passing a StringIO buffer to DataFrame.to_csv saves the dataframe into a string rather than a file
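A minimal sketch of the StringIO trick, assuming pandas is installed. The model names and numbers are made up. Only the column names (error_rate, fit_time) and the variable name content_as_string come from these notes.

```python
from io import StringIO

import pandas as pd

# Hypothetical results table, standing in for the sweep results.
df = pd.DataFrame({
    "model_name": ["model_a", "model_b"],
    "error_rate": [0.05, 0.04],
    "fit_time": [20.0, 35.0],
})

# to_csv writes into the in-memory buffer instead of a file on disk.
buf = StringIO()
df.to_csv(buf, index=False)
content_as_string = buf.getvalue()
print(content_as_string)

# The same trick in reverse: read CSV text back into a dataframe.
df_back = pd.read_csv(StringIO(content_as_string))
```

Calling df.to_csv() with no path at all also returns the CSV text directly, so the buffer is one of two ways to get the string.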

How does Jeremy create a gist?


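import ghapi.core as gh
# the class is spelled GhApi; it needs a GitHub token to create gists
g = gh.GhApi()
# content_as_string is the CSV text of the dataframe
gist = g.create_gist('description of the gist', content_as_string, filename='', public=True)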

What does Jeremy use gist for here and generally?

How to score models with data from the gist URL


How to calculate the score for all models based on their error_rate, fit_time, and GPU_mem?

How does Jeremy come up with the score design?

How to sort all the models based on their score and display the top 15 models?

#question How much do fit_time and GPU_mem matter, and when do they matter more?

How to compare models (on error_rate and fit_time) by families


How to find the models with the best error_rate that have better than average GPU mem and fit_time


What is gpu mem and when does it matter?
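One way the "better than average" filter could look, as a sketch in plain Python. The rows are invented and this is not the notebook's code. In the notebook the same thing would more naturally be a pandas filter on the results dataframe.

```python
from statistics import fmean

# Hypothetical sweep results (made-up numbers, for illustration only).
runs = [
    {"model": "a", "error_rate": 0.030, "fit_time": 40.0, "gpu_mem": 6.0},
    {"model": "b", "error_rate": 0.025, "fit_time": 90.0, "gpu_mem": 11.0},
    {"model": "c", "error_rate": 0.035, "fit_time": 30.0, "gpu_mem": 4.0},
    {"model": "d", "error_rate": 0.028, "fit_time": 45.0, "gpu_mem": 5.0},
]

avg_time = fmean(r["fit_time"] for r in runs)
avg_mem = fmean(r["gpu_mem"] for r in runs)

# Keep only models that are cheaper than average on both time and memory...
cheap = [r for r in runs if r["fit_time"] < avg_time and r["gpu_mem"] < avg_mem]

# ...then rank what is left by error rate, best first.
best_first = sorted(cheap, key=lambda r: r["error_rate"])

for r in best_first:
    print(r["model"], r["error_rate"])
```

Here model b has the lowest error rate overall but is dropped because it is slower and hungrier than average, which is the trade-off the question is getting at.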

Which model family is very good at fine-tuning for planet dataset


Why don’t the best model families improve in accuracy when the model size gets larger?


Because small datasets won’t help large models to learn much.

Which model/model families are best to fine-tune on non-ImageNet like dataset such as planets dataset?


What is the fastai way of doing parameter sweeps vs the Google way, for finding general insights or rules?


Can we apply the findings (models, model families, good parameters) to all computer vision classifications? Yes

How many GPUs and for how long does Jeremy run the experiment? 3 GPUs for 12 hours

Why don’t we need to try every possibility on every level?

How did Jeremy pick the range of values of parameters for experiments?

How to find pre-trained models for other datasets?


Google it
Model Zoo
Papers with Code
Hugging Face

Why Jeremy does not prefer to publish to academic journals


Jeremy wants to share knowledge more freely and openly whereas academic journals generally make it difficult.

How Jeremy tries out small models based on the sweep experiment findings


How to set up the code to efficiently build and compare different models?


convnext small in22k

How do the first two models differ in the comparison? With or without squish, but they both use square images for augmentation

Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)

What does Resize((480, 640)) do? It reverses only the 3-4 images that have the opposite aspect ratio, and does nothing to the rest of the images


Why did Jeremy try a rectangular size of (224, 288) for model 3 and (240, 320) for model 4 when augmenting images? And why is model 3 expected to perform better than model 4?


How to find out whether the original image aspect ratio is (480, 640) or (640, 480)?


vit_small_patch16_224 model

Why won’t rectangular approaches be possible for this vit_small model?

How do the 5th and 6th models differ? With or without squish, and as Jeremy said, the squish version generally works better

Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)

#question why still use Resize(480) rather than Resize(640)?

How is the 7th model built based on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)?

What is the logic behind it?

How to build models on swinv2_base_window12_192_22k?


The first time the error rate gets down to <2%

all models must use augmentation image size 192 and Resize(480)

build two models with or without squish using Resize(480, method='squish'), batch=aug_transforms(size=192, min_scale=0.75)

build a third model on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)

Jeremy found it very interesting that this swin large and slow model family works better than previous small and fast model families on even smaller resized images.

How to build models on swin_small_patch4_window7_224?


The first two are with or without squish
Resize(480, method='squish'), batch=aug_transforms(size=224, min_scale=0.75)

The third one is on Resize(640, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros)

Build models on more accurate but slow large pre-trained models



Why use a different seed number or a different set of batches for the experiments in this group of models?

How to avoid running out of GPU memory when running large models


How does gradient accumulation prevent out-of-memory problems?

How does batch size work behind the scenes? Why is it necessary?
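A sketch of the idea behind gradient accumulation, in plain Python instead of fastai/PyTorch. The one-weight model, the data and the learning rate are all made up. The point is that summing the (scaled) gradients of small micro-batches gives the same gradient as one big batch, while only one micro-batch needs to be in memory at a time.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to the single weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data and starting weight.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5

# Gradient from one pass over the full batch of 4.
full_grad = grad_mse(w, xs, ys)

# Same gradient from 2 micro-batches of 2: each micro-batch gradient is
# divided by the number of accumulation steps and added to a running total.
accum_steps = 2
micro = len(xs) // accum_steps
accum_grad = 0.0
for i in range(accum_steps):
    sl = slice(i * micro, (i + 1) * micro)
    accum_grad += grad_mse(w, xs[sl], ys[sl]) / accum_steps

# Only after all micro-batches are processed does the weight get updated.
lr = 0.01
w_new = w - lr * accum_grad

print(full_grad, accum_grad, w_new)
```

This exact match relies on equal-sized micro-batches and on the model treating each sample independently.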


Why should we average probabilities rather than apply a majority vote?
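A made-up example of how the two can disagree, as a sketch in plain Python. The probabilities are invented, and this is one common argument for averaging rather than necessarily the explanation given in the video. A vote throws away how confident each model is, while the average keeps it.

```python
from collections import Counter

classes = ["A", "B"]

# Hypothetical predicted probabilities from three models for one image.
probs = [
    [0.45, 0.55],  # model 1: leans slightly towards B
    [0.45, 0.55],  # model 2: leans slightly towards B
    [0.95, 0.05],  # model 3: very confident it is A
]

def argmax(p):
    """Index of the largest value in p."""
    return max(range(len(p)), key=p.__getitem__)

# Majority vote: each model contributes only its top class.
votes = [argmax(p) for p in probs]
majority = Counter(votes).most_common(1)[0][0]

# Averaging: take the mean probability per class, then pick the top class.
mean_probs = [sum(col) / len(probs) for col in zip(*probs)]
averaged = argmax(mean_probs)

print("majority vote:", classes[majority])
print("averaged probs:", classes[averaged], mean_probs)
```

Two lukewarm votes for B outvote one very confident A, but the averaged probabilities still favour A.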


How does Jeremy set up the ensemble using the models above?


How much time did Jeremy spend on all this work?


Just to make it more confusing, I found that if I rerun the cell a second time, it works fine without dropping the size.

How do you resume training on a saved model after the kernel is shut down?

I found the following link but I think start_epoch is deprecated.

I loaded my model using load_learner. I created a DataLoaders in the usual way, and attached it to the learner using learner.dls = dls.

When I call fit_one_cycle it works, but if I train for one epoch, the accuracy rate is much worse than on the saved model. (I confirmed that inference on the saved model before additional training is fine).

You’ll need to use a much lower LR when you continue training a fine-tuned model.

Would the LR finder (lr_find) be the method to use for that?

Possibly. Although I normally just divide my previous LR by 5 and it works OK.