Lesson 7 official topic

A few queries re: cell 5 of the Kaggle notebook “Scaling Up: Road to the Top, Part 3”

  1. It seems strange that only one leg of the if/else does a return.

  2. I remember from transcribing there was a question about whether count>64 should be count>=64.

  3. While testing the GPU memory required by larger models: when a train() call succeeds, i.e. fine-tuning completes without error,
    then gc.collect() and torch.cuda.empty_cache() work fine,
    i.e. subsequently running gpu_report() says: 15039.000 MB GPU memory
    and then running gpu_report() again says: 1541.000 MB GPU memory
    But when a “CUDA out of memory” error occurs, running gpu_report() does not release the memory, and I can’t find a way to release it without restarting the kernel. I thought maybe the traceback context might be holding some reference preventing GC, but deleting the cell so the pink output cell disappears doesn’t help. Any hints on how to release memory after an error, without a kernel restart?
    p.s. From here I got a small script to print memory-resident tensors (with the counts done in Excel), but I don’t know enough to analyse it.

Resident Tensor Count
<class ‘torch.Tensor’> torch.Size([1024]) 449
<class ‘torch.nn.parameter.Parameter’> torch.Size([1024]) 149
<class ‘torch.Tensor’> torch.Size([32, 197, 1024]) 73
<class ‘torch.Tensor’> torch.Size([3072, 1024]) 72
<class ‘torch.Tensor’> torch.Size([4096, 1024]) 72
<class ‘torch.Tensor’> torch.Size([3072]) 72
<class ‘torch.Tensor’> torch.Size([1024, 4096]) 72
<class ‘torch.Tensor’> torch.Size([1024, 1024]) 72
<class ‘torch.Tensor’> torch.Size([4096]) 72
<class ‘torch.Tensor’> torch.Size([32, 197, 4096]) 46
<class ‘torch.Tensor’> torch.Size([32, 16, 197, 197]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([4096]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([3072, 1024]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([1024, 4096]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([3072]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([1024, 1024]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([4096, 1024]) 24
<class ‘torch.Tensor’> torch.Size([]) 9
<class ‘torch.Tensor’> torch.Size([512]) 8

A bit of progress. Based on…


I noticed that, in a notebook, it happens more frequently if the exception is not handled within the function but by the notebook environment.

I added this exception handling around fine_tune()…

def train(arch, size, item=Resize(480, method='squish'), accum=1, finetune=True, epochs=12):
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75), bs=64//accum)
    cbs = GradientAccumulation(64) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    try:
        if finetune:
            learn.fine_tune(epochs, 0.01)
            return learn.tta(dl=dls.test_dl(tst_files))
        else:
            learn.fit_one_cycle(epochs, 0.01)
    except Exception as e:
        print(e)

which then allows half of the memory to be cleared. Below shows memory usage when using the GPU for the first time after a kernel restart, and then after a CUDA error…

Curiously, comparing the first and third calls to gpu_report() indicates there is still a process running, and nvidia-smi run at similar times shows…

Interestingly, the following indicates the GPU memory is held by a defunct process (i.e. there is no current process 13656)…

$ pip install py3nvml
$ py3smi


btw, this is on Paperspace.

  1. Have a look at how the function is used to see why.
  2. It should be >=, I believe
1 Like

okay, finetune=True is used for train() only in the final run, where a return value is required, in this line…
tta_res.append(train(arch, size, item=item, accum=8))

All other uses of train() are the CUDA sizing experiments of this form…
train('convnext_small_in22k', 128, epochs=1, accum=2, finetune=False)
where the None return value is thrown away.

1 Like

00:00 We have explored the simplest neural net with fully connected linear layers in earlier lectures. In this lecture we focus on tweaking the first and last layers; in the next few weeks, on tweaking the middle part of the neural net.

01:04 Review of the notebook Road to the Top, Part 2, and congrats to the fastai students who beat Jeremy into 1st and 2nd place

02:47 What are the benefits of using larger models? What are the problems of larger models? (They use up GPU memory, as a GPU is not as clever as a CPU at finding ways to free memory, so a large model needs a very expensive GPU.) What can we do when the GPU runs out of memory? First, restart the notebook; then Jeremy is about to show us a trick to enable us to train extra-large models on Kaggle. Wow!

04:39 How big is Kaggle’s GPU? Do you sometimes have to run notebooks on Kaggle, for example in code competitions? Why is it good and fair to use a Kaggle notebook to win the leaderboard?

05:58 How did Jeremy use a 24GB GPU to find out what a 16GB GPU can do? How did Jeremy find out how much GPU memory a model will use? How did Jeremy choose the smallest subgroup of images as the training set? Will training the model longer take up more memory? (No.) So the smallest training set plus 1 epoch of training can quickly tell us how much memory a model needs.

07:08 Jeremy then trained different models to see how much memory they used. How much memory does the convnext-small model take? Which line of code does Jeremy use to find out the GPU memory used by the model? Which two lines of code does Jeremy use to free unnecessarily occupied GPU memory so that you don’t need to restart the kernel to run the next model?


08:04 What if a model crashes with a CUDA out of memory error? What is GradientAccumulation? What is integer division? (//)


What is the problem of using a smaller batch size? (The smaller the batch size, the more volatile the learning rate and weights.) How can we make the model train with a small batch size as if it were a large batch size? How can GradientAccumulation be explained in code?
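The idea above can be sketched in a toy plain-PyTorch loop (this is not fastai’s actual GradientAccumulation callback; the model, random data, and the accumulation target of 64 are all illustrative):

```python
import torch
from torch import nn

# Toy gradient accumulation: gradients from several small batches sum
# up in p.grad across backward() calls, and we only step the optimizer
# once enough samples have been seen.
torch.manual_seed(0)
model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

count, bs = 0, 32                    # two sub-batches of 32 act like one batch of 64
for _ in range(4):
    xb, yb = torch.randn(bs, 4), torch.randn(bs, 1)
    loss = nn.functional.mse_loss(model(xb), yb)
    loss.backward()                  # accumulates, does not overwrite
    count += bs
    if count >= 64:                  # the count >= 64 test discussed above
        opt.step()
        opt.zero_grad()
        count = 0
```

The key point is that backward() adds into the existing gradients, so two sub-batches of 32 produce (almost) the same update as one batch of 64.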


12:54 What are the implications of using GradientAccumulation? How different are the numeric results with and without it? What is the main cause of the difference?

15:15 More questions: it should be count >= 64 in the code above when doing GradientAccumulation; lr_find uses the batch size from the DataLoader

15:55 Why not just use a smaller batch size instead of GradientAccumulation? What is the rule of thumb for picking batch sizes? How about adjusting learning rate according to the batch size?

18:00 How did Jeremy use GradientAccumulation to find out how much accum is needed to run those large models on Kaggle’s 16GB GPUs? (accum=1 always ran out of memory, but accum=2 worked for all the large models.)


19:54 How did Jeremy put all the models and their settings together for experimenting later? Do we have to use the size from the model’s specification for now, and what about in the future?


20:52 How to run all the models with specifications without running out of memory


22:07 Why doesn’t Jeremy use seed=42 here in training? What is the effect?

22:55 What is an ensemble, or bagging, of different good deep learning architectures? Why is it useful?

23:37 How to do the ensemble of different deep learning models?
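A minimal sketch of that ensembling step (the random tensors stand in for each model’s tta() probability predictions; shapes are made up):

```python
import torch

# Stack each model's probability predictions and average them;
# the final label is the argmax of the averaged probabilities.
torch.manual_seed(0)
tta_res = [torch.rand(8, 10) for _ in range(3)]   # 3 fake models, 8 images, 10 classes
avg = torch.stack(tta_res).mean(0)                # average across models
idxs = avg.argmax(dim=1)                          # one predicted class per image
```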


24:15 Why should we improve and submit to Kaggle every day? How can the submission history help trace your models’ development and improvement?

25:53 More questions: What is k-fold cross-validation and how can it be applied in this case? Why doesn’t Jeremy use it?

28:03 Are there any drawbacks of GradientAccumulation? Any GPU recommendations?

30:55 In part 2 Jeremy may cover how to train a smaller model to do well as in large models for faster inference

31:37 Multi-target model: How to build a dataloaders with two labels, disease type and variety type? What does the dataloader look like in show_batch? What is DataBlock? How to construct a DataBlock with all the necessary parameters to build a dataloaders for two-label prediction models? How does the get_variety function give the variety type of a rice image? 35:44 How to set the data split and the item and batch transformations?


37:51 How to create a model to predict both disease and variety types? Can we see predicting both disease and variety as predicting 20 things, 10 for disease and 10 for variety?

38:34 What does the new model (and new dataloaders) need now to make predictions on disease?


When and how do we provide our own loss function? fastai can detect an appropriate loss for your dataloaders and use it by default in simple cases. In this special case, how do we create and use a custom loss for the new model?

41:24 What does F.cross_entropy do exactly? This function belongs to the first and last layers, therefore we must understand it. What is the raw output of a model predicting 5 things?


What is the formula of softmax and how do we calculate it in the spreadsheet?
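The spreadsheet calculation can be reproduced in a few lines (the raw outputs below are arbitrary example numbers):

```python
import math

# Softmax: exponentiate each raw output, then divide by the sum of
# the exponentials, giving positive values that sum to 1.
def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([0.02, -2.49, 1.25])   # probabilities summing to 1
```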


44:41 What is the problem with softmax? How does it make an obviously wrong prediction when a cat image is given to the bear classifier?

45:43 What can we do about the problem with softmax above? (All prediction probabilities not adding up to 1.) When do you use softmax, and when not?

46:15 What is the first part of the cross_entropy loss formula?


47:03 How to calculate cross-entropy from softmax?
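Cross-entropy follows directly from the softmax step: take -log of the probability the model assigned to the correct class (a sketch, with made-up raw outputs):

```python
import math

# Cross-entropy for one example: softmax the raw outputs, then take
# -log of the probability of the target class.
def cross_entropy(raw, target):
    exps = [math.exp(x) for x in raw]
    p = exps[target] / sum(exps)
    return -math.log(p)

loss = cross_entropy([2.0, 0.1, -1.3], target=0)   # small: class 0 is favoured
```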


49:53 How to calculate binary-cross-entropy? How to understand its formula in predicting whether it is a cat or non-cat image? How to finally get the binary cross-entropy loss of a batch of 5 images?
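The binary case from the spreadsheet can be sketched the same way (the predictions and labels below are invented):

```python
import math

# Binary cross-entropy for one prediction p against label y (1 = cat):
# -(y*log(p) + (1-y)*log(1-p)); the batch loss is the mean over images.
def bce(p, y):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

preds = [0.9, 0.2, 0.8, 0.6, 0.3]   # predicted P(cat) for 5 images
targs = [1, 0, 1, 1, 0]             # actual labels
loss = sum(bce(p, y) for p, y in zip(preds, targs)) / len(preds)
```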


52:19 What are the two versions of cross-entropy in pytorch, and when do we use each version? Which version do we use here?


53:31 With a dataloader having two targets, what exactly does our new model need to be told about the loss function, metrics, and size of the output?


54:24 How to create a learner for predicting two targets, or 20 items? How does the learner use disease and variety losses to know which 10 items are disease predictions and which 10 are variety predictions? How to combine the two loss functions? How to understand the combined loss?
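The splitting described here can be sketched like this (the 20-wide output and the 10/10 column split follow the lesson; the helper names are my own):

```python
import torch
import torch.nn.functional as F

# The model outputs 20 numbers per image: the first 10 are treated as
# disease logits, the last 10 as variety logits. Each half gets its own
# cross-entropy, and the combined loss is simply their sum.
def disease_loss(preds, disease, variety):
    return F.cross_entropy(preds[:, :10], disease)

def variety_loss(preds, disease, variety):
    return F.cross_entropy(preds[:, 10:], variety)

def combined_loss(preds, disease, variety):
    return disease_loss(preds, disease, variety) + variety_loss(preds, disease, variety)
```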


57:01 How to calculate the error rate for disease types and variety types? How to put them together and display them during training?


57:22 How to make the new learner, and how did it train? Why didn’t the multi-task model improve, and why was it even a little worse than the previous model? Why could training the multi-task model longer improve the accuracy on disease prediction? Why could predicting a second thing help improve the prediction of the first thing? Using a multi-task model did improve the result in a Kaggle fish prediction competition Jeremy did before. What are the reasons or benefits for building multi-task models?


1:00:25 How to make multi-task modeling less confusing? (Build a multi-task model for the Titanic dataset from scratch; explore and experiment with this notebook.)

1:01:26 Where to learn more about cross-entropy? (A post by Chris Said on binary cross-entropy.)

1:02:00 Collaborative filtering deep dive, as chapter 8 without change. What is the dataset used? Which version of the data are we using? How to read a tsv file using pandas? How to read/understand the dataset content/columns? The recommendation-system industry and Radek. How does Jeremy prefer to see the data? (Cross-tabulated.) Why does the image of Jeremy’s preferred way of seeing the data have so little empty or missing data?


1:05:30 How to fill in the missing data or gaps in the cross-tabulated dataset? How to figure out whether a new user would like a particular movie which he/she has not watched before? Can we figure out what kind/genre of movie the particular movie is? What do the type probabilities of a movie look like? What do a user’s preference probabilities look like? If we match the two sets of probabilities up, can we know how much the user likes the movie? How do we calculate that?


1:08:09 So far so good, but what is the problem with doing a dot product between user preference probabilities and movie type probabilities to find out our new user’s rating of the movie? (We know neither of the probabilities.) How are we going to deal with this problem? Can we create such movie type probabilities without even knowing the types?

1:08:55 What are latent factors? If I don’t know anything about the movies, can we use SGD (stochastic gradient descent) to find them? Can we create 5 random numbers as a movie’s 5 latent factors for describing its type, and figure them out later? Can we create latent factors for each user too? Now how do we calculate the probability that a user likes a movie? (Matrix multiplication or dot product between the two groups of latent factors.)


1:11:28 Now the matrix multiplication or dot product can give us the prediction of how much a user likes a movie, so we can compare predictions with the true labels. What to do when there is a missing label? (We make the prediction empty or zero.) Can we use SGD to improve the latent factors by comparing predictions with labels using a loss function? How to use the Excel solver to update latent factors using SGD and the loss?
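The Excel exercise in miniature, in PyTorch (a tiny made-up ratings matrix, random latent factors, and SGD on the squared error):

```python
import torch

# Random latent factors for 2 users and 2 movies; predictions are the
# dot products; SGD nudges the factors until predictions match ratings.
torch.manual_seed(0)
ratings = torch.tensor([[5.0, 3.0], [4.0, 1.0]])    # users x movies
users  = torch.randn(2, 5, requires_grad=True)       # 5 latent factors each
movies = torch.randn(2, 5, requires_grad=True)
opt = torch.optim.SGD([users, movies], lr=0.1)

for _ in range(200):
    preds = users @ movies.T                         # all user/movie dot products
    loss = ((preds - ratings) ** 2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
```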


1:13:16 Why is Excel so slow at calculating gradients, even on a small dataset? What is the basis of collaborative filtering? (If we know A likes (a, b, c) and B likes (a, b, c), then if A likes (d, e), maybe B likes (d, e) too.)

1:15:22 Is the cosine of the angle between two vectors the same thing as the dot product?

1:16:07 How do we do the things above in pytorch, given it has a different data format from Excel? What would the dataset look like in pytorch?


1:18:37 What is an embedding? What are the embedding matrix, user embeddings, and movie embeddings? (Embedding = looking something up in an array.) The more intimidating the words created in a field, the less intimidating the field actually is.


1:20:05 What does our dataset look like before building a dataloaders on it? How to create a dataloaders for collaborative filtering using CollabDataLoaders.from_df? What does its show_batch look like? How do we create the user and movie latent factors altogether?


1:22:18 How do you choose the number of latent factors in fastai?

1:23:17 How to understand that looking up latent factors in Excel and doing a dot product with one-hot embeddings are actually the same thing? Can we think of embeddings as a computational shortcut for multiplying something by a one-hot-encoded vector? Can we think of embedding as a cool math trick for speeding up matrix multiplication with dummy variables (without creating the dummy variables or the one-hot-encoded vector)?
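That equivalence is easy to check directly (the sizes here are arbitrary):

```python
import torch

# An embedding lookup gives the same answer as multiplying a one-hot
# vector by the embedding matrix; the lookup just skips building the
# one-hot vector.
torch.manual_seed(0)
emb = torch.randn(5, 3)            # 5 users, 3 latent factors

user_idx = 2
one_hot = torch.zeros(5)
one_hot[user_idx] = 1.0
via_matmul = one_hot @ emb         # dummy-variable multiply
via_lookup = emb[user_idx]         # array-indexing shortcut
```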


1:27:13 How to build a collaborative filtering model from scratch? How do we create a class? (A model is a class.) How do we initialize a class object with __init__? Does __init__ tell us what parameters to give in order to create a class instance? What does a class method do? What is a superclass? Where do we put it when creating a class? What does it give us? What is the superclass (Module) for pytorch and fastai to use when creating a class? What does the DotProduct class look like?


1:29:57 How to understand the forward function in the DotProduct class? What does .sum(dim=1) mean? (Sum each row.)
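For reference, a plain-PyTorch sketch of the DotProduct model under discussion (the lesson builds it the same way with fastai’s Module and Embedding; the sizes and example batch below are made up):

```python
import torch
from torch import nn

class DotProduct(nn.Module):
    def __init__(self, n_users, n_movies, n_factors):
        super().__init__()
        self.user_factors = nn.Embedding(n_users, n_factors)
        self.movie_factors = nn.Embedding(n_movies, n_factors)

    def forward(self, x):
        users = self.user_factors(x[:, 0])    # look up each row's user
        movies = self.movie_factors(x[:, 1])  # look up each row's movie
        return (users * movies).sum(dim=1)    # dot product = sum each row

m = DotProduct(10, 20, 5)
batch = torch.tensor([[0, 1], [2, 3]])        # (user, movie) index pairs
out = m(batch)                                 # one predicted rating per row
```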


1:31:39 How to create a collab learner and start training? The training is very fast, even on a CPU.


1:32:47 Why is this collab model above not great? (People who give ratings are people who love movies; they rarely give a 1, but give many high ratings. Whereas the predictions often have ratings over 5.) Review the sigmoid usage. How can we apply a sigmoid transformation to the predictions? How does this sigmoid work? Why do we use an upper limit of 5.5 instead of 5? Does adding a sigmoid always improve the result?
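The sigmoid trick boils down to a one-liner (this matches fastai’s sigmoid_range; the 5.5 upper limit means a rating of 5 is reachable without an infinitely large raw score):

```python
import torch

def sigmoid_range(x, lo, hi):
    # squash raw scores into the open interval (lo, hi)
    return torch.sigmoid(x) * (hi - lo) + lo

x = torch.tensor([-10.0, 0.0, 10.0])
out = sigmoid_range(x, 0, 5.5)     # approaches 0 and 5.5 at the extremes
```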


1:34:29 What interesting things did Jeremy observe in the dataset? (Some users like to give high ratings to all movies; some tend to dislike all movies.) Can we add one bias value to both the user and movie latent factors to explain this observation? How to use the bias factors inside the collab model?


1:38:33 Why did the upgraded model with bias get worse? (overfitting).


1:39:06 What is weight decay and how does it help? How to understand weight decay as a solution to overfitting?


1:41:35 How to actually use weight decay in fastai code? Does fastai have a good default for collaborative filtering, as it does for computer vision? How does Jeremy suggest finding an appropriate wd value for your own dataset?


1:43:47 What is regularization? What’s wrong with the weights having high values or low values? How does weight decay help find the balance?
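In loss terms, weight decay adds wd * (weights ** 2).sum() to the loss, which contributes 2 * wd * w to each gradient. A minimal check of just that term:

```python
import torch

# Only the decay term, isolated: its gradient is 2 * wd * w, a steady
# pull toward zero that grows with the size of the weight.
w = torch.tensor([3.0], requires_grad=True)
wd = 0.1
loss = wd * (w ** 2).sum()
loss.backward()
# w.grad is now 2 * 0.1 * 3.0 = 0.6
```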

1:44:38 More questions: are there any rules other than Jeremy’s rule of thumb for the number of latent factors? And recommending based on the average rating is viable only when there is plenty of metadata.


Discovered some promising info regarding GPU memory leaks following a “CUDA out of memory” error. Still working through it myself, but that could take some time and I didn’t want to lose the links:

[EDIT] These didn’t work out as hoped.

  1. Evaluating “1/0” to force a new exception to release resources held by the previous frame, did not work.

  2. Doing os.environ['FASTAI_TB_CLEAR_FRAMES']="1" at the top of the notebook, didn’t work.

  3. The “Custom Solutions” using @gpu_mem_restore and with gpu_mem_restore_ctx(): didn’t work

In all cases, the behaviour is unchanged, and remains as follows…

train('convnext_large_in22k', 224, epochs=1, accum=1, finetune=False)

CUDA Out Of Memory Error


Before GC: GPU:0
process 32095 uses 16263.000 MB GPU memory
Post GC: GPU:0
process 32095 uses 4141.000 MB GPU memory

Is the fact that there are ten diseases and ten rice varieties (making 20 outputs) just a coincidence?

Sorry I am late to the lesson. I came down with COVID last Tuesday, from an outdoor unmasked party. :face_with_thermometer:

1 Like

When running the large model training from the command line, the CUDA memory errors were clearing back to zero MB. So that seemed the way to go. The threading and multiprocessing libraries tempted me to try them first, since they facilitate nice interprocess communication with queues, but these ultimately failed since they aren’t really separate memory spaces.

Finally succeeded using the subprocess library – managed to run successive memory-sizing tests from a notebook without a kernel reset being forced by the CUDA-error memory leak.

It’s hacky and a bit fragile, but paste the following code into one cell, and you “should” be able to convert any train() call into an xtrain() call. YMMV.

import inspect, subprocess

def xtrain(arch, aug_size, **kwargs):
    # Grab the notebook's train() source so the subprocess can define it too
    train_src = inspect.getsource(train)
    ext_src = f'''
import fastai, sys, gc, torch
from fastai.vision.all import *

trn_path = '{trn_path}'   # train() reads this global, so recreate it here

{train_src}

print('======== TRAINING IN EXTERNAL PROCESS ========')
print('trn_path = {trn_path}')
print('arch = {arch}')
print('aug_size = {aug_size}')
print({kwargs})

stat = 0  # no error
try:
    train('{arch}', {aug_size}, **{kwargs})
except Exception as e:
    if repr(e).find("CUDA out of memory") > 0:
        stat = 1  # CUDA oom error
    else:
        stat = 2  # other error
sys.exit(stat)
'''
    result = subprocess.run(["python", "-c", ext_src])
    if result.returncode == 0:
        print('SUCCESS')
    elif result.returncode == 1:
        print('CUDA OUT OF MEMORY ERROR')
    else:
        print('OTHER ERROR')

# xtrain('convnext_small_in22k', 128, epochs=1, accum=1, finetune=False)
# xtrain('swinv2_large_window12_192_22k', 192, epochs=1, accum=1, finetune=False)
1 Like

As you can see in the copy I made of Jeremy’s Multi-target: Road to the Top, Part 4, after training the multi-target model to predict disease and variety, calling get_preds would return “raw” outputs and not probabilities. Both the probs and decoded are the same “raw” values:


The error_rate function returned the same value that was reported in the training.

With what we learnt in this lesson, I now know that if we wanted the probabilities, we would calculate the softmax for those raw values. And applying argmax would give us the “normal” decoded values.
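A sketch of that by-hand decoding (the 20-wide raw tensor and the 10/10 split follow the multi-target notebook; the random tensor stands in for get_preds output):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
raw = torch.randn(4, 20)                       # stand-in for raw get_preds output
disease_probs = F.softmax(raw[:, :10], dim=1)  # softmax each half separately
variety_probs = F.softmax(raw[:, 10:], dim=1)
disease_idx = disease_probs.argmax(dim=1)      # decoded class per image
```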

But what I was wondering is whether this is the expected output; I was expecting the probabilities and decoded values.

Am I missing something in what I’ve done or expected from get_preds?

Thanks a lot.

No, you’re not missing anything – fastai can only do its auto-magical stuff for loss functions and activation functions it knows about (which doesn’t include custom functions).

1 Like

@jeremy, I was surprised to see very little mention of progressive resizing in this year’s course!
Has that practice fallen from grace?

Would love to hear your thoughts on it, and whether the juice is worth the squeeze.

Thanks a ton for another fantastic course!

1 Like

It’s in the walkthrus

1 Like

When experimenting for an appropriate accum/batchsize, is best practice (or even required), for accum to be a power of two? e.g…
train('swin_large_patch4_window7_224', 224, epochs=1, accum=2, finetune=False)

1 Like

No, but you’d want batchsize divided by accum to be an integer.

1 Like

I was stuck a while trying to work out why the following code was erroring with targs being None…

tst_files = get_image_files(workpath/'test_images').sorted()
preds,targs = learn.tta(dl=learn.dls.test_dl(tst_files))

and finally came to the head-slap conclusion that it’s nonsensical to try to determine an error_rate for the test_images, since they aren’t supplied with categories. Obvious in hindsight, but can someone confirm so I can lock the concept in?


Yeah, targs is None


The Multi-target: Road to the Top, Part 4 notebook ends with learn.fine_tune(). A bit more is required to do a submission, but the tensors are a different shape than in earlier parts due to the additional variety elements.

I think I’ve worked it out (the first dozen items of my ‘subm.csv’ match an earlier submission), but I’d like to check I haven’t missed anything, or whether there is a better way.

Reviewing the inference result we start with…

tst_files = get_image_files(path/'test_images').sorted()
tst_dl = dls.test_dl(tst_files)
allpreds,_ = learn.get_preds(dl=tst_dl)
allvocab = np.array(learn.dls.vocab)
print(allvocab.shape, allvocab,"\n")


torch.Size([3469, 20])
(2, 10) [[‘bacterial_leaf_blight’ ‘bacterial_leaf_streak’
‘bacterial_panicle_blight’ ‘blast’ ‘brown_spot’ ‘dead_heart’
‘downy_mildew’ ‘hispa’ ‘normal’ ‘tungro’]
[‘ADT45’ ‘AndraPonni’ ‘AtchayaPonni’ ‘IR20’ ‘KarnatakaPonni’ ‘Onthanel’
‘Ponni’ ‘RR’ ‘Surya’ ‘Zonal’]]

Pulling out just the disease part…

print(vocab.shape, vocab)


torch.Size([3469, 10])
(10,) [‘bacterial_leaf_blight’ ‘bacterial_leaf_streak’
‘bacterial_panicle_blight’ ‘blast’ ‘brown_spot’ ‘dead_heart’
‘downy_mildew’ ‘hispa’ ‘normal’ ‘tungro’]


idxs = preds.argmax(dim=1)
results = pd.Series(vocab[idxs], name="idxs")
ss = pd.read_csv(path/'sample_submission.csv')
ss['label'] = results
ss.to_csv('subm.csv', index=False)
!head subm.csv



So that seems to work, but one thing I’m not sure of is the meaning of the comma in the vocab shape “(10,)”, since there is no second number to separate from the first.

1 Like

Isn’t vocab a rank 1 tensor? Also curious what the type of vocab is. Is it a torch tensor or an L object? It’s a bit late here; I will go and try to replicate your experiments tomorrow.

1 Like

Looks good to me! (10,) is how python prints a tuple with one element (and also how you create one as a literal).