Lesson 7 official topic

For some reason I completely missed the link to the live stream, so I had to watch the lesson later and could not ask the questions live :frowning:

Anyways, I was thinking:

  1. Jeremy mentioned that you could use collab filtering as the basis for a recommendation system, e.g. with users/products bought. But in this case I am not sure how it would work, as we would only have positive examples (i.e. users who have bought something). If we treat the missing values as negatives (i.e. the user has not bought X, hence they do not like it), then the matrix is complete, so there is nothing to train. On the other hand, if we treat the missing values as missing, all the labels will be “1”. Don’t know how we could solve this :thinking:
  2. I often find that I have quite a lot of metadata about both the items and the users, which seems wasteful not to use. My intuition says that it could and should be used to create embeddings to use as a base for collab filtering later (either to replace the embeddings in the dot product or to somehow enhance them), but I am not sure how that would work in practice. Any ideas?
  3. It seems like collab filtering works best when we have many-to-many relationships between the users and the items. Can we make it work for one-to-many relationships? E.g. imagine that I have a bunch of salespeople and customers and I want to match a new customer to the salesperson who is most likely to close the deal. In this case each salesperson is matched to many closed/missed deals, but each customer is matched to a single salesperson. Once again, I am not sure how to make it work, but I feel like collaborative filtering should be usable to create some sort of ranking…
3 Likes

You often have the list of things presented to the user as well as the list of things they decided to click. The items that were presented but not clicked are negative examples. Here is how Airbnb does this: Real-time Personalization using Embeddings for Search Ranking at Airbnb - YouTube

But it is a very good observation, something I hadn’t thought of when I was adding the analytics to my shop :slight_smile:

6 Likes

The metadata can be used to build embeddings, but it is even more useful for addressing the cold-start problem, when you only have metadata and no preferences to train the model on.

I guess you thought about making embeddings from the metadata you have, sounds like a good plan.

I am not sure I understand this: do you mean something like:

  • Start by training a model that predicts the score only based on the metadata
  • Use that model to extract the embeddings for users and items (which can obviously get around the cold-start problem) to use in collab filtering?

How would the second step work in practice? Would I use the embeddings from the first step to initialise the embeddings of the collab filter? Would I concatenate the embeddings of the collab filter with the metadata? Sum them? Both? So many options…

Another thought I had was that the metadata could be used as a basis for the bias. In other words: use the embeddings from step 1 as a further input to the collab filter, stick a linear layer or an MLP on top of it (which would be trained with the collab filter) and use it as bias.

My main doubt is that I would love to find a way to train everything end to end, without needing to pretrain on the metadata first and the collab filter second.
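To make the idea concrete, here is a rough PyTorch sketch of what I mean by training everything end to end (all names and sizes are made up, so treat it as a starting point rather than a recipe):

import torch
from torch import nn

class HybridCollab(nn.Module):
    """Learn user/item embeddings jointly with small metadata encoders,
    so the whole thing trains end to end."""
    def __init__(self, n_users, n_items, n_user_meta, n_item_meta, n_factors=50):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, n_factors)
        self.item_emb = nn.Embedding(n_items, n_factors)
        # map raw metadata features into the same latent space
        self.user_meta = nn.Sequential(nn.Linear(n_user_meta, n_factors), nn.ReLU())
        self.item_meta = nn.Sequential(nn.Linear(n_item_meta, n_factors), nn.ReLU())

    def forward(self, user_ids, item_ids, user_feats, item_feats):
        # combine the learned embedding with the metadata-derived one
        u = self.user_emb(user_ids) + self.user_meta(user_feats)
        i = self.item_emb(item_ids) + self.item_meta(item_feats)
        return (u * i).sum(dim=1)  # dot-product score

For a cold-start user the nn.Embedding part contributes nothing useful yet, but the metadata branch still produces a sensible vector, which is what I was hoping to get out of this.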

Thanks for the link. Yes, I feared that the answer to the question is: go get some negative examples :joy:

1 Like

Yes, that is my only idea for the problem (hence my previous question), but I was wondering if the community has other ideas.

Intuitively (based on my naive ML knowledge so far), distillation should help reduce overfitting by reducing the number of “spare” neurons to memorise input data.

From that link I found the description of the second point insightful…

During the training process of a neural network under the “multi-view” [hypothesis], the network will:

  1. Quickly learn a subset of these view features depending on the randomness used in the learning process.

  2. Memorize the small number of remaining data points that cannot be classified correctly using these view features.

The first point implies that an ensemble of different networks will collect all these learnable view features, hence achieving a higher test accuracy.

The second point implies that individual models do not learn all the view features, not because they do not have enough capacity, but rather because there is not enough training data left to learn these views. Most of the data has already been classified correctly with the existing view features, so it essentially provides no gradient at this stage of training.

Thx. Looks like a useful technique to shrink a large cloud-trained model down to a smaller inference model to fit on a microcontroller, which I find really interesting since I bought one of these a few days ago… ESP32-S3-EYE Espressif Systems | Mouser Australia.

For those unfamiliar, the ESP32 is an awesome microcontroller including: a 240 MHz dual core, WiFi, Bluetooth, and a bundle of peripherals. The “S3” is an ML-optimised model and the “EYE” adds a 2-megapixel camera producing 1600 x 1200 images, which seems heaps considering the performance achieved with smaller images in the course so far.

I envisage all sorts of fun combining machine learning using this board’s camera interacting with the real world similar to this simple ESP32 web server.

1 Like

Keep us posted on your adventures with the ESP32 board and small models on it. I recently got a Freenove car kit with an ESP32 and camera module.

I’ve only used the obstacle avoidance code so far, but would like to have the camera captures analyzed and control commands sent back to the car eventually to navigate freely in an environment without having to pretrain it on a set course.

1 Like

A few queries re: cell 5 of the Kaggle notebook “Scaling Up: Road to the Top, Part 3”

  1. It seems strange that only one leg of the if/else does a return.

  2. I remember from transcribing there was a question about whether count>64 should be count>=64.

  3. While testing the GPU memory required by larger models, when a train() call succeeds, i.e. fine-tuning without error,
    then gc.collect() and torch.cuda.empty_cache() work fine:
    running report_gpu() once says: 15039.000 MB GPU memory
    and running report_gpu() again says: 1541.000 MB GPU memory.
    But when a “CUDA out of memory” error occurs, running report_gpu() does not release the memory, and I can’t find a way to release it without a kernel restart. I thought maybe the traceback context might be holding some reference preventing GC, but deleting the cell so the pink output cell disappears doesn’t help. Any hints on how to release memory after an error, without a kernel restart?
    p.s. From here I got a small script to print memory-resident tensors (with counts done in Excel), but I don’t know enough to analyse it.

Resident Tensor Count
<class ‘torch.Tensor’> torch.Size([1024]) 449
<class ‘torch.nn.parameter.Parameter’> torch.Size([1024]) 149
<class ‘torch.Tensor’> torch.Size([32, 197, 1024]) 73
<class ‘torch.Tensor’> torch.Size([3072, 1024]) 72
<class ‘torch.Tensor’> torch.Size([4096, 1024]) 72
<class ‘torch.Tensor’> torch.Size([3072]) 72
<class ‘torch.Tensor’> torch.Size([1024, 4096]) 72
<class ‘torch.Tensor’> torch.Size([1024, 1024]) 72
<class ‘torch.Tensor’> torch.Size([4096]) 72
<class ‘torch.Tensor’> torch.Size([32, 197, 4096]) 46
<class ‘torch.Tensor’> torch.Size([32, 16, 197, 197]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([4096]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([3072, 1024]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([1024, 4096]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([3072]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([1024, 1024]) 24
<class ‘torch.nn.parameter.Parameter’> torch.Size([4096, 1024]) 24
<class ‘torch.Tensor’> torch.Size([]) 9
<class ‘torch.Tensor’> torch.Size([512]) 8

A bit of progress. Based on…

saying…

I noticed that, in a notebook, it happens more frequently if the exception is not handled within the function but by the notebook environment.

I added this exception handling around fine_tune()…

def train(arch, size, item=Resize(480, method='squish'), accum=1, finetune=True, epochs=12):
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75), bs=64//accum)
    cbs = GradientAccumulation(64) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    try:
        if finetune:
            learn.fine_tune(epochs, 0.01)
            return learn.tta(dl=dls.test_dl(tst_files))
        else:
            learn.unfreeze()
            learn.fit_one_cycle(epochs, 0.01)
    except Exception as e:
        print(e)

which then allows half of the memory to be cleared. Below shows memory usage when using the GPU for the first time after a kernel restart, and then after a CUDA error…

Curiously, comparing the first and third calls to report_gpu() indicates there is still a process running, and nvidia-smi run at similar times shows…

Interestingly, the following indicates the GPU memory is held by a defunct process (i.e. there is no current process 13656)…

$ pip install py3nvml
$ py3smi

image

btw, this is on Paperspace.

  1. Have a look at how the function is used to see why.
  2. It should be >=, I believe
1 Like

okay, finetune=True is used for train() only in the final run, where a return value is required, in this line…
tta_res.append(train(arch, size, item=item, accum=8))

All other uses of train() are the CUDA sizing experiments of this form…
train('convnext_small_in22k', 128, epochs=1, accum=2, finetune=False)
where the None return value is thrown away.

1 Like

00:00 We have explored the simplest neural net with fully connected linear layers in earlier lectures. In this lecture we will focus on tweaking the first and last layers; in the next few weeks we will tweak the middle part of the neural net.

01:04 Review of the notebook Road to the Top, Part 2 and congrats to the fastai students who beat Jeremy into 1st and 2nd place.

02:47 What are the benefits of using larger models? What are the problems with larger models? (they use up GPU memory, as a GPU is not as clever as a CPU at finding ways to free memory, so a large model needs a very expensive GPU) What can we do when the GPU runs out of memory? First, restart the notebook; then Jeremy is about to show us a trick to enable us to train extra-large models on Kaggle. Wow!

04:39 How big is Kaggle’s GPU? Do you have to run notebooks on Kaggle sometimes, for example for code competitions? Why is it good and fair to use a Kaggle notebook to win on the leaderboard?

05:58 How did Jeremy use a 24 GB GPU to find out what a 16 GB GPU can do? How did Jeremy find out how much GPU memory a model will use? How did Jeremy choose the smallest subgroup of images as the training set? Will training the model longer take up more memory? (No) So the smallest training set plus one epoch of training can quickly tell us how much memory is needed for the model.

07:08 Jeremy then trained different models to see how much memory they used up. How much memory does the convnext-small model take? Which line of code does Jeremy use to find out the GPU memory used up by the model? Which two lines of code does Jeremy use to free unnecessarily occupied GPU memory so that you don’t need to restart the kernel to run the next model?

Images

08:04 What if a model crashes with a CUDA out-of-memory error? What is GradientAccumulation? What is integer division? (//)

Images

What is the problem with using a smaller batch size? (a smaller batch size means more volatility in the learning rate and weights) How can we make the model train with a small batch size as if it were using a large batch size? How is GradientAccumulation explained in code?

Images
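A minimal sketch of the idea in plain PyTorch (not fastai’s actual callback code, just the concept):

import torch
from torch import nn

# Run several small batches, let the gradients add up in .grad, and only step
# the optimizer once the accumulated sample count reaches the target batch size.
model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_func = nn.MSELoss()
target_bs, count = 64, 0

opt.zero_grad()
for _ in range(8):                      # pretend these are small batches of 16
    xb, yb = torch.randn(16, 10), torch.randn(16, 1)
    loss = loss_func(model(xb), yb)
    loss.backward()                     # gradients accumulate across batches
    count += len(xb)
    if count >= target_bs:              # note the >=, as questioned at 15:15
        opt.step()
        opt.zero_grad()
        count = 0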

12:54 What are the implications of using GradientAccumulation? How much do the numerical results differ with and without GradientAccumulation? What is the main cause of the difference?

15:15 More questions: it should be count >= 64 in the code above when doing GradientAccumulation; lr_find uses the batch size from the DataLoader.

15:55 Why not just use a smaller batch size instead of GradientAccumulation? What is the rule of thumb for picking batch sizes? How about adjusting learning rate according to the batch size?

18:00 How did Jeremy use GradientAccumulation to find out how much accum is needed to run those large models on Kaggle’s 16 GB GPUs? (accum=1 always runs out of memory, but accum=2 works for all the large models)

Images

19:54 How did Jeremy put all the models and their settings together for experimenting later? Do we have to use the image size from the model’s specification for now, and what about in the future?

Images


20:52 How to run all the models with their specifications without running out of memory?

Images

22:07 Why doesn’t Jeremy use seed=42 here in training? What is the effect?

22:55 What is ensembling or bagging of different good deep learning architectures? Why is it useful?

23:37 How to do the ensemble of different deep learning models?

Images




24:15 Why should we improve and submit to Kaggle every day? How can the submission history help trace your model’s development and improvement?

25:53 More questions: What is k-fold cross-validation and how can it be applied in this case? Why doesn’t Jeremy use it?

28:03 Are there any drawbacks of GradientAccumulation? Any GPU recommendations?

30:55 In part 2 Jeremy may cover how to train a smaller model to do as well as the large models, for faster inference.

31:37 Multi-target model: How to build a DataLoaders with two labels, disease type and variety type? What does a batch look like in show_batch? What is a DataBlock? How to construct a DataBlock with all the necessary parameters to build a DataLoaders for a two-label prediction model? How does the get_variety function give the variety of a rice image? 35:44 How to set the data split and the item and batch transformations?

Images
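From memory, the two-target DataBlock in the lesson notebook looks roughly like this (trn_path, path and the train.csv layout come from the competition setup earlier in the notebook):

from fastai.vision.all import *
import pandas as pd

# variety is looked up from train.csv by image file name
df = pd.read_csv(path/'train.csv', index_col='image_id')
def get_variety(p): return df.loc[p.name, 'variety']

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock, CategoryBlock),  # one input, two targets
    n_inp=1,                                            # only the first block is an input
    get_items=get_image_files,
    get_y=[parent_label, get_variety],                  # disease from folder name, variety from csv
    splitter=RandomSplitter(0.2, seed=42),
    item_tfms=Resize(192, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75)
).dataloaders(trn_path)
dls.show_batch(max_n=6)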

37:51 How to create a model to predict both disease and variety types? Can we see predicting both disease and variety in terms of predicting 20 things, 10 for disease, 10 for variety?

38:34 What does the new model (and new dataloaders) need now to make predictions on disease?

Images

When and how do we provide our own loss function? fastai can detect the appropriate loss for your DataLoaders and use it by default in simple cases. In this special case, how do we create and use our custom loss for the new model?

41:24 What does F.cross_entropy do exactly? This function belongs with the first and last layers, so we must understand it. What is the raw output of the model when predicting 5 things?

Images

What is the formula for softmax and how is it calculated in the spreadsheet?

Images
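The spreadsheet’s softmax, written out in PyTorch: exponentiate each raw output, then divide by the sum so the results add up to 1 (the numbers below are made up):

import torch

raw = torch.tensor([-4.9, 2.8, 1.1, -3.1, 0.2])          # raw outputs for 5 classes
probs = raw.exp() / raw.exp().sum()
print(probs, probs.sum())                                 # probabilities summing to 1
print(torch.allclose(probs, torch.softmax(raw, dim=0)))   # same as the built-in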

44:41 What is the problem with softmax? How does it make an obviously wrong prediction when a cat image is given to the bear classifier?

45:43 What can we do about the problem with softmax above? (let the prediction probabilities not add up to 1) When do you use softmax and when not?

46:15 What is the first part of the cross_entropy loss formula?

Images

47:03 How to calculate cross-entropy from softmax?

Images
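Cross-entropy is just the negative log of the softmax probability assigned to the correct class; a tiny check with made-up numbers:

import torch
import torch.nn.functional as F

raw = torch.tensor([[-4.9, 2.8, 1.1, -3.1, 0.2]])   # one image, 5 classes
target = torch.tensor([1])                           # the correct class is index 1
probs = raw.softmax(dim=1)
manual = -probs[0, target[0]].log()
print(manual, F.cross_entropy(raw, target))          # the two values match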

49:53 How to calculate binary cross-entropy? How to understand its formula for predicting whether an image is a cat or not? How do we finally get the binary cross-entropy loss for a batch of 5 images?

Images
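Binary cross-entropy for “is it a cat?” on a batch of 5 images, again with made-up numbers: -(y*log(p) + (1-y)*log(1-p)), averaged over the batch:

import torch
import torch.nn.functional as F

probs  = torch.tensor([0.9, 0.2, 0.7, 0.4, 0.95])     # predicted P(cat)
labels = torch.tensor([1., 0., 1., 1., 1.])           # 1 = cat, 0 = not cat
manual = -(labels*probs.log() + (1-labels)*(1-probs).log()).mean()
print(manual, F.binary_cross_entropy(probs, labels))  # same result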

52:19 What are the two versions of cross-entropy in PyTorch, and when should each version be used? Which version do we use here?

Images


53:31 With a DataLoaders having two targets, our new model needs to be told exactly what the loss function, the metrics, and the size of the output are.

Images

54:24 How to create a learner for predicting two targets, i.e. 20 outputs? How does a learner use the disease and variety losses to know which 10 outputs are disease predictions and which 10 are variety predictions? How to combine two loss functions together? How to understand the combined loss?

Images
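Roughly the loss code from the lesson notebook (from memory): the first 10 outputs are scored against the disease label, the last 10 against the variety label, and the combined loss is simply the sum of the two:

import torch.nn.functional as F

def disease_loss(inp, disease, variety): return F.cross_entropy(inp[:, :10], disease)
def variety_loss(inp, disease, variety): return F.cross_entropy(inp[:, 10:], variety)
def combine_loss(inp, disease, variety):
    return disease_loss(inp, disease, variety) + variety_loss(inp, disease, variety)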




57:01 How to calc error rate for disease types and variety types? How to put them together and display them during training?

Images
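And the matching metrics plus the learner, again roughly as in the notebook (assumes from fastai.vision.all import *, the two-target dls built above, and some architecture string for arch, e.g. 'resnet26d'):

def disease_err(inp, disease, variety): return error_rate(inp[:, :10], disease)
def variety_err(inp, disease, variety): return error_rate(inp[:, 10:], variety)

all_metrics = (disease_err, variety_err, disease_loss, variety_loss)
learn = vision_learner(dls, arch, loss_func=combine_loss,
                       metrics=all_metrics, n_out=20).to_fp16()
learn.fine_tune(5, 0.01)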

57:22 How to make the new learner, and how did it train? Why did the multi-task model not improve, and even do a little worse than the previous model? Why could training the multi-task model longer improve the accuracy of disease prediction? Why could predicting a second thing help improve the prediction of the first thing? Using a multi-task model did improve the result in a Kaggle fish prediction competition Jeremy did before. What are the reasons for, or benefits of, building multi-task models?

Images


1:00:25 How to make multi-task modelling less confusing to you? (build a multi-task model for the Titanic dataset from scratch; explore and experiment with this notebook)

1:01:26 Where to learn more about cross-entropy? (a post by Chris Said on binary cross-entropy)

1:02:00 Collaborative filtering deep dive, following chapter 8 of the book without change. What is the dataset used? Which version of the data are we using? How to read a TSV file using pandas? How to read/understand the dataset’s content/columns? The recommendation-system industry and Radek. How does Jeremy prefer to see the data? (cross-tabulated) Why does the image in which Jeremy shows his preferred way of seeing the data have so little empty or missing data?

Images



1:05:30 How to fill in the missing data or gaps in the cross-tabulated dataset? How to figure out whether a new user would like a particular movie which he/she has not watched before? Can we figure out what kind/genre of movie the particular movie we are talking about is? What do the type probabilities of a movie look like? What do a user’s preference probabilities look like? If we match the two sets of probabilities up, can we know how much the user likes the movie? How do we calculate that?

Images



1:08:09 So far so good, but what is the problem with the approach of taking the dot product between user preference probabilities and movie type probabilities to find out our new user’s rating of the movie? (we don’t know either set of probabilities) How are we going to deal with this problem? Can we create such movie type probabilities without even knowing the types?

1:08:55 What are latent factors? If I don’t know anything about the movies, can we use SGD (stochastic gradient descent) to find them? Can we create 5 random numbers as a movie’s 5 latent factors describing the movie’s type, and figure them out later? Can we create latent factors for each user too? Now how do we calc how much a user likes a movie? (mmult or dot product between the two groups of latent factors)

Images
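The spreadsheet’s MMULT in miniature: the dot product of one user’s latent factors with one movie’s latent factors gives the predicted rating (numbers made up):

import torch

user_factors  = torch.tensor([0.9, 0.1, -0.4, 0.3, 0.8])
movie_factors = torch.tensor([0.7, 0.2,  0.5, 0.1, 0.9])
pred = (user_factors * movie_factors).sum()   # one predicted rating
print(pred)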




1:11:28 Now the mmult or dot product can give us the prediction of how much a user likes a movie, so we can compare predictions with the true labels. What to do when there is a missing label or data? (we make the prediction empty or zero) Can we use SGD to improve the latent factors by comparing predictions with labels using a loss function? How to use Excel’s Solver to update the latent factors using SGD and the loss?

Images




1:13:16 Why is Excel so slow at calculating gradients, even with a small dataset? What is the basis of collaborative filtering? (if we know A likes (a, b, c) and B likes (a, b, c), then if A likes (d, e), maybe B likes (d, e) too)

1:15:22 Is the cosine of the angle between two vectors the same thing as the dot product?

1:16:07 How do we do the things above in PyTorch, given it uses a different data format from Excel? What would the dataset look like in PyTorch?

Images

1:18:37 What is an embedding? What are the embedding matrix, user embeddings, and movie embeddings? (an embedding = looking something up in an array) The more intimidating the words created in a field, the less intimidating the field actually is.

Images

1:20:05 What does our dataset look like before building a DataLoaders on it? How to create a DataLoaders for collaborative filtering using CollabDataLoaders.from_df? What does its show_batch look like? How do we create the user and movie latent factors altogether?

Images
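Roughly the setup code from the lesson/book for building the DataLoaders (I am using item_name='movie' on the raw ids here; the lesson merges in the movie titles first, which only changes how the items are displayed):

from fastai.collab import *
import pandas as pd

path = untar_data(URLs.ML_100k)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])
dls = CollabDataLoaders.from_df(ratings, item_name='movie', bs=64)
dls.show_batch()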




1:22:18 How do you choose the number of latent factors in fastai?

1:23:17 How to understand that looking up latent factors in Excel and taking the dot product with one-hot embeddings are actually the same thing? Can we think of an embedding as a computational shortcut for multiplying something by a one-hot-encoded vector? Can we think of an embedding as a cool maths trick for speeding up the matrix multiplication with dummy variables (without creating the dummy variables or the one-hot-encoded vector)?

Images
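A quick check that an embedding lookup really is just a shortcut for multiplying a one-hot vector by the embedding matrix (no dummy variables are ever created):

import torch
from torch import nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=5)   # e.g. 10 users, 5 latent factors
idx = torch.tensor([3])
one_hot = torch.zeros(1, 10); one_hot[0, 3] = 1.0
via_lookup = emb(idx)                  # the shortcut: index into the array
via_matmul = one_hot @ emb.weight      # the "dummy variable" version
print(torch.allclose(via_lookup, via_matmul))   # True: identical results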



1:27:13 How to build a collaborative filtering model from scratch? How do we create a class? (a model is a class) How do we initialise a class object with __init__? Does __init__ tell us what parameters to give in order to create a class instance? How does the class’s say method work? What is a superclass? Where do we put it when creating a class? What does it give us? What is the superclass (Module) that PyTorch and fastai use when creating a class? What does the DotProduct class look like?

Images




1:29:57 How to understand the forward function in the DotProduct class? What does .sum(dim=1) mean? (sum each row).

Images
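The DotProduct class, roughly as in the lesson (Module and Embedding here are fastai’s versions, which is why there is no explicit super().__init__() call); dls is the CollabDataLoaders built above:

from fastai.collab import *

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors = Embedding(n_users, n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        # x[:, 0] holds the user index and x[:, 1] the movie index for each row
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        return (users * movies).sum(dim=1)   # dot product per row

n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['movie'])
model = DotProduct(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)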

1:31:39 How to create a collab learner and start training? The training is very fast even on CPU.

Images


1:32:47 Why is the collab model above not great? (the people who give ratings are people who love movies; they rarely give a 1, but give many high ratings, whereas the predictions often have ratings over 5) Review the sigmoid usage. How can we apply a sigmoid transformation to the predictions? How does this sigmoid work? Why do we use 5.5 as the upper limit of the range instead of 5? Does adding the sigmoid always improve the result?

Images

1:34:29 What interesting things did Jeremy observe from the dataset? (some users like to give high ratings to all movies, some tend to dislike all movies). Can we add one bias value to both user and movie latent factors to explain this interesting observation? How to use the bias factors inside the collab model?

Images
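Putting the last two sections together (the sigmoid squashing the output into (0, 5.5), plus per-user and per-movie bias terms) gives roughly the DotProductBias model from the lesson, continuing from the DotProduct sketch above:

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])
        # sigmoid_range squashes the raw score into y_range; the upper limit is 5.5
        # rather than 5 so that a 5-star rating is actually reachable
        return sigmoid_range(res, *self.y_range)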



1:38:33 Why did the upgraded model with bias get worse? (overfitting).

Images

1:39:06 What is weight decay and how does it help? How to understand weight decay as a way of solving the problem of overfitting?

Images

1:41:35 How to actually use weight decay in fastai code? Does fastai have a good default for collaborative filtering, as it does for computer vision? How does Jeremy suggest finding the appropriate wd value for your own dataset?

Images
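In code, weight decay just adds the sum of squared weights (scaled by wd) to the loss, i.e. loss_with_wd = loss + wd * (parameters**2).sum(), which in practice means each gradient gets an extra wd * 2 * parameter term. In fastai you only need to pass wd to the fit call (continuing with the model above; 0.1 is the value tried in the lesson, but it is worth experimenting on your own data):

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)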


1:43:47 What is regularization? What’s wrong with the weights having high values or low values? How does weight decay help balance this?

1:44:38 More questions: are there any rules other than Jeremy’s rule of thumb for the number of latent factors, and is recommending by average rating viable only when there is a lot of metadata?

7 Likes

Discovered some promising info regarding GPU memory leaks following a “CUDA out of memory error.” Still working through it myself, but that could take some time and I didn’t want to lose the links:

[EDIT] These didn’t work out as hoped.

  1. Evaluating “1/0” to force a new exception to release resources held by the previous frame did not work.

  2. Doing os.environ['FASTAI_TB_CLEAR_FRAMES']="1" at the top of the notebook, didn’t work.

  3. The “Custom Solutions” using @gpu_mem_restore and with gpu_mem_restore_ctx(): didn’t work

In all cases, the behaviour is unchanged, and remains as follows…

train('convnext_large_in22k', 224, epochs=1, accum=1, finetune=False)

CUDA Out Of Memory Error

report_gpu() 

Before GC: GPU:0
process 32095 uses 16263.000 MB GPU memory
Post GC: GPU:0
process 32095 uses 4141.000 MB GPU memory

Is the fact that there are ten diseases and ten rice varieties (making 20 outputs) just a coincidence?

Sorry I am late to the lesson. I came down with COVID last Tuesday, from an outdoor unmasked party. :face_with_thermometer:

1 Like

When running the large model training from the command line, the CUDA memory errors were clearing back to zero MB. So that seemed the way to go. The threading and multiprocessing libraries tempted me to try them first, since they facilitated nice interprocess communication with queues, but these ultimately failed since they weren’t really separate memory spaces.

Finally succeeded using the subprocess library; managed to run successive memory-sizing tests from a notebook without a kernel reset being forced by the CUDA error memory leak.

It’s hacky and a bit fragile, but paste the following code into one cell, and you “should” be able to convert any train() call into an xtrain() call. YMMV.

import inspect, subprocess

def xtrain(arch, aug_size, **kwargs):
    train_src = inspect.getsource(train)
    ext_src = f'''
import fastai, sys, gc, torch 
from fastai.vision.all import *

trn_path='{trn_path}'

{train_src}
print('======== TRAINING IN EXTERNAL PROCESS ========')
print('trn_path = ', trn_path )
print('arch = {arch}' )
print('aug_size = {aug_size}' )
print( {kwargs} )
print('==============================================')

stat=0  # no error
try: 
    train('{arch}', {aug_size}, **{kwargs})
except Exception as e:
    print(e)
    if repr(e).find("CUDA out of memory") > 0:
        stat=1  # CUDA oom error
    else:
        stat=2  # Other error

sys.exit(stat)
'''

    result = subprocess.run(["python","-c", ext_src ])
    if result.returncode == 0: 
        print('TRAINING COMPLETED SUCCESSFULLY')
    elif result.returncode == 1:
        print('CUDA OUT OF MEMORY ERROR IN EXTERNAL PROCESS')
    else:
        print('OTHER ERROR')
    report_gpu()
    
# xtrain('convnext_small_in22k', 128, epochs=1, accum=1, finetune=False)
# xtrain('swinv2_large_window12_192_22k', 192, epochs=1, accum=1, finetune=False)
1 Like

As you can see in the copy I made of Jeremy’s Multi-target: Road to the Top, Part 4, after training the multi-target model to predict disease and variety, calling get_preds would return “raw” outputs and not probabilities. Both the probs and decoded are the same “raw” values:

image

The error_rate function returned the same value that was reported in the training.

With what we learnt in this lesson, I now know that if we wanted the probabilities, we would calculate the softmax for those raw values. And applying argmax would give us the “normal” decoded values.
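Something like this, assuming (as in the notebook) that the first 10 outputs are the disease and the last 10 the variety; I believe get_preds returns the raw values here because the custom combine_loss has no activation attached to it:

import torch.nn.functional as F

preds, targs = learn.get_preds(dl=learn.dls.valid)
disease_probs = F.softmax(preds[:, :10], dim=1)
variety_probs = F.softmax(preds[:, 10:], dim=1)
disease_pred = disease_probs.argmax(dim=1)   # index into the disease vocab
variety_pred = variety_probs.argmax(dim=1)   # index into the variety vocab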

But what I was wondering is whether this is the expected output; I was expecting probabilities and decoded values.

Am I missing something in what I’ve done or expected from get_preds?

Thanks a lot.