Lesson 7 official topic

How do we retrieve the mean and std vectors for normalization associated with pre-trained models?

I realize that fastai applies the normalization for training and inference automatically, but if I export the model (e.g., to iOS), I need to apply the normalization manually to the data before feeding it to the model for inference. Do all pre-trained models use the ImageNet normalization vectors? What about pre-trained models provided by timm?

Thanks!


Most use imagenet stats, but not all. You’ll find learner.normalization contains the callback object, and the stats will be in there.


Thank you. I must be doing something wrong. I get this: AttributeError: 'XResNet' object has no attribute 'normalization'

from fastai.vision.all import *
path = untar_data(URLs.IMAGENETTE)
dblock = DataBlock(blocks=(ImageBlock(), CategoryBlock()),
                   get_items=get_image_files,
                   get_y=parent_label,
                   item_tfms=Resize(460),
                   batch_tfms=aug_transforms(size=224, min_scale=0.75))
dls = dblock.dataloaders(path, bs=64)
model = xresnet50(n_out=10)  # assumed: the original post omits the model definition, but the error and follow-up mention xresnet50
learn = Learner(dls, model, loss_func=CrossEntropyLossFlat(), metrics=accuracy)
learn.normalization

Ah that’s because you didn’t use vision_learner, so you didn’t get normalization added automatically.


Thank you. I changed Learner to vision_learner as suggested:
learn = vision_learner(dls, xresnet50, loss_func=CrossEntropyLossFlat(), metrics=accuracy)

I now get this error: 'Sequential' object has no attribute 'normalization' in response to learn.normalization
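
For what it's worth, one place the stats can be dug out directly (a sketch to verify against your fastai version) is the Normalize transform that vision_learner adds to the batch transforms:

# Sketch: locate the Normalize transform among the batch transforms and read its stats.
norm = first(t for t in learn.dls.train.after_batch.fs if isinstance(t, Normalize))
print(norm.mean, norm.std)  # for most torchvision backbones these are imagenet_stats

# For timm models, the pretrained config usually carries the stats, e.g.:
# timm.create_model(arch, pretrained=True).default_cfg has 'mean' and 'std' entries

Here first is fastcore's helper (already imported via fastai.vision.all); it returns None if no Normalize transform is present.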


I was curious to have a go at extending the example “RTTT Part 4 multi-target notebook” to include Age. After throwing lots of darts in the dark, I’ve now cleared the runtime errors and am looking for some feedback on what I’ve cobbled together. Here is the notebook.

I added get_age by pattern-matching the existing code, and took a wild guess at using RegressionBlock:

def get_variety(p): return df.loc[p.name, 'variety']
def get_age(p):     return df.loc[p.name, 'age']
...
dblock = DataBlock(
    blocks=(ImageBlock, CategoryBlock, CategoryBlock, RegressionBlock),
    n_inp=1,
    get_items=get_image_files,
    get_y=[parent_label, get_variety, get_age],
    ...)

That seemed to work, with show_batch displaying the age…

By the same pattern matching, I extended the loss and error functions to three parameters. It was a struggle to clear runtime errors until I discovered rmse() among the regression metrics to use instead of error_rate(). But I'm not sure if I need to scale age?

I took a guess that age needs only a single additional float output (the zero-offset element at index 20).

def disease_err(inp,disease,variety,age): return error_rate(inp[:,:10],disease)
def variety_err(inp,disease,variety,age): return error_rate(inp[:,10:20],variety)
def age_err(inp,disease,variety,age):     return rmse(inp[:,20],age)
err_metrics = (disease_err, variety_err, age_err)

The gist here was using F.l1_loss for age regression, so I copied that…

def disease_loss(inp,disease,variety,age): return F.cross_entropy(inp[:,:10],disease)
def variety_loss(inp,disease,variety,age): return F.cross_entropy(inp[:,10:20],variety)
def age_loss(inp,disease,variety,age): return F.l1_loss(inp[:,20],age)
def combine_loss(inp,disease,variety,age):
    return (disease_loss(inp,disease,variety,age)
            + variety_loss(inp,disease,variety,age)
            + age_loss(inp,disease,variety,age))
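
For context, a sketch of how these plug into the learner, following the multi-target notebook's pattern (arch stands for whichever backbone is used; n_out=21 is my assumption: 10 disease classes + 10 varieties + 1 age output):

learn = vision_learner(dls, arch, loss_func=combine_loss,
                       metrics=err_metrics, n_out=21)  # arch = your chosen backbone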

So here is the result…

I notice that while age_err improves a lot, disease_err and variety_err don’t improve much, and certainly much less than without age. So I’m guessing the age_loss is swamping the others and needs to be scaled. Does that scaling need to be done in both loss and error functions? I can’t try this until tomorrow, so hints are welcome to reduce my time experimenting then…
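
One way to test the swamping guess (the 0.1 weight below is just a starting point to tune, not something from the lesson): scale age_loss inside the combined loss only. The error functions are metrics, so they only report and don't drive gradients, and can stay unscaled.

def combine_loss(inp,disease,variety,age):
    return (disease_loss(inp,disease,variety,age)
            + variety_loss(inp,disease,variety,age)
            + 0.1*age_loss(inp,disease,variety,age))  # down-weight age so it can't dominate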


Hi Fast.ai team,

Thanks a lot for a great lesson on Collab Filtering.

I just have a quick question: when I first heard about this algorithm (collaborative filtering), I assumed it would make use of data from other users when making predictions.

But as mentioned in this lesson, for both the dot-product and deep learning methods, the training only makes use of each user's own data, i.e., (user X, rating for movie Y) pairs, and optimizes the loss.

So my question is: what is the "collaborative" part implied in this method's name?

Thanks!


Hey folks, a quick question.
In the lesson, @jeremy says that embeddings are nothing but a way to look up into a matrix, and that embedding layers just handle the indexing and the gradient.
Is this true for word embeddings as well, such as those in models like GloVe and Word2Vec?

Hey,

I’m following along with the video and the collaborative filtering deep dive notebook, but I’m running everything in my own notebook in Colab. Everything has been working fine, but when I got to the line:

model = DotProduct(n_users, n_movies, 50)

I’m hitting an error:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
----> 1 model = DotProduct(n_users, n_movies, 50)
      2 learn = Learner(dls, model, loss_func=MSELossFlat())

/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py in __init__(self, num_embeddings, embedding_dim, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse, _weight, device, dtype)
    137         self.scale_grad_by_freq = scale_grad_by_freq
    138         if _weight is None:
--> 139             self.weight = Parameter(torch.empty((num_embeddings, embedding_dim), **factory_kwargs))
    140             self.reset_parameters()
    141         else:

TypeError: empty() received an invalid combination of arguments - got (tuple, dtype=NoneType, device=NoneType), but expected one of:
 * (tuple of ints size, *, tuple of names names, torch.memory_format memory_format, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)
 * (tuple of ints size, *, torch.memory_format memory_format, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool pin_memory, bool requires_grad)

Any ideas what I’ve done wrong?

Thanks!

Collab Filtering…
While going through the notebook once again, I noticed something, or am probably misinterpreting it on my end.
When we get to using deep learning for collab filtering, we use a simple linear-layer architecture which supposedly adds weights on top of the actual embeddings we want to learn.
The question now is: the embeddings we are about to learn are randomly initialized, and the same is true for the weights attached to them. This makes things complicated, especially when I try to interpret the backprop going on. Could anyone help? The code block is below.

class Collab_NN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0.5, 5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)  # randomly initialized parameters
        self.y_range = y_range
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),  # randomly initialized parameters
            nn.ReLU(),
            nn.Linear(n_act, 1))

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)

@deelight_del could you specify which part makes it complicated for you? You can think of an embedding as just another linear layer, with an extra step so that it can take indices and not vectors. When creating Embedding(*user_sz), say, this generates a user_sz[0] × user_sz[1] matrix (just as nn.Linear would do). You can check this with:

embs = [(944, 74), (1635, 101)]
model = Collab_NN(*embs)
print(list(model.parameters())[0].shape)

which returns:

torch.Size([944, 74])

When making a forward pass you take the “user part” of the x-values and pass it to self.user_factors(x[:,0]) (which is that embedding from before). But checking the size of x[:,0]:

x, y = dls.one_batch()
x[:,0].shape

shows:

torch.Size([64])

which doesn’t go together with the embedding matrix’s shape. To handle this, Embedding has automatic one-hot encoding built in, which turns the index 5, say, into a 1 × 944 vector that is 1 at position 5 and 0 elsewhere, and which can be multiplied with the matrix 🎉. (If I remember correctly, Jeremy mentions in the lesson that this is a matrix lookup and that torch optimizes it somehow. So in reality there is no matrix multiply, but conceptually you can imagine that there is, especially from the point of view of gradients.)
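
A quick way to convince yourself of the lookup/one-hot equivalence in plain PyTorch (my own sketch, reusing the 944 × 74 user embedding from above):

import torch
import torch.nn.functional as F

emb = torch.nn.Embedding(944, 74)
idx = torch.tensor([5])
one_hot = F.one_hot(idx, num_classes=944).float()      # 1 x 944, with a 1 at position 5
print(torch.allclose(emb(idx), one_hot @ emb.weight))  # True: lookup == one-hot matmul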

So to summarize: embeddings are almost linear layers, and you can calculate their gradient as you would calculate the gradient of a linear layer 🙂.
Feel free to ask if I should explain something in more detail.

How to get the same error_rate across multiple runs?

I am trying to get the train function to produce deterministic output. Below is the code from road-to-top part 3.

def train(arch, size, item=Resize(192, method='squish'), accum=1, finetune=True, epochs=12):
    dls = ImageDataLoaders.from_folder(trn_path, seed=42, valid_pct=0.5, item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75), bs=64//accum)
    cbs = GradientAccumulation(64) if accum else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    if finetune:
        learn.fine_tune(epochs, 0.01)
        return learn.tta(dl=dls.test_dl(tst_files))
    else:
        learn.unfreeze()
        learn.fit_one_cycle(epochs, 0.01)

Can somebody tell me what I am missing here? Every time I run it, I get a different error rate.

[Edit: I missed seeing that @vettukal was already using a seed.]

If all else fails your deterministic requirements, xkcd suggests a fallback…

I am already using the seed=42 param in the code snippet I posted.


Check out set_seed and the paragraph above it. The dataloader has its own random seed, separate from the main environment. The main reason for setting the dataloader seed is to ensure the training/validation split is the same for each training run, which is important for evaluating models against each other and for keeping your validation set consistent.

You can call set_seed for the base environment at the top of the train function, and that should give better reproducibility if that’s what you’re after. One example of something that would differ if you only set the dataloader seed is the random initialization of the linear layers that are added to the pre-trained model. Jeremy generally cautions against setting the random seed for the base environment: some seeds can produce better results, which can mask your model’s actual general performance, and as soon as you change anything, such as adding another class, the ‘advantage’ you stumbled upon from that initial seed is lost.
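
A sketch of what seeding inside the function could look like (same train function as above, with just the one added line):

def train(arch, size, item=Resize(192, method='squish'), accum=1, finetune=True, epochs=12):
    set_seed(42)  # seed the base environment before the learner (and its random head) is created
    dls = ImageDataLoaders.from_folder(trn_path, seed=42, valid_pct=0.5, item_tfms=item,
        batch_tfms=aug_transforms(size=size, min_scale=0.75), bs=64//accum)
    ...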


I’m not 100% sure what all the reproducible flag does under the hood in the set_seed function, but it seemed to make things run substantially slower, and I was still able to get the same results with it set to False. What I think happens if you set reproducible to True is that it forces your GPU to execute all operations in a deterministic order, which prevents a lot of the clever low-level arithmetic optimizations built into CUDA from being used. On the GPU a lot happens in parallel, and things such as the order in which you add floating-point numbers can produce slightly different results. For example:
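
A tiny stand-in illustration (my own example): floating-point addition is not associative, so the grouping the hardware happens to use changes the result in the last bits.

>>> (0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3)
False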


I believe when you specify things to be deterministic, operations always happen in the same order, and some of the CUDA tricks that make things run more quickly (but yield slightly different results) are not applied. If you’re interested in what some of those tricks are, check out some Numba CUDA videos about matrix multiplies and how they’re optimized to run faster on the GPU (warning: this is a pretty deep and complex rabbit hole). This is probably another reason Jeremy discourages it: you incur a substantial performance penalty.
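
For reference, a rough sketch of what fastai’s set_seed does, based on my reading of the source (double-check against your installed version):

import random
import numpy as np
import torch

def set_seed_sketch(s, reproducible=False):
    # seed the three RNGs that matter for training
    random.seed(s); np.random.seed(s); torch.manual_seed(s)
    if reproducible:
        # force deterministic cuDNN kernels: repeatable results, but slower
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False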


Whoops. I had looked for “seed” in your code sample, but it seems my eyes glossed over it. Apologies for telling you how to suck eggs. There is a whole range of experience levels on the forum. Hopefully Mat’s more detailed advice helps.

Thanks @matdmiller. set_seed(42) does work for me. Interestingly, set_seed has to be called every time training is done, not just once at the beginning.

So the code cell below should also produce reproducible output without changing the train method:

set_seed(42)
train('convnext_small_in22k', 128, epochs=2, accum=1)

Glad to help! It’s probably worth thinking about why it is that you have to call it each time you call the train function. One of the reasons is that the model (learner) is re-instantiated each time you call this function. It uses pretrained weights, but not all of the weights are pretrained: the linear layers are created automatically and randomly initialized to match the number of classes the learner detects from the data loader.
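
A tiny demonstration of the mechanism (my own sketch): every random draw advances the RNG state, so a second run without re-seeding initializes its head from different values.

import torch
from fastai.torch_core import set_seed

set_seed(42)
a = torch.randn(3)   # stands in for the first run's random head init
b = torch.randn(3)   # a second run without re-seeding draws different values
set_seed(42)
c = torch.randn(3)   # re-seeding first reproduces the original draw
print(torch.equal(a, c), torch.equal(a, b))  # True False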


I’ve been playing around with understanding DotProductBias model in the chapter.

So here’s the bog standard model, nothing new.

class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0,5.5)):
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias = Embedding(n_movies, 1)
        self.y_range = y_range
        
    def forward(self, x):
        users = self.user_factors(x[:,0])
        movies = self.movie_factors(x[:,1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:,0]) + self.movie_bias(x[:,1])
        return sigmoid_range(res, *self.y_range)

…trained on the data “looking” for 50 factors…

model = DotProductBias(n_users, n_movies, 50)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)

…I then start digging into the parameters of user_factors and user_bias of the model.

model.user_factors._parameters['weight'][0] # First person's factor weights
model.user_bias._parameters['weight'][0] # First user's bias
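
(Side note, worth verifying in your version: since fastai’s Embedding subclasses nn.Embedding, the public .weight attribute should hand back the same tensors a bit more cleanly:)

model.user_factors.weight[0]  # same tensor as model.user_factors._parameters['weight'][0]
model.user_bias.weight[0]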

…and then the min and max values in user_bias._parameters['weight'] (they look to be between -1 and 1):

# (min bias, max bias)
(tensor([-0.8956], device='cuda:0', grad_fn=<UnbindBackward0>),
 tensor([0.9463], device='cuda:0', grad_fn=<UnbindBackward0>))

The book talks about how biases capture how much more positive or negative users are in their reviews overall. Is there a way to interpret this besides looking at examples of movies?

Is it as simple as: a more negative bias = a more negative reviewer, and a larger bias = a more positive one?
Or perhaps: a smaller (or more negative) bias = more consistent ratings, and a larger bias = higher-variance ratings?

I also understand you can use PCA on the latent factors to get some additional insight.
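
One concrete thing to try, following the book’s chapter 8 analysis (this assumes the MovieLens DataLoaders from the lesson, where dls.classes['title'] holds the movie names):

movie_bias = model.movie_bias._parameters['weight'].squeeze()
idxs = movie_bias.argsort(descending=True)[:5]
print([dls.classes['title'][int(i)] for i in idxs])  # movies the model rates highly regardless of user factors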

Any help would be much appreciated! 😄


I am facing the error ‘NVML Shared Library Not Found’ when I run the report_gpu() function.

I have installed pynvml, yet the problem persists. Can anybody please help?