Lesson 5 Advanced Discussion ✅

This paper also lists a believable 0.89 RMSE (MSE of 0.794) for the MovieLens 100k dataset.

https://arxiv.org/abs/1807.01798

Weight decay

Hello,
Andrew Ng, in his Machine Learning course, mentioned that we do not apply weight decay to the biases, so the loss function should only include the weight matrices, not the biases.
Does model.parameters return the biases as well, so that we are applying weight decay to all parameters, including the biases?

Also, why would we need to decay the biases?

Regards
Ibrahim
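
For reference, a minimal sketch in plain PyTorch (not the fastai internals): model.parameters() does return the biases too, so a single optimizer group would decay everything. To exclude the biases from weight decay you can split the parameters into two groups; the model and hyperparameters below are just placeholders.

import torch

model = torch.nn.Linear(10, 2)                          # placeholder model
print([name for name, _ in model.named_parameters()])   # ['weight', 'bias'] -- biases are included

decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if name.endswith('bias') else decay).append(p)

opt = torch.optim.Adam([
    {'params': decay, 'weight_decay': 1e-2},    # weight matrices get decayed
    {'params': no_decay, 'weight_decay': 0.0},  # biases are left alone
])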

7 Likes

I would like to know more about the default 1/3 multiplier on the differential learning rates, please :slight_smile:.

1 Like

Wondering how embeddings for unseen data are computed, I dug into the fastai library. For each column, fastai adds an extra category for unknown values, with its own embedding vector. However, it is not clear to me how the weights of the embedding vector for the unknown category are computed. I would expect it to be the mean or the median of the other embedding vectors, but it is not. Visualizing the weights of that embedding vector before and after training, the weights change slightly, but basically remain close to their randomly initialized values.

Since it doesn’t seem to be computed as the mean/median of the other vectors after training, I wonder:

  1. Does it make sense to leave it with its (close to) randomly initialized values?
  2. Why do the weights change slightly if there is no unknown data in the training set? How can anything backprop to (and update) an unseen category?

This is the code that I ran, based on lesson4-tabular:

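A minimal sketch of this kind of check, assuming the fastai v1 tabular API, where learn is the trained tabular Learner, learn.model.embeds holds one embedding layer per categorical variable, and row 0 is the #na/unknown category:

emb = learn.model.embeds[0]                # embedding layer of the first categorical variable
unknown_vec = emb.weight.data[0]           # embedding vector of the unknown (#na) category
seen_mean = emb.weight.data[1:].mean(0)    # mean of the embedding vectors of the seen categories
print(unknown_vec, seen_mean)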

I’d appreciate some clarification to better understand it. Thanks!

3 Likes

I’m not really aware of anything much beyond fully connected nets. Let me know if anyone here has come across anything interesting!

1 Like

It’s better to use create_cnn, so that fastai will create a version you can use for transfer learning for your problem.

4 Likes

I think they’re actually reporting MSE, based on their SVD++ number.

1 Like

Following my previous post about the embedding vectors for unseen data: after training a tabular model, I manually set the weights of the embedding vectors for the unseen data equal to the mean of the other embedding vectors, achieving an increase in accuracy (I don’t think it is just by chance).

I’d like to get some feedback, especially if any of you try it on your tabular model (especially with unseen categories in the validation set). It is just one line of code to update the embedding weights.

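An illustrative sketch of this kind of update, assuming the fastai v1 layout where learn.model.embeds is the list of categorical embedding layers and row 0 is the unknown category:

import torch

with torch.no_grad():
    for emb in learn.model.embeds:
        emb.weight[0] = emb.weight[1:].mean(0)   # unknown row <- mean of the trained rows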

6 Likes

I think that’s a great idea! :slight_smile:

5 Likes

What I’d like to try next is to apply a kind of mix between dropout and data augmentation for tabular data. Currently the embedding vector for unseen data does not seem to be trained… What if categorical variables are randomly set to 0 (unknown) during training? I guess this would help the model learn to predict even with some unseen categories. I think this is different from regular dropout in three ways:

  1. Unlike embedding dropout, which randomly drops the output of the embedding layer, the idea here is to randomly drop input values of the categorical features, forcing the model to train using the embedding vector for unseen data instead.
  2. For each categorical variable, add a column to flag when the value of the category is unknown, either through this “data augmentation/dropout” during training or through actual unknown categories during validation/inference. The idea is that the model receives information when a value of the category is unknown, on top of the embedding vector for unseen data (similarly to the “_na” fields added when filling missing values).
  3. The “p” value (probability) of dropping a categorical variable would eventually be linked to the probability of that variable being unknown in the validation set (it is probably worth training the model to deal better with categorical variables that are often missing in the validation set). As Jeremy did when training the language model, I hope it is not cheating to use information from the features of the validation set for training.

I will try to implement it and will share the results.

4 Likes

Hi José,
I think this is a very interesting idea, and look forward to seeing your results.

I have a question though: how different would this be from applying dropout to the input layer?
If I understand correctly, what you are proposing is to apply a kind of dropout to the inputs of the embedding layer, to make those embeddings more robust. You might be interested in comparing three approaches: 1) no dropout applied, 2) dropout applied to all inputs to the NN, or 3) dropout applied to categorical inputs only. Just a thought.

1 Like

Thank you, Ignacio, for your feedback.
As you mentioned, it is a kind of input dropout; however, some tweaks were required to make it work with categorical variables, including:

  1. Applying regular dropout to a categorical variable seems too aggressive, since it is not just zeroing some inputs but changing the category itself, and therefore using completely different values from the embedding layer. To mitigate this, I include a flag for each category indicating whether the category is unknown or not, so the model receives extra information to deal with it.
  2. Dropout was applied discriminatively by variable, based on the % of unknown values in the validation set, to put more effort on variables that are often unknown.
  3. Regular PyTorch dropout does not seem to work with integer values, so a custom dropout function was required to make it work with the embedding inputs.

The final result seems slightly better than training the regular model; however, it is not clear whether that is just due to training stochasticity. My toy dataset is quite simple and doesn’t have many unknown values. Once Jeremy introduces the Rossmann dataset in class, I will benchmark on it. Anyway, I am not aiming to invent anything, just taking these experiments as a way to learn and practice. :slight_smile:
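
A rough sketch of a custom categorical dropout along the lines described above, assuming category code 0 is the unknown/#na code (as in fastai) and one drop probability per variable; the class name and interface here are made up for illustration:

import torch
import torch.nn as nn

class CatInputDropout(nn.Module):
    "During training, randomly send categorical codes to 0 (unknown) and return an 'is unknown' flag per variable."
    def __init__(self, ps):
        super().__init__()
        # one drop probability per categorical variable, e.g. the % of unknowns in the validation set
        self.register_buffer('ps', torch.as_tensor(ps, dtype=torch.float))

    def forward(self, x_cat):  # x_cat: LongTensor of shape (batch_size, n_categorical_vars)
        if self.training:
            drop = torch.rand_like(x_cat, dtype=torch.float) < self.ps  # per-column drop decisions
            x_cat = x_cat.masked_fill(drop, 0)                          # dropped values use the unknown embedding
        flags = (x_cat == 0).float()   # flag values that are unknown, whether dropped or genuinely missing
        return x_cat, flags

The flags tensor can then be concatenated to the continuous inputs, so the model is told explicitly which categories were unknown.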

1 Like

Thanks a lot for the detailed response, José!
I think it’s a very creative idea.
It’ll be interesting to see the results you get.
Best regards!

1 Like

Very sensible. We already use category code 0 for #na, so you can actually do this right now! :slight_smile:

1 Like

I’m making a model where the accuracy metric is top_3_accuracy. So I’m taking the top 3 predictions as the result, and if one of those is correct then it counts as correct for the accuracy metric. Currently my predictions look like [0.02, 0.04, 0.93, 0.01] because of cross entropy. How can I do this for three numbers, i.e. three larger probabilities and the others smaller? I think that might make the model better.

You can try:

# indices of the 3 highest-scoring classes for each row of predictions
pred_top3 = pred.topk(k=3, dim=1)[1]
# a row counts as correct if its target appears among the top 3 predicted classes
acc = torch.stack([(p == t).any() for p, t in zip(pred_top3, y)]).float().mean()

Hope this helps.

3 Likes

Hello! I tried your code, but it fails with this error:
TypeError: conv2d(): argument 'input' (position 1) must be Tensor, not bool

I have the latest fastai 1.0.28

1 Like

Sounds like a good research topic to me, especially since we seem to have established that tabular models could use more research. Thank you for sharing this meaningful contribution!

1 Like

Sorry to chime in late. Don’t you also lose track of which part of the feature set is lost? For instance, if you reduce 10 features to 2, you don’t know which of the 10 were dropped, right?

Actually @pbanavara you do know; PCA (principal components analysis) forms a set of new features that are independent linear combinations of the original input features, and keeps only those new features that have the strongest effect. The idea is, you gain a simpler model at the expense of some information loss. If you want to, you can find out exactly which new features have been dropped. However, in most cases it would be difficult to interpret these dropped features.