Lesson 6 In-Class Discussion ✅

It is not different - if you check the code, you’ll see that we’re simply using nn.Dropout to create that layer.
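
For reference, a minimal sketch of what that layer does, using plain PyTorch (nothing fastai-specific):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
print(drop(x))   # roughly half the values zeroed, the rest scaled by 1/(1-p) = 2

drop.eval()
print(drop(x))   # identity at inference time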

Great question. In general, no, these particular augmentation methods are unlikely to work well for (say) spectrograms. E.g. it doesn’t make sense to flip a spectrogram or use perspective warping. So you’ll need to come up with your own domain-specific augmentation methods. It’s easy to create new augmentation functions - but we won’t cover that until part 2, so in the meantime have a look at the implementation of (e.g.) brightness or flip in fastai.
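
As a rough illustration, a domain-specific augmentation can just be a function on the tensor. The sketch below is hypothetical (plain PyTorch, not fastai’s transform API): it adds a little noise and a random shift along the time axis of a spectrogram:

import torch

def augment_spectrogram(spec, noise_std=0.01, max_shift=8):
    "Hypothetical augmentation: additive noise plus a random roll along the time axis."
    spec = spec + noise_std * torch.randn_like(spec)
    shift = int(torch.randint(-max_shift, max_shift + 1, (1,)))
    return torch.roll(spec, shifts=shift, dims=-1)

spec = torch.randn(1, 128, 400)          # (channels, frequency bins, time steps)
augmented = augment_spectrogram(spec)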

6 Likes

There are only two types of numbers in a neural net: activations (stuff that’s calculated by a layer) and parameters (stuff that’s learned through gradient descent). Since bias isn’t calculated, you know it must be a parameter, so it’s being learned.
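
A minimal way to see the split in plain PyTorch - bias shows up among the learned parameters, while activations only exist during the forward pass:

import torch
import torch.nn as nn

layer = nn.Linear(3, 2)

# Parameters: learned through gradient descent (weight and bias)
for name, p in layer.named_parameters():
    print(name, p.shape)      # weight torch.Size([2, 3]), bias torch.Size([2])

# Activations: calculated by the layer from its input
x = torch.randn(5, 3)
activations = layer(x)        # shape (5, 2); not a parameter, just computed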

2 Likes

Would that just apply in the case of time-ordinal data, since we have a specific month order, or to any groups - such as a column that has groups of symptoms for cancer patients? Wouldn’t it then make sense to expand the column to show whether or not a certain category is present?

Recall that a kernel is a rank-4 tensor (n*c*h*w). Each one creates n features. So each feature is created from a c*h*w rank-3 tensor. For instance, if there are 3 input channels, and the kernel size is 3, then that’s a 3*3*3 tensor. A convolution at each position will do an elementwise multiplication of each of those 27 pixels with the corresponding kernel locations, and sum them up, to create a single number.
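
To make that concrete, here’s a small check in plain PyTorch (a sketch with made-up sizes, no bias or padding): multiplying one 3x3x3 patch elementwise with one 3x3x3 kernel slice and summing gives exactly the conv output at that position.

import torch
import torch.nn.functional as F

img    = torch.randn(1, 3, 8, 8)      # (batch, channels, height, width)
kernel = torch.randn(4, 3, 3, 3)      # rank-4: (n features, channels, h, w)

out = F.conv2d(img, kernel)           # shape (1, 4, 6, 6)

# Reproduce the top-left output of the first feature by hand:
patch  = img[0, :, 0:3, 0:3]          # the 27 pixels under the kernel
manual = (patch * kernel[0]).sum()    # elementwise multiply, then sum
print(torch.allclose(manual, out[0, 0, 0, 0]))   # True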

1 Like

I wanted the kernel elements to sum to 1.0, so it didn’t make the picture lighter or darker overall.

1 Like

No. Check the source and/or docs, and you’ll see it simply calls whatever function you ask for:

import torch.nn as nn
from torch import Tensor
from typing import Callable

LambdaFunc = Callable[[Tensor], Tensor]   # type alias: a function mapping a Tensor to a Tensor

class Lambda(nn.Module):
    "An easy way to create a pytorch layer for a simple `func`."
    def __init__(self, func:LambdaFunc):
        "create a layer that simply calls `func` with `x`"
        super().__init__()
        self.func=func

    def forward(self, x): return self.func(x)
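
For example, you can use it to wrap a simple function as a layer inside nn.Sequential (hypothetical sizes):

model = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    Lambda(lambda x: x.view(x.size(0), -1)),   # flatten (1, 512, 1, 1) -> (1, 512)
    nn.Linear(512, 10),
)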
1 Like

Let’s say there’s a feature that finds how fluffy the texture is in one part of the image. Then the average pooling averages that over the whole image, to say “how fluffy is the thing in this image, on average?”
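
In code, that’s the adaptive average pool that sits right before the final layers (a sketch with made-up sizes):

import torch
import torch.nn as nn

feature_maps = torch.randn(1, 512, 11, 11)   # 512 features over an 11x11 grid

pool = nn.AdaptiveAvgPool2d(1)               # average each feature over the whole image
pooled = pool(feature_maps)                  # shape (1, 512, 1, 1): one number per feature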

4 Likes

There might be situations where that is helpful, although often dropping a whole input column could remove the main thing that lets you make good predictions. You can also try adding dropout as your first layer, and see how it goes.
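
A minimal sketch of what “dropout as your first layer” looks like (hypothetical sizes):

import torch.nn as nn

model = nn.Sequential(
    nn.Dropout(0.1),       # input dropout: randomly zeroes some input features
    nn.Linear(20, 100),
    nn.ReLU(),
    nn.Linear(100, 2),
)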

1 Like

Is this reasoning about dropout and L2 regularization correct?
I did ask this during class; I guess it was too long to be answered right away.

L2 regularization affects all the parameters of all the units in every epoch, whereas dropout affects only some of the units in each epoch.
If we are too aggressive with L2 we may need fewer epochs (assuming we avoid gradient steps so large that we oscillate).
But for dropout to work we have to train for longer (more epochs); that’s the only way we get greater coverage of the units.

Bottom line: dropout can be more effective over longer training runs, whereas L2 can be more effective over shorter ones.

Hello!

@jeremy made the following statement in Lesson 6 about batch norm momentum and the regularization effect of changing the momentum value (53:17 in the video):
“If you use a smaller number, it means that the mean and standard deviation will vary less from mini batch to mini batch and thus will have less regularization effect, a larger number will mean the variation will be greater from mini batch to mini batch that will have more regularization effect.”

I found that PyTorch uses the momentum argument in the formula differently (compared with TensorFlow):
https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm1d


So, as I understand it, the running statistics are currently implemented as follows (corresponding to the official documentation linked above):

x_new = (1 - momentum) * x_est + momentum * x_t,

where x_est is the estimated statistic (from the previous update step) and x_t is the newly observed value.
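
A quick numerical check of this update rule (a minimal sketch; running_mean starts at 0 by default):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1, momentum=0.1)
x = torch.full((4, 1), 10.0)      # one feature, batch mean = 10

bn.train()
bn(x)                             # one training step updates the running stats

# x_new = (1 - 0.1) * 0 + 0.1 * 10 = 1.0
print(bn.running_mean)            # tensor([1.])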

So if momentum is small, then (1 - momentum) is large and we get more smoothed values of the running statistics (“mean and standard deviation will vary less”). What I don’t understand is why “that will have less regularization effect”, as Jeremy said. I thought that with the more smoothed version we would have more regularization effect, not less, and vice versa for a large momentum value.

Could you please explain where I am wrong?

Thank you!

1 Like

No, that doesn’t really make sense to me.

More randomness means more regularization effect. More smoothing means less randomness.

Regarding tabular learner:
It looks like tabular_learner.predict() expects only one row from a dataframe at a time - how can I provide the full validation dataframe?
Is there a way to add AUC as a metric to the tabular learner in the case of binary classification?

2 Likes

Thank you @jeremy,

I think I am confused by comparing this with the simple version of regularization, where we have simple regression and more regularization gives a smoother curve. So these 2 cases are still contradictory in my mind 🙂
Could someone explain the difference between these 2 cases: why does smoothing in simple regression result in more regularization, but for batch norm smoothing we get less regularization?
It seems intuitive that more randomness regularizes more, but after remembering the case of simple regression I don’t understand what the difference is here.

Update:
I think I found the answer - “By adding Gaussian noise to the input, the learning model will behave like an L2-penalty regularizer.” - so now it is consistent for me.
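
For anyone curious, input noise of that kind is just the following (a minimal sketch, not from the lesson code):

import torch

def add_input_noise(x, sigma=0.1):
    "Add Gaussian noise to the inputs; as quoted above, this behaves like an L2-style penalty."
    return x + sigma * torch.randn_like(x)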
Thank you!

I am building a tabular model similar to Rossmann that has two different multi-categorical features that can have up to 10 different values from a list of 20000 possible values (like the satellite photo competition, but many more values). Think of a video or article with multiple topics or keywords.

I could use 10 columns for each type of feature and just encode them with each included category’s key, but I would like to use embeddings.

What is the best way to represent this type of multi-category feature and use embeddings?

1 Like

Watching Jeremy’s introduction to dropout reminded me of this paper (PDF):

Forgetfulness is a feature, not a bug!

7 Likes

In some rare cases augmentation with one epoch could make sense, if the variation of available images is very low (almost repeated images everywhere). Then it would effectively be image synthesis through augmentation, and could make sense even with 1 epoch - an edge case, but possible.

2 Likes

Hello,

I’m very confused by something in the Rossmann example. Particularly, this line of code:

learn = tabular_learner(data, layers=[1000,500], ps=[0.001,0.01], emb_drop=0.04,
                        y_range=y_range, metrics=exp_rmspe)

Here layers=[1000,500] gives an intermediate weight matrix with 1000 × 500 = 500,000 parameters. Jeremy said that this is so many parameters that it would likely overfit our data. But we have over 1,000,000 samples. So how do you know beforehand how many parameters are likely to overfit your data?

So is batch size the same as number of features?