Yes, just pip install. You don’t need VS, but you will need to pip install any dependencies first, along with CUDA if you are using a GPU.

Besides its regularization effect, data augmentation also makes the CNN more robust by increasing its translation and scale invariance.

Yes, bias is also a “learned” parameter. It is generally initialized to 0, or to a small number in the case of ReLU activations. It is generally not randomly initialized, since the random initialization of the weights suffices.

I had a scenario with tabular data where all the features were continuous. What feature engineering can I do to prepare such datasets?

@jeremy Could you turn the Lesson 6 official resources and updates into a wiki so others can contribute or edit? Thanks.

Yes. Because we simply create and use pandas categorical variables, you can manually cast a column, including passing `ordered=True`.

https://pandas.pydata.org/pandas-docs/stable/categorical.html
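As a rough illustration of the manual cast mentioned above, here is a minimal sketch in plain pandas. The column name `size` and its values are made up for the example, and this is not taken from the fastai source.

```python
import pandas as pd

# Hypothetical column of size labels; the name and values are illustrative only.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Cast to an ordered categorical with an explicit category order.
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
df["size"] = df["size"].astype(size_type)

# The integer codes follow the declared order, not alphabetical order.
codes = df["size"].cat.codes.tolist()
print(codes)  # [0, 2, 1, 0]
```

The codes are what would typically be fed to an embedding layer, so declaring the order up front keeps them meaningful.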

You shouldn’t ever (AFAICT) need or want to use `get_dummies`, since that does 1-hot encoding, which is the thing that embeddings allow you to avoid (remember, they are just a computational shortcut for that).
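To make the "computational shortcut" point concrete, the sketch below compares an embedding lookup with an explicit 1-hot encoding followed by a matrix multiply. The sizes and indices are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
n_categories, emb_dim = 5, 3          # illustrative sizes
emb = nn.Embedding(n_categories, emb_dim)

idx = torch.tensor([0, 3, 3, 1])      # category codes for four rows

# Route 1: direct lookup of rows in the embedding matrix.
looked_up = emb(idx)

# Route 2: explicit 1-hot encoding followed by a matrix multiply.
one_hot = F.one_hot(idx, num_classes=n_categories).float()
multiplied = one_hot @ emb.weight

# Both routes give the same activations; the lookup just skips the multiply.
same = torch.allclose(looked_up, multiplied)
```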

It is not different - if you check the code, you’ll see that we’re simply using `nn.Dropout` to create that layer.

Great question. In general, no, these particular augmentation methods are unlikely to work well for (say) spectrograms. E.g. it doesn’t make sense to flip a spectrogram or use perspective warping. So you’ll need to come up with your own domain-specific augmentation methods. It’s easy to create new augmentation functions - but we won’t cover that until part 2, so in the meantime have a look at the implementation of (e.g.) `brightness` or `flip` in fastai.
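Purely as an illustration of what a domain-specific augmentation might look like, here is a sketch in plain PyTorch that blanks out a random band of time steps in a spectrogram. The helper `mask_time_band` and its `max_width` parameter are hypothetical, are not part of fastai, and do not use fastai's transform machinery.

```python
import torch

def mask_time_band(spec: torch.Tensor, max_width: int = 8) -> torch.Tensor:
    """Zero a random band of time steps in a (freq, time) spectrogram.

    Hypothetical helper for illustration; not part of fastai.
    """
    n_time = spec.shape[-1]
    width = int(torch.randint(1, max_width + 1, (1,)))
    start = int(torch.randint(0, n_time - width + 1, (1,)))
    out = spec.clone()                     # leave the input untouched
    out[..., start:start + width] = 0.0
    return out

torch.manual_seed(0)
spec = torch.rand(64, 100) + 0.1           # fake spectrogram, strictly positive
augmented = mask_time_band(spec)

# Count the time columns that were fully zeroed.
n_masked_cols = int((augmented == 0).all(dim=0).sum())
```

Whether this kind of masking actually helps would need to be checked on the task at hand.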

There are only two types of numbers in a neural net: activations (stuff that’s calculated by a layer) and parameters (stuff that’s learned through gradient descent). Since bias isn’t calculated, you know it must be a parameter, so it’s being learned.
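The distinction can be checked directly on a small layer. This is a minimal sketch using `nn.Linear` with arbitrary sizes: the bias shows up in the module's parameters and receives a gradient, while the output is an activation computed in the forward pass.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)

# Parameters: tensors registered on the module and updated by the optimizer.
param_names = [name for name, _ in layer.named_parameters()]
print(param_names)  # ['weight', 'bias']

# Activations: computed from the input and the parameters in the forward pass.
x = torch.randn(8, 4)
activations = layer(x)

# Both weight and bias receive gradients, so both are learned.
activations.sum().backward()
bias_has_grad = layer.bias.grad is not None
```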

Would that apply only to ordinal time data, where we have a specific month order, or also to any groups, such as a column that holds groups of symptoms for cancer patients? Wouldn’t it then make sense to expand the column to show whether or not a certain category is present?

Recall that a kernel is a rank-4 tensor (`n*c*h*w`). Each one creates `n` features. So each feature is created from a `c*h*w` rank-3 tensor. For instance, if there are 3 input channels, and the kernel size is 3, then that’s a `3*3*3` tensor. A convolution at each position will do an elementwise multiplication of each of those 27 pixels with the corresponding kernel locations, and sum them up, to create a single number.
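The arithmetic described above can be reproduced by hand for one output position. In this sketch the image size, the number of features (`n=4`), and the chosen position are arbitrary; only the `3*3*3` patch matches the example in the text.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(1, 3, 8, 8)    # batch of 1, 3 channels, 8x8 pixels
kernel = torch.randn(4, 3, 3, 3)   # n=4 features, c=3, h=3, w=3

out = F.conv2d(image, kernel)      # shape (1, 4, 6, 6), no padding

# Recompute feature 0 at output position (2, 5) by hand:
# multiply the 3x3x3 input patch by the 3x3x3 kernel slice and sum 27 products.
patch = image[0, :, 2:5, 5:8]
by_hand = (patch * kernel[0]).sum()

matches = torch.allclose(out[0, 0, 2, 5], by_hand, atol=1e-5)
```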

I wanted the kernel mean to be 1.0, so it didn’t make the picture lighter or darker overall.

No. Check the source and/or docs, and you’ll see it simply calls whatever function you ask for:

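```
from typing import Callable
import torch.nn as nn
from torch import Tensor

LambdaFunc = Callable[[Tensor], Tensor]  # alias assumed here so the snippet stands alone

class Lambda(nn.Module):
    "An easy way to create a pytorch layer for a simple `func`."
    def __init__(self, func:LambdaFunc):
        "create a layer that simply calls `func` with `x`"
        super().__init__()
        self.func=func

    def forward(self, x): return self.func(x)
```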
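For a quick sense of how such a layer behaves, here is a small usage sketch. The class is restated in compact form so the snippet runs on its own, and the flattening function and layer sizes are made up for the example.

```python
import torch
import torch.nn as nn

class Lambda(nn.Module):
    "Stand-in copy of the layer quoted above, repeated so this snippet runs alone."
    def __init__(self, func):
        super().__init__()
        self.func = func
    def forward(self, x): return self.func(x)

# Hypothetical use: flatten conv activations before a linear layer.
flatten = Lambda(lambda x: x.view(x.size(0), -1))
model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), flatten, nn.Linear(4 * 6 * 6, 10))

x = torch.randn(2, 3, 8, 8)
out = model(x)          # conv gives (2, 4, 6, 6); flatten gives (2, 144)
flat = flatten(x)       # the layer does nothing beyond calling the function
```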

Let’s say there’s a feature that finds how fluffy the texture is in one part of the image. Then the average pooling averages that over the whole image, to say “how fluffy is the thing in this image, on average?”
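In tensor terms, that averaging is a mean over the height and width of each feature map. A minimal sketch, assuming made-up activations of shape `(2, 5, 7, 7)`:

```python
import torch
import torch.nn as nn

# Hypothetical activations: batch of 2 images, 5 feature maps, each 7x7.
feats = torch.arange(2 * 5 * 7 * 7, dtype=torch.float32).view(2, 5, 7, 7)

pool = nn.AdaptiveAvgPool2d(1)     # average each feature map down to 1x1
pooled = pool(feats).flatten(1)    # shape (2, 5): one number per feature

# Same result as taking the mean over the height and width dimensions.
manual = feats.mean(dim=(2, 3))
same = torch.allclose(pooled, manual)
```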

There might be situations where that is helpful, although often dropping a whole input column could remove the main thing that lets you make good predictions. You can also try adding dropout as your first layer, and see how it goes.
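One way to try the "dropout as your first layer" idea is sketched below in plain PyTorch. The layer sizes and `p=0.1` are arbitrary. Note that `nn.Dropout` zeroes individual input values at random rather than whole columns.

```python
import torch
import torch.nn as nn

n_features = 10                      # illustrative number of input columns
model = nn.Sequential(
    nn.Dropout(p=0.1),               # input dropout: zeroes individual inputs at random
    nn.Linear(n_features, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

x = torch.ones(4, n_features)

model.train()
dropped = model[0](x)                # some entries are 0, the rest scaled by 1/(1-p)

model.eval()
kept = model[0](x)                   # dropout is a no-op at inference time
preds = model(x)
```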

Is this reasoning about dropout and L2 regularization correct?

**I did ask this during class, but I guess it was too long to be answered right away.**

L2 regularization affects all the parameters in all the activation units per epoch, whereas dropout affects only some of the units per epoch.

If we are too aggressive with L2 we may need fewer epochs (assuming we avoid gradient steps large enough to cause oscillation).

But for dropout to work we have to train for longer (a larger number of epochs). That’s the only way we get greater coverage.

Bottom line: dropout can be more effective over more epochs, whereas L2 can be more effective over fewer epochs.

Hello!

In Lesson 6, @jeremy made the following statement about batch norm momentum and the regularization effect of changing its value (53:17 in the video):

“If you use a smaller number, it means that the mean and standard deviation will vary less from mini batch to mini batch and thus will have less regularization effect, a larger number will mean the variation will be greater from mini batch to mini batch that will have more regularization effect.”

I found that PyTorch uses the momentum argument differently in its formula (compared with TensorFlow):

https://pytorch.org/docs/stable/nn.html#torch.nn.BatchNorm1d

So, as I understand it, the current implementation of the running statistics is the following (per the official documentation linked above):

x_new = (1 - momentum) * x_est + momentum * x_t,

where x_est is the estimated statistic (from the previous update step) and x_t is the newly observed value.
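The update rule above can be checked against `nn.BatchNorm1d` for the running mean. This is a minimal sketch with an arbitrary batch and `momentum=0.1`; it only verifies the running mean after a single training-mode forward pass.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
momentum = 0.1
bn = nn.BatchNorm1d(3, momentum=momentum)   # running_mean starts at 0

x = torch.randn(16, 3) + 5.0                # one mini batch with mean near 5
prev = bn.running_mean.clone()

bn.train()
bn(x)                                       # forward pass updates running stats

# x_new = (1 - momentum) * x_est + momentum * x_t
expected = (1 - momentum) * prev + momentum * x.mean(dim=0)
same = torch.allclose(bn.running_mean, expected, atol=1e-6)
```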

So if momentum is small, then (1 - momentum) is large and we get smoother running statistics (“mean and standard deviation will vary less”). What I don’t understand is why “that will have less regularization effect”, as Jeremy said. I thought a smoother version would have more regularization effect, not less, and vice versa for a large momentum value.

Could you please explain where I am wrong?

Thank you!

No, that doesn’t really make sense to me.