How do we come up with the number of hidden layers and the dimensions of the hidden layers?
Usually you want to use different kinds of data augmentation on your training set and your validation set. Jeremy was talking about normalization.
Trial and error.
A couple of notes about @ and matrix-multiplication notation in Python:
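For instance (a small NumPy sketch; @ is the standard Python matrix-multiplication operator from PEP 465):

```python
import numpy as np

# `@` (PEP 465) is Python's infix matrix-multiplication operator;
# for NumPy arrays, a @ b is equivalent to np.matmul(a, b).
a = np.array([[1., 2.],
              [3., 4.]])
b = np.array([[5., 6.],
              [7., 8.]])

print(a @ b)                                # matrix product, not elementwise
print(np.allclose(a @ b, np.matmul(a, b)))  # True
print(np.allclose(a * b, a @ b))            # False: * is elementwise
```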
In practice, you usually want to start with an existing architecture that is known to work well (and then tweak it to fit your needs).
Jeremy just said Kaiming but he meant Xavier. What he just explained is Xavier initialization I believe. Kaiming is when you account for ReLUs.
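To make that concrete, here's a small NumPy sketch (illustrative, not Jeremy's notebook code): Xavier scales weights so a plain linear layer preserves activation scale, but once ReLUs are involved the activations shrink layer by layer; Kaiming's extra factor of 2 compensates for the ReLU zeroing out half the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512
x = rng.standard_normal((10_000, fan_in))

def forward(x, n_layers, w_std):
    # Repeated ReLU -> linear layers, weights drawn with the given std
    z = x
    for _ in range(n_layers):
        z = np.maximum(z, 0) @ (rng.standard_normal((fan_in, fan_in)) * w_std)
    return z

# Xavier: Var(w) = 1/fan_in -- ignores the ReLU, so the scale halves
# (in variance) at every layer and the activations shrink with depth
print(forward(x, 5, 1 / np.sqrt(fan_in)).std())    # well below 1

# Kaiming: Var(w) = 2/fan_in -- the factor of 2 puts back the energy
# the ReLU threw away, so the scale stays roughly 1 at every depth
print(forward(x, 5, np.sqrt(2 / fan_in)).std())    # close to 1
```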
Sorry, I didn't get you…
- My intent was to ask whether both my train/valid sets will get normalized in the same manner or not.
- I presume other augmentations should be done in the same way on each?
- Should we pass _ for valid, or the same tfms which we apply to train?
In normalize, you have only one set of stats, so it will be used for both sets. Data augmentation is different: it's a regularization technique, so you want to apply it to your training set but not your validation set.
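A minimal NumPy sketch of that point (the normalize helper and the data here are illustrative, not fast.ai's API): compute the stats once on the training set, then reuse those same stats for the validation set.

```python
import numpy as np

def normalize(x, mean, std):
    # Always normalize with statistics computed on the TRAINING set,
    # and reuse those exact stats for the validation set.
    return (x - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(5., 3., size=(1000, 10))   # made-up data, mean ~5, std ~3
valid = rng.normal(5., 3., size=(200, 10))

mean, std = train.mean(), train.std()
train_n = normalize(train, mean, std)
valid_n = normalize(valid, mean, std)   # training stats, NOT valid.mean()

# Training set lands exactly at mean 0, std 1; validation only approximately,
# which is fine -- what matters is that both go through the same transform.
print(train_n.mean(), train_n.std())
print(valid_n.mean(), valid_n.std())
```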
Why do we need our data to have mean 0 and std 1?
That normal distribution isn't normal…
Links to the papers with Kaiming and Xavier initialization are at the top.
Because models like it. Making life easier for your model is the best way to get good results.
Also, if your activations aren't at scale 1, with 50 layers or deeper they are all going to become zeros (if scale < 1) or nan (if scale > 1).
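A quick NumPy illustration of that compounding effect (purely linear layers for simplicity; the depth and scales are illustrative). Weights just slightly below or above the variance-preserving scale multiply the activation scale by a constant factor per layer, so after 50 layers the values are effectively zero or astronomically large (in float32 the latter would overflow to inf/nan):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal((100, n))

def depth50(scale):
    # 50 linear layers; `scale` multiplies the variance-preserving
    # weight std of 1/sqrt(n), so the activation std changes by
    # roughly a factor of `scale` at every layer.
    z = x
    for _ in range(50):
        z = z @ (rng.standard_normal((n, n)) * scale / np.sqrt(n))
    return z

print(np.abs(depth50(0.5)).max())   # ~1e-15: activations have vanished
print(np.abs(depth50(2.0)).max())   # ~1e15: activations have exploded
```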
Is ReLU known to be better than sigmoid because it has been proven empirically, or is there an intuitive reason why one non-linear layer is better than the other?
Do we have an intuition for why models like it, apart from the vanishing-weights issue?
It's mostly because it's the fastest non-linearity to implement. As long as you account for your activations in your initialization (like Jeremy is explaining for Kaiming right now), the actual choice of non-linearity isn't that important.
Will the items Jeremy keeps bringing up as "do this for homework" be documented somewhere?
Not sure I got all of the items in my notes!
I've lost track of why we deleted all the negative values in the first place.
Well, that's actually the main reason: vanishing or exploding weights.
We need something to make our function non-linear. Deleting the negatives (ReLU) is a relatively simple and effective way to do this.
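In code, "deleting the negatives" is a one-liner (a NumPy sketch of ReLU, not any particular library's implementation):

```python
import numpy as np

def relu(x):
    # "Delete the negatives": keep positive values, zero out the rest.
    return np.maximum(x, 0)

x = np.array([-2., -0.5, 0., 1.5, 3.])
print(relu(x))  # the negatives become 0; 1.5 and 3. pass through unchanged
```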