Lesson 8 (2019) discussion & wiki

How do we come up with the number of hidden layers and the dimensions of the hidden layers?

2 Likes

You usually want to use different kinds of data augmentation on your training set and your validation set. Jeremy was talking about normalization here.

Trial and error.

A couple of notes about @ and matrix multiplication notation in Python:

  1. https://legacy.python.org/dev/peps/pep-0465/
  2. https://docs.python.org/3/library/operator.html#operator.__matmul__
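
For reference, a minimal example of the operator (shown here with NumPy; PyTorch tensors support `@` the same way):

```python
import numpy as np

a = np.random.randn(3, 4)
b = np.random.randn(4, 2)

c = a @ b                               # PEP 465 matrix-multiplication operator
assert np.allclose(c, np.matmul(a, b))  # same thing as np.matmul / a.dot(b)
print(c.shape)                          # (3, 2)
```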
1 Like

In practice, you usually want to start with an existing architecture that is known to work well (and then tweak it to fit your needs).

1 Like

Jeremy just said Kaiming, but he meant Xavier. What he just explained is Xavier initialization, I believe. Kaiming is when you account for ReLUs.

11 Likes
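In case it helps, here is a rough sketch of the difference in plain PyTorch. The layer sizes are made up, and these are the simplified forms of the two scalings, not the exact code from the notebook:

```python
import torch

n_in, n_out = 784, 50
x = torch.randn(10_000, n_in)

# Xavier-style: scale by 1/sqrt(n_in), which keeps the variance of a plain
# linear layer roughly constant.
w_xavier = torch.randn(n_in, n_out) / n_in ** 0.5

# Kaiming-style: an extra sqrt(2) factor to compensate for ReLU throwing
# away (on average) half of the activations.
w_kaiming = torch.randn(n_in, n_out) * (2 / n_in) ** 0.5

print((x @ w_xavier).std())                             # ≈ 1 before the ReLU
print(torch.relu(x @ w_kaiming).pow(2).mean().sqrt())   # RMS ≈ 1 after the ReLU
```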

Sorry, I didn't get you…

  1. My intent was to ask whether both my train/valid sets will get normalized in the same manner or not.
  2. I presume other augmentations should be done in the same way for each?
  3. Should we pass _ for valid, or the same tfms that we apply to train?

In normalize, you have only one set of stats, so it will be used for both sets. Data augmentation is a regularization technique, so it's different: you want to apply it to your training set, but not to your validation set.

2 Likes
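Concretely, it looks something like this (plain tensors with made-up data, not the fastai API):

```python
import torch

def normalize(x, mean, std):
    return (x - mean) / std

x_train = torch.randn(1000, 784) * 3 + 7   # fake data, mean ≈ 7, std ≈ 3
x_valid = torch.randn(200, 784) * 3 + 7

# Compute the stats on the training set only...
train_mean, train_std = x_train.mean(), x_train.std()

# ...then apply the *same* stats to both sets.
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

print(x_train.mean(), x_train.std())   # ≈ 0, ≈ 1
print(x_valid.mean(), x_valid.std())   # close to 0 and 1, but not exact
```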

Why do we need our data to have mean 0 and std 1?

2 Likes

That normal distribution isn't normal… :wink:

Fixup Initialization paper link: https://arxiv.org/abs/1901.09321

1 Like

Links to the papers with Kaiming and Xavier initialization are at the top.

2 Likes

Because models like it. Making life easier for your model is the best way to get good results.
Also, if your activations aren't at scale 1, with 50 layers or deeper they are all going to become zeros (if scale < 1) or NaN (if scale > 1).

8 Likes
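You can see this for yourself in a few lines of plain PyTorch (the width, depth, and scales below are arbitrary):

```python
import torch

def deep_pass(weight_scale, n_layers=100, width=512):
    # Push one input through n_layers linear layers whose weights all have
    # the given standard deviation, and report the final activation std.
    x = torch.randn(width)
    for _ in range(n_layers):
        w = torch.randn(width, width) * weight_scale
        x = w @ x
    return x.std()

print(deep_pass(1.0))              # explodes to inf/nan
print(deep_pass(0.01))             # shrinks to exactly 0
print(deep_pass(1 / 512 ** 0.5))   # stays at a reasonable scale
```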

Is ReLU known to be better than sigmoid because it is proven empirically, or is there an intuitive reason why one non-linearity is better than the other?

1 Like

Do we have an intuition for why the models like it, apart from the vanishing weights issue?

5 Likes

It's mostly because it's the fastest non-linearity to compute. As long as you account for your non-linearity in your initialization (like Jeremy is explaining for Kaiming right now), the actual choice of non-linearity isn't that important.

8 Likes

Will the items Jeremy keeps bringing up as "do this for homework" be documented somewhere?

Not sure I got all of the items in my notes!

2 Likes

I've lost track of why we deleted all the negative values in the first place.

Well, that's actually the main reason: vanishing or exploding weights.

3 Likes

We need something to make our function non-linear. Deleting the negatives (ReLU) is a relatively simple and effective way to do this.

7 Likes
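For completeness, "deleting the negatives" really is just a one-liner; a minimal PyTorch version:

```python
import torch

def relu(x):
    # Zero out every negative value, leave positives unchanged.
    return x.clamp_min(0.)

t = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(t))   # tensor([0.0000, 0.0000, 0.0000, 1.5000, 3.0000])
```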