Lesson 8 (2019) discussion & wiki

How do we come up with the number of hidden layers and the dimensions of the hidden layers?


Usually you want to use different kinds of data augmentation on your training and validation sets. Jeremy was talking about normalization.

Trial and error.

A couple of notes about @ and matrix multiplication notation in Python:

  1. https://legacy.python.org/dev/peps/pep-0465/
  2. https://docs.python.org/3/library/operator.html#operator.__matmul__
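
To make those two links concrete, here is a minimal sketch of how `@` dispatches to `__matmul__`. The `Matrix` class is a hypothetical toy written for this post, not something from the lesson or from any library:

```python
import operator

class Matrix:
    """Tiny illustrative matrix type (hypothetical, for demonstration only)."""
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __matmul__(self, other):
        # Row-by-column products: result[i][j] = sum_k self[i][k] * other[k][j]
        cols = list(zip(*other.rows))
        return Matrix([[sum(a * b for a, b in zip(row, col)) for col in cols]
                       for row in self.rows])

a = Matrix([[1, 2], [3, 4]])
b = Matrix([[5, 6], [7, 8]])

c = a @ b                   # infix form introduced by PEP 465
d = operator.matmul(a, b)   # function form from the operator module
print(c.rows)
```

Both forms end up calling the same `__matmul__` method, so any type that defines it can be used with `@`.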

In practice, you usually want to start with an existing architecture that is known to work well (and then tweak it to fit your needs).


Jeremy just said Kaiming but he meant Xavier. What he just explained is Xavier initialization I believe. Kaiming is when you account for ReLUs.
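
A rough sketch of the difference, in plain Python rather than the lesson's code. It assumes the simplified Xavier form `std = sqrt(1 / n_in)` (the original Glorot paper averages fan-in and fan-out) and the Kaiming form `std = sqrt(2 / n_in)`. The function name, layer sizes, and seed are made up for illustration:

```python
import math
import random

def relu_layer_mean_square(weight_std, n_in=256, n_out=256, n_samples=20, seed=0):
    """Mean of squared activations after one linear layer followed by ReLU."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, weight_std) for _ in range(n_in)]
               for _ in range(n_out)]
    total, count = 0.0, 0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_in)]  # inputs with mean 0, std 1
        for row in weights:
            z = sum(w * xi for w, xi in zip(row, x))
            a = max(z, 0.0)                              # ReLU
            total += a * a
            count += 1
    return total / count

n_in = 256
xavier_std = math.sqrt(1.0 / n_in)   # simplified Xavier: does not account for ReLU
kaiming_std = math.sqrt(2.0 / n_in)  # Kaiming: extra factor of 2 for the ReLU

xavier_ms = relu_layer_mean_square(xavier_std)
kaiming_ms = relu_layer_mean_square(kaiming_std)
print(f"mean square after ReLU, Xavier-style init:  {xavier_ms:.2f}")
print(f"mean square after ReLU, Kaiming-style init: {kaiming_ms:.2f}")
```

The ReLU throws away roughly half of the signal's second moment, so the Xavier-style layer comes out near 0.5 while the Kaiming-style layer stays near 1.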


Sorry, I didn't get you...

  1. My intent was to ask whether both my train and valid sets will get normalized in the same manner.
  2. I presume other augmentations should be done the same way for each?
  3. Should we pass _ for valid, or the same tfm that we apply to train?

In normalize, you have only one set of stats, so it will be used for the two sets. Data augmentation is a regularization technique, it's different. You want to apply it to your training set, but not your validation set.
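
A small sketch of that split, using plain Python lists instead of the fastai API. The `normalize` and `augment` helpers and the noise-based augmentation are stand-ins invented for this example:

```python
import random
import statistics

def normalize(values, mean, std):
    return [(v - mean) / std for v in values]

def augment(values, rng, noise_std=0.1):
    # Stand-in augmentation (hypothetical): add small random noise.
    return [v + rng.gauss(0.0, noise_std) for v in values]

rng = random.Random(0)
train = [rng.gauss(5.0, 2.0) for _ in range(1000)]
valid = [rng.gauss(5.0, 2.0) for _ in range(200)]

# One set of stats, computed on the training set only.
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)

train_norm = normalize(train, train_mean, train_std)
valid_norm = normalize(valid, train_mean, train_std)  # same stats as train

# Augmentation goes on the training data only; validation stays untouched.
train_batch = augment(train_norm, rng)
valid_batch = valid_norm
```

The validation set ends up close to mean 0 and std 1 but not exactly, which is expected since the stats come from the training set.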


Why do we need our data to have mean 0 and std 1?


That normal distribution isn't normal... :wink:

Fixup Initialization paper link


Links to the papers with Kaiming and Xavier initialization are at the top.


Because models like it. Making life easier for your model is the best way to get good results.
Also, if your activations aren't at scale 1, with 50 layers or deeper they are all going to become zeros (if scale < 1) or nan (if scale > 1).
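
Here is a rough pure-Python sketch of that effect, assuming 50 linear layers of width 64 with no non-linearity. The function name, sizes, and seed are arbitrary choices for the example. In 64-bit floats at this depth the values become vanishingly small or astronomically large rather than literally zero or nan; lower precision or more layers would take them the rest of the way:

```python
import math
import random
import statistics

def std_after_layers(weight_std, n=64, depth=50, seed=0):
    """Push a random vector through `depth` linear layers; return its final std."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(depth):
        w = [[rng.gauss(0.0, weight_std) for _ in range(n)] for _ in range(n)]
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    return statistics.pstdev(x)

n = 64
balanced = std_after_layers(1.0 / math.sqrt(n))   # per-layer scale of about 1
too_small = std_after_layers(0.5 / math.sqrt(n))  # per-layer scale of about 0.5
too_large = std_after_layers(2.0 / math.sqrt(n))  # per-layer scale of about 2

print(f"balanced:  {balanced:.3g}")
print(f"too small: {too_small:.3g}")
print(f"too large: {too_large:.3g}")
```

A per-layer factor of 0.5 or 2 looks harmless for one layer, but it compounds to roughly 0.5^50 or 2^50 over the whole stack.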


Is ReLU known to be better than sigmoid because it has been shown empirically, or is there an intuitive reason why one non-linear layer is better than the other?


Do we have an intuition for why the models like it, apart from the vanishing weights issue?


It's mostly because it's the fastest non-linearity to implement. As long as you account for your activations in your initialization (like Jeremy is explaining for Kaiming right now), the actual choice of non-linearity isn't that important.


Will the items Jeremy keeps bringing up as "do this for homework" be documented somewhere?

Not sure I got all of the items in my notes!


I've lost track of why we deleted all the negative values in the first place.

Well that's actually the main reason, vanishing or exploding weights.


We need something to make our function non-linear. Deleting the negatives (ReLU) is a relatively simple and effective way to do this.
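
A toy illustration of why the non-linearity matters, with made-up scalar "layers" (the coefficients are arbitrary). Two linear steps stacked together are still just one linear function; putting a ReLU between them breaks that:

```python
def relu(x):
    # "Delete the negatives": negative inputs become 0, the rest pass through.
    return max(x, 0.0)

def two_linear(x):
    # Linear followed by linear collapses to a single line: 6x - 1.
    return 3.0 * (2.0 * x + 1.0) - 4.0

def with_relu(x):
    # Same two steps, but with a ReLU in between.
    return 3.0 * relu(2.0 * x + 1.0) - 4.0

xs = [-2.0, -1.0, 0.0, 1.0]
print([two_linear(x) for x in xs])  # points on the straight line 6x - 1
print([with_relu(x) for x in xs])   # flat for the negative inputs, then rising
```

Without the ReLU, adding more layers would not let the model represent anything a single linear layer could not.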