How do we come up with the number of hidden layers and the dimensions of the hidden layers?
Usually you want to use different kinds of data augmentation on your training set and your validation set. Jeremy was talking about normalization.
Trial and error.
A couple of notes about @ and matrix-multiplication notation in Python:
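For instance (a small NumPy sketch; @ is the standard Python matrix-multiplication operator from PEP 465):

```python
import numpy as np

# `@` (PEP 465) is Python's infix matrix-multiplication operator;
# for NumPy arrays, a @ b is equivalent to np.matmul(a, b).
a = np.array([[1., 2.],
              [3., 4.]])
b = np.array([[5., 6.],
              [7., 8.]])

print(a @ b)                                # matrix product, not elementwise
print(np.allclose(a @ b, np.matmul(a, b)))  # True
print(np.allclose(a * b, a @ b))            # False: * is elementwise
```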
In practice, you usually want to start with an existing architecture that is known to work well (and then tweak it to fit your needs).
Jeremy just said Kaiming but he meant Xavier. What he just explained is Xavier initialization I believe. Kaiming is when you account for ReLUs.
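To make that concrete, here's a small NumPy sketch (illustrative, not Jeremy's notebook code): Xavier scales weights so a plain linear layer preserves activation scale, but once ReLUs are involved the activations shrink layer by layer; Kaiming's extra factor of 2 compensates for the ReLU zeroing out half the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512
x = rng.standard_normal((10_000, fan_in))

def forward(x, n_layers, w_std):
    # Repeated ReLU -> linear layers, weights drawn with the given std
    z = x
    for _ in range(n_layers):
        z = np.maximum(z, 0) @ (rng.standard_normal((fan_in, fan_in)) * w_std)
    return z

# Xavier: Var(w) = 1/fan_in -- ignores the ReLU, so the scale halves
# (in variance) at every layer and the activations shrink with depth
print(forward(x, 5, 1 / np.sqrt(fan_in)).std())    # well below 1

# Kaiming: Var(w) = 2/fan_in -- the factor of 2 puts back the energy
# the ReLU threw away, so the scale stays roughly 1 at every depth
print(forward(x, 5, np.sqrt(2 / fan_in)).std())    # close to 1
```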
Sorry, I didn't get you…
- My intent was to ask whether both my train/valid sets will get normalized in the same manner or not.
- I presume other augmentations should be done in the same way on each?
- Should we pass _ for valid, or the same tfms which we apply to train?
In normalize, you have only one set of stats, so it will be used for both sets. Data augmentation is different: it's a regularization technique, so you want to apply it to your training set but not your validation set.
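A minimal NumPy sketch of that point (the normalize helper and the data here are illustrative, not fast.ai's API): compute the stats once on the training set, then reuse those same stats for the validation set.

```python
import numpy as np

def normalize(x, mean, std):
    # Always normalize with statistics computed on the TRAINING set,
    # and reuse those exact stats for the validation set.
    return (x - mean) / std

rng = np.random.default_rng(0)
train = rng.normal(5., 3., size=(1000, 10))   # made-up data, mean ~5, std ~3
valid = rng.normal(5., 3., size=(200, 10))

mean, std = train.mean(), train.std()
train_n = normalize(train, mean, std)
valid_n = normalize(valid, mean, std)   # training stats, NOT valid.mean()

# Training set lands exactly at mean 0, std 1; validation only approximately,
# which is fine -- what matters is that both go through the same transform.
print(train_n.mean(), train_n.std())
print(valid_n.mean(), valid_n.std())
```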
Why do we need our data to have mean 0 and std 1?
That normal distribution isn't normal…
Links to the papers with Kaiming and Xavier initialization are at the top.
Because models like it. Making life easier for your model is the best way to get good results.
Also, if your activations aren't at scale 1, with 50 layers or deeper they are all going to become zeros (if scale < 1) or nan (if scale > 1).
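A quick NumPy illustration of that compounding effect (purely linear layers for simplicity; the depth and scales are illustrative). Weights just slightly below or above the variance-preserving scale multiply the activation scale by a constant factor per layer, so after 50 layers the values are effectively zero or astronomically large (in float32 the latter would overflow to inf/nan):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal((100, n))

def depth50(scale):
    # 50 linear layers; `scale` multiplies the variance-preserving
    # weight std of 1/sqrt(n), so the activation std changes by
    # roughly a factor of `scale` at every layer.
    z = x
    for _ in range(50):
        z = z @ (rng.standard_normal((n, n)) * scale / np.sqrt(n))
    return z

print(np.abs(depth50(0.5)).max())   # ~1e-15: activations have vanished
print(np.abs(depth50(2.0)).max())   # ~1e15: activations have exploded
```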
Is ReLU known to be better than sigmoid because it has been proven empirically, or is there an intuitive reason why one non-linear layer is better than the other?
Do we have an intuition for why models like it, apart from the vanishing-weights issue?
It's mostly because it's the fastest non-linearity to implement. As long as you account for your activations in your initialization (like Jeremy is explaining for Kaiming right now), the actual choice of non-linearity isn't that important.
Will the items Jeremy keeps bringing up as "do this for homework" be documented somewhere?
Not sure I got all of the items in my notes!
I've lost track of why we deleted all the negative values in the first place.
Well, that's actually the main reason: vanishing or exploding weights.
We need something to make our function non-linear. Deleting the negatives (ReLU) is a relatively simple and effective way to do this.
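In code, "deleting the negatives" is a one-liner (a NumPy sketch of ReLU, not any particular library's implementation):

```python
import numpy as np

def relu(x):
    # "Delete the negatives": keep positive values, zero out the rest.
    return np.maximum(x, 0)

x = np.array([-2., -0.5, 0., 1.5, 3.])
print(relu(x))  # the negatives become 0; 1.5 and 3. pass through unchanged
```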