Is the reason an activation of 0 is bad that the gradients will then be 0, so the weights aren't adjusted and the training is useless?
Can the mean and std of each layer's output reflect the quality of training?
Is it possible to make the GeneralReLU parameters learnable by the network? Something like an AdaptiveGeneralReLU?
It's just that you are wasting information. You would be better off not even computing that activation since it's useless.
How do you decide which flavor of ReLU to use?
Try it and tell us what you find
I think you can.
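A minimal sketch of how that could work (the class name AdaptiveGeneralReLU and the default values are made up here, not from the lesson notebook): wrap the leak and shift in nn.Parameter so the optimizer updates them along with the weights.

```python
import torch
from torch import nn

class AdaptiveGeneralReLU(nn.Module):
    "GeneralReLU whose leak and shift are nn.Parameters, so the network learns them."
    def __init__(self, leak=0.1, sub=0.4, maxv=None):
        super().__init__()
        self.leak = nn.Parameter(torch.tensor(float(leak)))  # learnable negative slope
        self.sub  = nn.Parameter(torch.tensor(float(sub)))   # learnable downward shift
        self.maxv = maxv                                      # optional fixed clamp

    def forward(self, x):
        x = torch.where(x > 0, x, x * self.leak)  # leaky part; gradient flows to leak
        x = x - self.sub                          # shift so the mean stays closer to 0
        if self.maxv is not None: x = x.clamp_max(self.maxv)
        return x
```

torch.where is used instead of F.leaky_relu because the latter expects a plain float for the negative slope, not a learnable tensor.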
When should we use a context manager? How is it useful in general?
In reference to this line in the notebook …
Having given an __enter__ and __exit__ method to our Hooks class, we can use it as a context manager.
When you need to run some code to clean up after the code inside is finished. Like closing the file properly after reading lines.
You should use them whenever you have something "temporary" AND you don't want to forget to do something at the end, e.g. opening a file (and then not forgetting to close it), registering a hook and then not forgetting to deregister it…
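A stripped-down sketch of that pattern (not the actual notebook code; record_stats and the toy model are made up for illustration): the hooks are registered when the object is created and removed in __exit__, so a with block can never leave stale hooks behind.

```python
import torch
from torch import nn

class Hooks:
    "Register a forward hook on each module; remove them all when the with-block exits."
    def __init__(self, modules, fn):
        self.handles = [m.register_forward_hook(fn) for m in modules]
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        for h in self.handles: h.remove()   # the cleanup we must not forget

stats = []
def record_stats(module, inp, outp):
    stats.append((outp.mean().item(), outp.std().item()))  # per-layer mean and std

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))
with Hooks(model.children(), record_stats):   # hooks live only inside this block
    model(torch.randn(64, 10))
# hooks are removed here, even if the forward pass raised an exception
```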
Any ImageNet model that uses LeakyReLU?
Wasting information as in the result of the calculation doesn't change any of the weights? Or does the resulting calculation just give us an activation full of 0s?
Two situations:
- you want to make sure something is closed when you're finished (the file you're reading, your hooks are removed)
- you are using a temporary substitute for a parameter: for instance you want to change your loss function for a bit, so you want it to change when you enter the context manager and be put back to its original value when you exit (a sketch of this pattern is below).
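A sketch of the second situation, assuming a hypothetical learner object with a loss_func attribute (the names use_loss and learn are made up for illustration):

```python
from contextlib import contextmanager
import torch.nn.functional as F

@contextmanager
def use_loss(learner, loss_func):
    "Temporarily swap the learner's loss function, restoring the original on exit."
    old = learner.loss_func
    learner.loss_func = loss_func
    try:
        yield learner
    finally:
        learner.loss_func = old   # put back even if an exception was raised

# with use_loss(learn, F.l1_loss):
#     learn.fit(1)   # trains with L1 loss inside the block, then reverts
```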
Do you have a comparison of how the model performed with standard ReLU vs shifted ReLU?
The first one. A neural net is cramming information into a tiny state, and having too many of those activations be 0 is a real waste.
Off topic, but when Jeremy said "a thousand layers deep" I just can't help myself: https://www.youtube.com/watch?v=46cSksKVzzs
Do people use Kaiming initialization to initialize the "mults" for batch normalization?
No, that's not what Kaiming init is for: Kaiming init is about using the right scale for linear or conv layers.
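To illustrate the distinction (a small sketch; the BatchNorm defaults shown are PyTorch's own, not something Kaiming init provides):

```python
import torch
from torch import nn

conv = nn.Conv2d(3, 16, 3)
nn.init.kaiming_normal_(conv.weight)   # Kaiming scaling belongs on conv/linear weights

bn = nn.BatchNorm2d(16)
# PyTorch initializes the batchnorm "mults" (bn.weight) to 1 and the adds (bn.bias) to 0
print(bn.weight.data.unique(), bn.bias.data.unique())   # tensor([1.]) tensor([0.])
```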
What do mom and eps stand for?