Lesson 8 - Official topic

The floating point discussion is a reference to Rachel’s course on Linear Algebra :slight_smile: lots of fun too. Would love an updated version of that as well :slight_smile:


Could we somehow use regularization to try to make the RNN parameters close to the identity matrix? Or would that cause bad results because the hidden layers want to deviate from the identity during training (and thus tend to explode/vanish)?
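For reference, the kind of regularizer being asked about could be an L2 penalty that pulls the hidden-to-hidden weight matrix toward the identity. A minimal PyTorch sketch of that idea (the tiny RNN and the penalty strength are made up for illustration, not from the lesson):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: add a penalty that pulls the recurrent (hidden-to-hidden)
# weight matrix toward the identity, on top of the usual task loss.
rnn = nn.RNN(input_size=10, hidden_size=10, batch_first=True)
criterion = nn.MSELoss()
lam = 1e-3  # penalty strength (made-up value)

def loss_with_identity_penalty(output, target):
    task_loss = criterion(output, target)
    w_hh = rnn.weight_hh_l0                         # hidden-to-hidden weights
    eye = torch.eye(w_hh.shape[0], device=w_hh.device)
    penalty = ((w_hh - eye) ** 2).sum()             # squared Frobenius distance to I
    return task_loss + lam * penalty

x = torch.randn(4, 5, 10)   # (batch, seq_len, features)
y = torch.randn(4, 5, 10)
out, _ = rnn(x)
loss_with_identity_penalty(out, y).backward()
```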


Is there a way to quickly check if the activations are vanishing / exploding?


Floating point is discussed starting around minute 54 of this video from the computational linear algebra course:


Thanks Rachel!!!

Check this out: The colorful dimension


Yes, but that could take quite some time; if you have the compute, go for it. Even random search works. I find Bayesian optimization a bit better. Or you could look into the adaptive resampling method and this notebook.
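For reference, plain random search is just sampling hyperparameter values and keeping the best run. A minimal sketch, where the search ranges and the `train_and_score` helper are hypothetical placeholders:

```python
import random

# Hypothetical sketch of random search; the ranges and train_and_score()
# are placeholders, not anything from the lesson.
def train_and_score(lr, wd):
    # Swap in real training here; a fake score keeps the example runnable.
    return random.random()

best = None
for _ in range(20):                       # 20 random trials
    lr = 10 ** random.uniform(-4, -1)     # log-uniform learning rate
    wd = 10 ** random.uniform(-6, -2)     # log-uniform weight decay
    score = train_and_score(lr, wd)
    if best is None or score > best[0]:
        best = (score, lr, wd)

print("best (score, lr, wd):", best)
```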

How exploding/vanishing gradients work: [screenshot: compounding a small daily change, 1.01^365 ≈ 37.8 vs 0.99^365 ≈ 0.03]
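The arithmetic behind that picture is easy to reproduce, and the same compounding happens to gradients that are repeatedly multiplied by something slightly larger or smaller than 1:

```python
# Multiplying by a factor slightly above or below 1 many times compounds fast,
# which is the intuition behind exploding/vanishing gradients.
print(1.01 ** 365)   # ≈ 37.78
print(0.99 ** 365)   # ≈ 0.026
```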


ActivationStats
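For example, assuming fastai v2, the ActivationStats callback can be attached to a Learner to record the mean/std of activations per layer (the dataset and architecture below are just placeholders):

```python
from fastai.vision.all import *

# Record activation statistics while training; with_hist=True also keeps
# histograms so you can draw the "colorful dimension" plot mentioned above.
path = untar_data(URLs.MNIST_SAMPLE)            # placeholder dataset
dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, metrics=accuracy,
                    cbs=ActivationStats(with_hist=True))
learn.fit_one_cycle(1)

# Mean, std and fraction of near-zero activations for the first layer:
learn.activation_stats.plot_layer_stats(0)
# Histogram-over-time view of a later layer (the "colorful dimension"):
learn.activation_stats.color_dim(-2)
```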


Original dropout paper here


Does dropout somehow skip the computation or just set the activation to zero?

Does “deleting an activation” mean setting it to zero?

It just sets it to zero.
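A quick way to see this with PyTorch's nn.Dropout (a standalone sketch, not code from the lesson):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
acts = torch.randn(2, 8)      # pretend these are activations coming out of a layer
drop = nn.Dropout(p=0.5)      # modules are in training mode by default

print(drop(acts))
# Roughly half the entries are exactly 0; the survivors are scaled by 1/(1-p)
# so the expected activation stays the same. No computation is skipped.
```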


Hinton’s intuitions behind dropout (he has two reasons for it :slight_smile: ):


For dropout, if a unit was set to zero during training, which weights are used at test time?

Dropout is only applied during training.

ok, but then for testing, which weights does it use if they were set to zero?

No weights were set to zero. Dropout is applied to activations.
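A small PyTorch sketch of that distinction: in training mode dropout zeroes some activations, while in eval mode it is a no-op and every weight and activation contributes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(1, 4)

layer.train()     # training: some activations are randomly zeroed
print(layer(x))

layer.eval()      # inference: dropout is a no-op, all weights and activations are used
print(layer(x))
```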


A general version of dropout was also proposed much earlier (Hanson 1990) but is rarely cited.

You can find many more instances of these “we or someone else did it first” claims on Schmidhuber’s blog, and here’s a rebuttal from Hinton.

EDIT: rather than adding another post I’m responding here :slight_smile: I’m inclined to agree that it’s a bit of a stretch (in addition to some of the other claims on that blog post). However, Hanson did also co-author another paper in 2018: Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning.


Thanks for sharing. It seems more and more that Schmidhuber and his team, or other groups, did everything already, but that’s a discussion for another day! :wink:
