Lesson 10 Discussion & Wiki (2019)

When you need to run some code to clean up after the code inside is finished, like closing a file properly after reading its lines.

You should use them whenever you have something “temporary” AND you don’t want to forget to do something at the end, e.g. opening a file (and then not forgetting to close it), registering a hook (and then not forgetting to deregister it)…

1 Like
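For example, here is a minimal sketch (hypothetical helper names, not code from the lesson) of a context manager that registers a forward hook and guarantees it gets removed, even if the code inside the block raises:

```python
from contextlib import contextmanager

@contextmanager
def hooked(module, hook_fn):
    # Hypothetical helper: register a forward hook and make sure it is
    # removed when we leave the block, even if an exception is raised.
    handle = module.register_forward_hook(hook_fn)
    try:
        yield handle
    finally:
        handle.remove()

# Usage: the hook only lives for the duration of the with-block, e.g.
# with hooked(model[0], lambda m, inp, out: print(out.shape)):
#     model(xb)
```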

Any ImageNet model that uses LeakyReLU?

Wasting information as in the result of the calculation doesn’t change any of the weights? Or as in the calculation just gives us an activation full of 0s?

Two situations:

  • you want to make sure something is closed when you’re finished (the file you’re reading is closed, your hooks are removed)
  • you are using a temporary substitute for a parameter: for instance you want to change your loss function for a bit, so you want it to change when you enter the context manager and be put back to its original value when you exit (see the sketch after this post).
4 Likes
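A sketch of the second case, assuming a hypothetical `learn` object with a `loss_func` attribute; the original loss function is restored whatever happens inside the block:

```python
from contextlib import contextmanager

@contextmanager
def use_loss(learn, temp_loss):
    # Swap in a temporary loss function, then put the original back,
    # even if an exception is raised inside the block.
    old_loss = learn.loss_func
    learn.loss_func = temp_loss
    try:
        yield learn
    finally:
        learn.loss_func = old_loss
```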

By the way, there was a discussion about context managers some time ago.

4 Likes

Do you have a comparison of how the model performed with standard ReLU vs shifted ReLU?

1 Like

The first one. A neural net is cramming information into a tiny state. Having too many of those be 0 is a real waste.

1 Like
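One quick way to see the waste is to count how many activations a plain ReLU zeroes out, and compare with a leaky, shifted variant (a hypothetical stand-in for the shifted ReLU discussed in the lesson):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(10_000)  # pretend these are pre-activations

relu_out = F.relu(x)
print((relu_out == 0).float().mean())   # ~0.5: about half the activations are exactly 0
print(relu_out.mean(), relu_out.std())  # mean is pushed well above 0

def shifted_leaky_relu(x, leak=0.1, sub=0.4):
    # Hypothetical variant: the leak keeps negatives alive, the subtraction re-centres the output
    return F.leaky_relu(x, leak) - sub

out = shifted_leaky_relu(x)
print((out == 0).float().mean())        # almost no activations are exactly 0
print(out.mean(), out.std())            # mean is much closer to 0
```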

Off topic, but when Jeremy said a thousand layers deep I just can’t help thinking of https://www.youtube.com/watch?v=46cSksKVzzs

1 Like

Do people use Kaiming initialization to initialize the “mults” for batch normalization?

1 Like

No, that’s not what Kaiming init is for: Kaiming init gives the right scale to the weights of linear or conv layers.

1 Like
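In PyTorch terms, Kaiming init goes on the conv/linear weights, while the batchnorm scale (the “mults”) is normally just started at 1 and the shift at 0. A minimal sketch:

```python
import torch.nn as nn

conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)
nn.init.kaiming_normal_(conv.weight)  # right scale for the conv weights

bn = nn.BatchNorm2d(32)
nn.init.ones_(bn.weight)   # the "mults": start as an identity scaling
nn.init.zeros_(bn.bias)    # the "adds": start with no shift
```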

What do mom and eps stand for?

momentum

1 Like

eps is epsilon (a tiny value)

1 Like

mom is the parameter we use in the moving average. It kind of looks like the momentum in SGD, which is why we named it this way.

2 Likes

Epsilon in the denominator is there to ensure we don’t divide by zero if our batch standard deviation happens to be zero.

5 Likes
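Putting the two together, here is a simplified batchnorm-style forward pass (a sketch, not the exact notebook code) showing where mom and eps come in:

```python
import torch
import torch.nn as nn

class SimpleBatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.mults = nn.Parameter(torch.ones(nf, 1, 1))
        self.adds  = nn.Parameter(torch.zeros(nf, 1, 1))
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))
        self.register_buffer('vars',  torch.ones(1, nf, 1, 1))

    def forward(self, x):
        if self.training:
            m = x.mean((0, 2, 3), keepdim=True)
            v = x.var((0, 2, 3), keepdim=True)
            with torch.no_grad():
                # moving average of the batch statistics, weighted by mom
                self.means.lerp_(m, self.mom)
                self.vars.lerp_(v, self.mom)
        else:
            m, v = self.means, self.vars
        # eps keeps the denominator away from zero
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mults + self.adds
```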

So t.lerp_(m, mom) means t * (1 - mom) + m * (mom)? Or the other way around?
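A quick numeric check: PyTorch defines lerp(start, end, weight) as start + weight * (end - start), which is the same as start * (1 - weight) + end * weight, so the first reading is the right one:

```python
import torch

t, m, mom = torch.tensor([10.0]), torch.tensor([20.0]), 0.1

t.lerp_(m, mom)
print(t)                              # tensor([11.])
print(10.0 * (1 - mom) + 20.0 * mom)  # 11.0, i.e. t * (1 - mom) + m * mom
```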

I don’t really get bias=not bn. Could someone explain it?

EDIT: got it, sorry, I was thinking about the wrong thing
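For reference, the idiom comes from building conv layers: if a batchnorm layer follows the conv, batchnorm’s own shift (the “adds”) makes the conv bias redundant, so the bias is only created when there is no batchnorm. A minimal sketch (hypothetical helper, not the exact notebook code):

```python
import torch.nn as nn

def conv_layer(ni, nf, ks=3, stride=2, bn=True):
    # The conv bias would be cancelled by batchnorm's mean subtraction and
    # replaced by batchnorm's own shift, so only keep it when there is no bn.
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks // 2, bias=not bn)]
    if bn:
        layers.append(nn.BatchNorm2d(nf))
    return nn.Sequential(*layers)
```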

Can’t help but wonder if mom should also be a learned parameter initialized at 0.9 though.

1 Like

(1) So we gain access to lerp_ by creating a buffer, and (2) is it the same general thought process as with the SWA callback?
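As far as I understand, lerp_ is an ordinary tensor method, so the buffer isn’t what gives access to it; registering the running statistics as a buffer is about making them part of the module’s state (moved with .to(device), saved in the state_dict) without being trained as parameters. A minimal sketch of the pattern (hypothetical module):

```python
import torch
import torch.nn as nn

class RunningStat(nn.Module):
    def __init__(self, nf, mom=0.1):
        super().__init__()
        self.mom = mom
        # A buffer travels with the module (.to(device), state_dict)
        # but is not a Parameter, so the optimizer never updates it.
        self.register_buffer('running_mean', torch.zeros(nf))

    def update(self, x):
        # lerp_ works on any float tensor, buffer or not
        self.running_mean.lerp_(x.mean(0), self.mom)
```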