Lesson 10 Discussion & Wiki (2019)

momentum

1 Like

eps is epsilon (a tiny value)

1 Like

mom is the parameter we use in the moving average. It kind of looks like the momentum in SGD, which is why we named it this way.

2 Likes

Epsilon in the denominator is there to ensure we don’t divide by zero if our batch standard deviation happens to be zero.

5 Likes
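To make eps and mom concrete, here is a minimal sketch of the kind of custom BatchNorm being discussed (simplified from memory, not the exact notebook code): eps sits under the square root in the denominator, and mom is the weight passed to lerp_ when updating the running-stats buffers.

import torch
from torch import nn

class SimpleBatchNorm(nn.Module):
    "Simplified sketch, not the exact notebook code."
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.mults = nn.Parameter(torch.ones(nf, 1, 1))
        self.adds  = nn.Parameter(torch.zeros(nf, 1, 1))
        # buffers are saved with the model but not updated by the optimizer
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))
        self.register_buffer('vars',  torch.ones(1, nf, 1, 1))

    def update_stats(self, x):
        m = x.mean((0, 2, 3), keepdim=True)
        v = x.var((0, 2, 3), keepdim=True)
        # lerp_(new, mom): keep (1 - mom) of the old value, take mom of the new one
        self.means.lerp_(m, self.mom)
        self.vars.lerp_(v, self.mom)
        return m, v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m, v = self.update_stats(x)
        else: m, v = self.means, self.vars
        x = (x - m) / (v + self.eps).sqrt()   # eps keeps this finite if v is zero
        return x * self.mults + self.adds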

So t.lerp_(m, mom) means t * (1 - mom) + m * (mom)? Or the other way around?

I don't really get bias=not bn. Could someone explain it?

EDIT: got it, sorry, I was thinking about the wrong thing

Can’t help but wonder if mom should also be a learned parameter initialized at 0.9 though.

1 Like

(1) So we gain access to lerp via creating a buffer, and (2) is it the same general thought process as the SWA callback?

not bn returns True when bn is False. So when you create a conv layer without batch norm you get a bias, but when it has batch norm, no bias.

3 Likes
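To put bias=not bn in context, the conv helper looks roughly like this (a simplified sketch, not the exact notebook code):

from torch import nn

def conv_layer(ni, nf, ks=3, stride=2, bn=True):
    # Batch norm has its own learned shift (the adds/beta), which makes the
    # conv's bias redundant, so only add a bias when there is no batch norm.
    layers = [nn.Conv2d(ni, nf, ks, padding=ks // 2, stride=stride, bias=not bn)]
    if bn: layers.append(nn.BatchNorm2d(nf))
    return nn.Sequential(*layers)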

I never remember. Try it in a notebook and check!

All the norm papers that will be mentioned are linked at the top of the wiki.

2 Likes

Ha, I was just doing that.

>>> t = torch.Tensor([1, 2, 3])
>>> m = torch.Tensor([0, 0, 0])
>>> t.lerp(m, 0.8)  # = (1 - 0.8) * t + 0.8 * m, so the weight applies to m
tensor([0.2000, 0.4000, 0.6000])

Not to be a contrarian, but this makes more sense to me than the other way around, at least in the way that I’m thinking about it – a higher momentum means you move more strongly towards the new thing (in this case, m).

2 Likes

In moving averages, usually the higher the momentum, the less you move toward the new thing.

1 Like
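To spell out the two conventions being talked past each other (a small sketch with made-up numbers):

import torch

avg = torch.ones(3)                 # current running average
new = torch.tensor([1., 2., 3.])    # new observation

# lerp_ / BatchNorm convention: the weight applies to the new value,
# so mom is usually small (e.g. 0.1)
bn_style = avg.lerp(new, 0.1)       # = 0.9 * avg + 0.1 * new

# SGD-momentum convention: the weight applies to the history,
# so mom is usually large (e.g. 0.9)
sgd_style = 0.9 * avg + (1 - 0.9) * new

print(torch.allclose(bn_style, sgd_style))  # True: same update, opposite naming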

I’m unable to run this latest cell, btw. Anyone else getting NameError: name 'ScriptModule' is not defined? I must be missing something here

class LayerNorm(ScriptModule):

Perhaps “inertia” would be a better name for moving averages, then

2 Likes

You’re missing jit and the imports that go with it (probably a cell up).
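I don't have the notebook open, but the missing piece is an import along these lines (exact cell is a guess):

from torch.jit import ScriptModule   # this is the name the NameError complains about

class LayerNorm(ScriptModule):
    ...   # rest of the cell as in the notebook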

if you think of momentum as the effect that your previous direction has on your current direction, then larger momentum would mean a bigger effect. Not using momentum means not taking that past info into account.

11 Likes
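A tiny sketch of that "previous direction" view, using plain SGD momentum with made-up numbers:

import torch

mom, lr = 0.9, 0.1
p = torch.zeros(3)                   # a toy parameter
v = torch.zeros(3)                   # velocity: the previous direction
for grad in (torch.ones(3), torch.zeros(3)):
    v = mom * v + grad               # the old direction still contributes mom * v
    p = p - lr * v
print(v)                             # after a zero gradient, v is still 0.9 of the last step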

That’s very clear, thanks!

You could also read about discounted rewards in Reinforcement Learning. It uses a similar concept: the discounting rate (like momentum) helps to balance historical experience with new observations. I guess it could also be related to filtering in signal theory and things like the Kalman filter.

1 Like
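For anyone curious, the discounted return works like this (toy numbers):

# Discounted return: G_t = r_t + gamma * G_{t+1}, computed backwards.
# Older rewards keep getting multiplied by gamma, much like an EMA
# down-weights older observations.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
print(G)   # 1 + 0.9*0 + 0.9**2 * 2 = 2.62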

All of these batch norm variations are for images and convnets. What if you are training on non-image data, maybe text? What would be the channel dimension there?

2 Likes