Lesson 10 Discussion & Wiki (2019)

momentum

1 Like

eps is epsilon (a tiny value)

1 Like

mom is the parameter we use in the moving average. It kind of looks like the momentum in SGD, which is why we named it this way.

2 Likes

Epsilon in the denominator is there to ensure we don’t divide by zero if our batch standard deviation happens to be zero.

5 Likes
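To make eps and mom concrete, here is a minimal sketch of the kind of custom BatchNorm being discussed (simplified from memory, not the exact notebook code): eps sits under the square root in the denominator, and mom is the weight passed to lerp_ when updating the running-stats buffers.

import torch
from torch import nn

class SimpleBatchNorm(nn.Module):
    "Simplified sketch, not the exact notebook code."
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.mults = nn.Parameter(torch.ones(nf, 1, 1))
        self.adds  = nn.Parameter(torch.zeros(nf, 1, 1))
        # buffers are saved with the model but not updated by the optimizer
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))
        self.register_buffer('vars',  torch.ones(1, nf, 1, 1))

    def update_stats(self, x):
        m = x.mean((0, 2, 3), keepdim=True)
        v = x.var((0, 2, 3), keepdim=True)
        # lerp_(new, mom): keep (1 - mom) of the old value, take mom of the new one
        self.means.lerp_(m, self.mom)
        self.vars.lerp_(v, self.mom)
        return m, v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m, v = self.update_stats(x)
        else: m, v = self.means, self.vars
        x = (x - m) / (v + self.eps).sqrt()   # eps keeps this finite if v is zero
        return x * self.mults + self.adds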

So t.lerp_(m, mom) means t * (1 - mom) + m * (mom)? Or the other way around?

I don't really get bias=not bn. Could someone explain it?

EDIT: got it, sorry, I was thinking about the wrong thing

Can’t help but wonder if mom should also be a learned parameter initialized at 0.9 though.

1 Like

(1) So we gain access to lerp via creating a buffer, and (2) is it the same general thought process as the SWA callback?

not bn returns True when bn is False. So when you create a conv layer without batch norm you get a bias, but when it has batch norm, no bias.

3 Likes
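To put bias=not bn in context, the conv helper looks roughly like this (a simplified sketch, not the exact notebook code):

from torch import nn

def conv_layer(ni, nf, ks=3, stride=2, bn=True):
    # Batch norm has its own learned shift (the adds/beta), which makes the
    # conv's bias redundant, so only add a bias when there is no batch norm.
    layers = [nn.Conv2d(ni, nf, ks, padding=ks // 2, stride=stride, bias=not bn)]
    if bn: layers.append(nn.BatchNorm2d(nf))
    return nn.Sequential(*layers)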

I never remember. Try it in a notebook and check!

All the norm papers that will be mentioned are linked at the top of the wiki.

2 Likes

Ha, I was just doing that.

>>> t = torch.Tensor([1, 2, 3])
>>> m = torch.Tensor([0, 0, 0])
>>> t.lerp(m, 0.8)  # = (1 - 0.8) * t + 0.8 * m, so the weight applies to m
tensor([0.2000, 0.4000, 0.6000])

Not to be a contrarian, but this makes more sense to me than the other way around, at least in the way that I’m thinking about it – a higher momentum means you move more strongly towards the new thing (in this case, m).

2 Likes

In moving averages, usually the higher the momentum, the less you move toward the new thing.

1 Like
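To spell out the two conventions being talked past each other (a small sketch with made-up numbers):

import torch

avg = torch.ones(3)                 # current running average
new = torch.tensor([1., 2., 3.])    # new observation

# lerp_ / BatchNorm convention: the weight applies to the new value,
# so mom is usually small (e.g. 0.1)
bn_style = avg.lerp(new, 0.1)       # = 0.9 * avg + 0.1 * new

# SGD-momentum convention: the weight applies to the history,
# so mom is usually large (e.g. 0.9)
sgd_style = 0.9 * avg + (1 - 0.9) * new

print(torch.allclose(bn_style, sgd_style))  # True: same update, opposite naming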

I’m unable to run this latest cell, btw. Anyone else getting NameError: name 'ScriptModule' is not defined? I must be missing something here

class LayerNorm(ScriptModule):

Perhaps “inertia” would be a better name for moving averages, then

2 Likes

You’re missing jit and the imports that go with it (probably a cell up).
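I don't have the notebook open, but the missing piece is an import along these lines (exact cell is a guess):

from torch.jit import ScriptModule   # this is the name the NameError complains about

class LayerNorm(ScriptModule):
    ...   # rest of the cell as in the notebook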

if you think of momentum as the effect that your previous direction has on your current direction, then larger momentum would mean a bigger effect. Not using momentum means not taking that past info into account.

11 Likes
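A tiny sketch of that "previous direction" view, using plain SGD momentum with made-up numbers:

import torch

mom, lr = 0.9, 0.1
p = torch.zeros(3)                   # a toy parameter
v = torch.zeros(3)                   # velocity: the previous direction
for grad in (torch.ones(3), torch.zeros(3)):
    v = mom * v + grad               # the old direction still contributes mom * v
    p = p - lr * v
print(v)                             # after a zero gradient, v is still 0.9 of the last step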

That’s very clear, thanks!

You could also read about discounted rewards in Reinforcement Learning. It uses a similar concept: the discounting rate (like momentum) helps to balance historical experience with new observations. I guess it could also be related to filtering in signal theory and things like the Kalman filter.

1 Like
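For anyone curious, the discounted return works like this (toy numbers):

# Discounted return: G_t = r_t + gamma * G_{t+1}, computed backwards.
# Older rewards keep getting multiplied by gamma, much like an EMA
# down-weights older observations.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G
print(G)   # 1 + 0.9*0 + 0.9**2 * 2 = 2.62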

All of these batch norm variations are for images and convnets. What if you are training on non-image data, maybe text? What would be the channel dimension there?

2 Likes