Lesson 10 Discussion & Wiki (2019)

Actually, that counter is still in the current version, although it would need to be updated even when not calculating stats.

My thought is that just recalculating stats occasionally is the best of both worlds. Especially if you recalculate every (n) batches rather than every (n) steps.
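A rough sketch of what I have in mind (the class name, the `every` interval and the plain-BN body are all made up for illustration):

    import torch
    from torch import nn

    class OccasionalStatsBN(nn.Module):
        "Hypothetical sketch: only refresh the stored stats every `every` batches."
        def __init__(self, nf, every=10, eps=1e-5):
            super().__init__()
            self.every, self.eps, self.batches = every, eps, 0
            self.register_buffer('means', torch.zeros(1, nf, 1, 1))
            self.register_buffer('vars',  torch.ones(1, nf, 1, 1))
            self.mults = nn.Parameter(torch.ones(1, nf, 1, 1))
            self.adds  = nn.Parameter(torch.zeros(1, nf, 1, 1))

        def forward(self, x):
            if self.training:
                self.batches += 1
                if self.batches % self.every == 0:   # recalculate only every n batches
                    with torch.no_grad():            # stats updates shouldn't build a graph
                        self.means.copy_(x.mean((0, 2, 3), keepdim=True))
                        self.vars.copy_(x.var((0, 2, 3), keepdim=True))
            x = (x - self.means) / (self.vars + self.eps).sqrt()
            return x * self.mults + self.adds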

Most of the time, you don’t need the clone but can just do x.detach() - that is unless you have something modifying x in place.
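For instance (a toy example; the in-place add is just to show the aliasing):

    import torch

    x = torch.randn(3, requires_grad=True)
    y = x.detach()               # shares storage with x, just detached from the graph
    z = x.detach().clone()       # independent copy - needed if x gets modified in place

    with torch.no_grad():
        x.add_(1.0)              # in-place change to x...
    print(y)                     # ...shows up in y (same storage)
    print(z)                     # ...but not in the clone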

I must admit you lost me on what you're trying to achieve and what the exact dimensions of your tensors are. Common BN will take averages over bs, h, w (per channel), while LN will average over channel, h, w (per sample).
So to me, running BN has three elements beyond regular BN: 1) use an online variance estimate instead of the mean of per-batch variances, 2) always use the aggregate stats, 3) scale the momentum with the batch size.
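To illustrate the averaging dimensions (shapes here are just an example):

    import torch

    x = torch.randn(64, 16, 8, 8)              # (bs, channels, h, w)
    bn_mean = x.mean((0, 2, 3), keepdim=True)  # BN: average over bs, h, w -> one value per channel
    ln_mean = x.mean((1, 2, 3), keepdim=True)  # LN: average over channel, h, w -> one value per sample
    print(bn_mean.shape, ln_mean.shape)        # torch.Size([1, 16, 1, 1]) torch.Size([64, 1, 1, 1])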

So I must admit that I missed where the mom1 formula with sqrt(n-1) comes from; that seems to be the most problematic part for n=1.
I just checked the video now (without sound), and the mom1 = 1-(1-mom)**bs used there looked roughly "as might be expected", probably because I was thinking of sub-annual discount factors and default rates. (Even though, apparently, two years ago I was thinking more of a linear adjustment of the momentum.)
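A quick sketch of that reading (mom and bs are just example values):

    mom, bs = 0.1, 32                    # illustrative values
    mom1 = 1 - (1 - mom)**bs             # the per-batch momentum from the video
    # the weight left on the old running stats after one batch update...
    kept_once = 1 - mom1
    # ...equals lerp-ing bs times, once per sample, with the per-sample mom,
    # i.e. the per-sample factors compound like discount factors:
    kept_per_sample = (1 - mom)**bs
    assert abs(kept_once - kept_per_sample) < 1e-12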

Why is “sudden large change” better than incrementally changing with a step size that depends on batch size? But I missed the change from the geometric momentum interpolation to the division by square root of n-1, too… (Is that an erratum as well?)

It was changed in this commit right after the lesson was filmed, but I don’t know why.

@t-v, do you by chance have an insight on how to avoid using register_buffer and yet get the temp module variables to automatically follow the device when the whole model is moved to say a specific device?

Something that would tell pytorch this variable should follow the module's device setting, so that it doesn't have to be switched manually - manual switching won't work in a generic module like BN.

I looked at the register_buffer implementation, but it doesn't need to do anything special, since it adds to self._buffers, which gets automatically moved to the right device with the model.

To summarize - we have temp variables that need to be on the same device as the model but we don’t want to store them in the model’s state_dict, since they are only used during training.

The only workaround I can think of is to do:

    def __init__(self, ...):
        [...]
        #self.register_buffer('sums', torch.zeros(1,nf,1,1))
        self.sums = torch.zeros(1,nf,1,1)

    def forward(self, x):
        self.sums.to(x.device)  # silently does nothing: sums remains on cpu, while x is on cuda
        self.sums.lerp_(x.sum((0,2,3), keepdim=True), self.mom)  # fails on cuda-cpu mismatch (args here just illustrative)

But it won't cast it to cuda! self.sums remains on cpu despite the "casting", even though x.device returns cuda:0. And at __init__ time we don't know the device yet.

Thank you!

Just speculating, but maybe fastai.audio could be called fastai.signals instead. There are many applications in the signal processing field that are not restricted to audio. One example from my experience is blind source separation, which seeks to extract source signals from mixed signals. In audio this would be equivalent to extracting individual instruments' sounds from a song, but the same techniques can be applied in other fields.

fastai.dsp (digital signal processing?)

One mathematical thing I wanted to mention is that I'd think that keeping track of the variance directly, rather than the sum of squares, might be numerically advantageous. (The Wikipedia references being the var/std discussion on the moving average page, and Welford's algorithm and its variants.)

In my experience, the BN stats can be numerically quite sensitive.
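A minimal sketch of the Welford-style update (plain Python, names are my own):

    def welford_update(count, mean, m2, new_value):
        "One Welford step: update the running count, mean and sum of squared deviations."
        count += 1
        delta = new_value - mean
        mean += delta / count
        m2 += delta * (new_value - mean)
        return count, mean, m2

    count, mean, m2 = 0, 0.0, 0.0
    for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        count, mean, m2 = welford_update(count, mean, m2, v)
    variance = m2 / count        # population variance; use count - 1 for the sample variance
    print(mean, variance)        # 5.0 4.0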

Best regards

Thomas

Short of redefining the .to, .cuda and .cpu methods (and .float, .double, .half, too), I’m not aware of any. More elegant might be to do the moving in the update_stats (moving is a nop if you’re already there).
Between that and you wanting to serialize momentum, I'd suggest that you look into state dict hooks (and we all :heart: hooks, right?).
I'm not sure if and where there is extensive documentation of that, but you'd need Module._register_state_dict_hook and Module._register_load_state_dict_pre_hook.
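Roughly what I mean by moving in update_stats, as a sketch only (untested; the class name, the lerp weight and the omitted normalisation are all made up):

    import torch
    from torch import nn

    class RunningBNSketch(nn.Module):
        "Sketch: temp stats follow the input's device and never enter state_dict."
        def __init__(self, nf, mom=0.1):
            super().__init__()
            self.mom = mom
            # plain attributes, not buffers: .to()/.cuda() on the model won't move
            # them, but they also never show up in state_dict
            self.sums = torch.zeros(1, nf, 1, 1)
            self.sqrs = torch.zeros(1, nf, 1, 1)

        def update_stats(self, x):
            # Tensor.to() is out-of-place: it returns a tensor (a no-op if the device
            # already matches), so the result has to be assigned back
            self.sums = self.sums.to(x.device, x.dtype)
            self.sqrs = self.sqrs.to(x.device, x.dtype)
            with torch.no_grad():        # stats updates shouldn't end up in the graph
                s  = x.sum((0, 2, 3), keepdim=True)
                ss = (x * x).sum((0, 2, 3), keepdim=True)
                self.sums.lerp_(s, self.mom)
                self.sqrs.lerp_(ss, self.mom)

        def forward(self, x):
            if self.training: self.update_stats(x)
            return x             # the actual normalisation is left out of the sketch

The state dict hooks would then let you add or drop entries (e.g. the momentum) when the dict is built or loaded; since they're private API, check the Module source of your version for the exact signatures.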

We changed it because the power made it too close to 1 when the batch size was normal. That's why we switched to this function, which goes more slowly from 0 to 1.
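Some illustrative numbers with mom = 0.1 (the batch sizes are arbitrary, and I'm assuming the new rule is the 1-(1-mom)/sqrt(n-1) form discussed above):

    import math

    mom = 0.1                                      # illustrative value only
    for bs in (2, 32, 128):
        power = 1 - (1 - mom)**bs                  # the old power form
        sqrt_ = 1 - (1 - mom) / math.sqrt(bs - 1)  # assumed form of the sqrt(n-1) rule
        print(bs, round(power, 3), round(sqrt_, 3))
    # at bs=32 the power form is already ~0.97 (nearly 1), while the sqrt form stays ~0.84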

Yes, this is where I tried to do it: I moved it in update_stats and the moving silently fails, see my code example above. A bug in pytorch?

So I'm too tired to write it more politely, but I have a hunch that you might have fixed the wrong thing. What happens is that with that formula you set things up so that mom == mom1 for a batch size of 1, while it might be tempting to leave the mom parameter as what people use with bs = 32 or whatever. So I would suggest that the right thing is not to change the formula, but to change self.mom, because you moved it from the unit "1 batch" to the unit "1 sample" (and as you might have guessed from the above, to me the exponent here is "like time").
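In numbers (values purely illustrative, using the 1-(1-mom)**bs form):

    mom_batch, ref_bs = 0.1, 32                 # mom as people would tune it at bs=32
    # the per-sample momentum that reproduces it at the reference batch size:
    mom_sample = 1 - (1 - mom_batch)**(1 / ref_bs)
    print(mom_sample)                           # ~0.0033
    # sanity check: plugging it back in at bs=32 recovers the per-batch value
    assert abs((1 - (1 - mom_sample)**ref_bs) - mom_batch) < 1e-12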

Best regards

Thomas

Running batch norm allows for a high learning rate. I didn't have more than a few minutes this week for your 1 epoch challenge, so, keeping the batch size at 32, I decided to only vary the learning rate. I got some very interesting results; here are the results I got with an lr of 2.

The mom is used differently from what people do in any case: no one uses the averaged statistics during training in regular BatchNorm.
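For reference, regular nn.BatchNorm2d only consults its running stats in eval mode (a quick check; shapes and numbers are arbitrary):

    import torch
    from torch import nn

    bn = nn.BatchNorm2d(3)
    x = torch.randn(16, 3, 8, 8) * 5 + 2       # deliberately not zero-mean/unit-var

    bn.train()
    y_train = bn(x)                            # normalised with the current batch's stats
    bn.eval()
    y_eval = bn(x)                             # normalised with running_mean/running_var

    print(y_train.mean().item())               # ~0: batch stats were used
    print(y_eval.mean().item())                # clearly not ~0 after a single running update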

I must say I’m unconvinced that the two are connected.

@sgugger, @jeremy So if I use LAMB as an example rather than BN, is it fair if I blog about why I think the batch-size-adapted experimental moving average is good in the mom = 1-(1-mom_for_bs_1)**bs way, without disclosing parts of your research that you want to keep under the lid for now?

I would say undefined instead of infinite.

Good to know - I ran my training on Kaggle and had the var() problem too with the current fastai version.

Completely non-technical question. I find part 2 v3 very intense but rewarding. Every time I watch a class video I learn something new. But how is everyone using the notebooks? Is the tried-and-tested method of repeating the work from the notebooks many times, until I can replicate it without looking back, still the best? Is there anything I can do on top of this to get an even deeper understanding?

Please read the first post - you need to run torch-nightly for this class, which kaggle isn’t running, so you won’t be able to run this class on the kaggle platform until pytorch-1.1.0 is released and installed on kaggle.