Lesson 10 Discussion & Wiki (2019)

I didn’t think of it as a side-effect - it was the main reason I did it that way. But if there’s a way that avoids serializing unnecessary data (and doesn’t require significant extra complexity), I’d be happy to switch to it.

Originally it was used out of necessity, since we wanted those vars to be stored in the model so that they could be used during inference. In the refactored version, if you think it’s valid, they are no longer needed.

I think that’s reasonable.

The thing I’d really like is to change that if self.training: to if self.training and (self.steps<100 or self.steps%4==0):, so that once things have stabilized it doesn’t recalc stats so often. Last time I tried to get this to work I had trouble figuring out the detach details. If anyone gets this working please let me know! :slight_smile:
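Roughly the idea (just a sketch, not the notebook’s actual RunningBatchNorm - the class, the self.steps counter, and the update_stats logic here are made up for illustration):

import torch
from torch import nn

class RunningBatchNormSketch(nn.Module):
    "Sketch only: skip the stats update once things have stabilized."
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps, self.steps = mom, eps, 0
        self.mults = nn.Parameter(torch.ones(1, nf, 1, 1))
        self.adds  = nn.Parameter(torch.zeros(1, nf, 1, 1))
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))
        self.register_buffer('vars',  torch.ones(1, nf, 1, 1))

    def update_stats(self, x):
        # stand-in for the real sums/sqrs/dbias bookkeeping
        with torch.no_grad():
            self.means.lerp_(x.mean((0, 2, 3), keepdim=True), self.mom)
            self.vars.lerp_(x.var((0, 2, 3), keepdim=True), self.mom)

    def forward(self, x):
        # recalc stats for the first 100 steps, then only every 4th step
        if self.training and (self.steps < 100 or self.steps % 4 == 0):
            self.update_stats(x)
        self.steps += 1
        x = (x - self.means) / (self.vars + self.eps).sqrt()
        return x * self.mults + self.adds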

Nope, originally I used it so it would be moved to CUDA automatically. I don’t know how to do it otherwise in a convenient way, either in the existing or the refactored version.

Although AFAICT it still needs to be stored, since otherwise fine-tuning won’t work.

On a related note, I’ve also seen variable names like “input” used in functions. For example, the definition of nll here.

Yes, it always feels odd to me when I do that, but I stay consistent with pytorch, so I use it in loss functions.


Have you tried running nll and then doing the actual input() call as it’s intended by python? If the latter breaks, then it’s a bug in pytorch and fastai.

a quick test shows that it should break:

vars()
vars = 5
vars() # fails

edit: as @amanmadaan replies later, this is not a problem since it’s a local variable, so it’s ok.

A small clarification…

At 1:52:24, “The variance of a batch of one is infinite.” I think what’s meant here is that the variance comes out as zero, and you would be dividing by zero to normalize the batch to standard deviation 1.

I understand there’s a difference between population variance and sample variance, and that PyTorch var() returns NaN. But for this explanation what is pertinent is why the filter would be scaled to infinity.
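For reference, this is what PyTorch actually does with a single-element tensor:

import torch

x = torch.tensor([3.0])
print(x.var())                 # tensor(nan): unbiased (sample) variance divides by n-1 = 0
print(x.var(unbiased=False))   # tensor(0.): the population variance of a single value is zero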

I’ll experiment with it since I also want to understand that detach thing.

How would you recommend “measuring” any regression in such fine-tuning? I know you usually recommend keeping the randomness, but it feels to me like this situation should call for a fixed seed, at least in the initial steps, so that any regressions can be seen immediately. Does that make sense?

Also, I think your suggested 100 should be different depending on the bs, no? It’d be a very different measurement with bs=2 vs bs=512.

Right, it will fail if input is used in the function (or the same scope).

The following is legal:

def times5(input):
    print(int(input) * 5)

ip = input("Enter a number")
times5(ip)

which is how pytorch and fastai use the variable input.

The following is not:

input = input("Enter a number")
input
input("Enter another number") #will fail

I just run things a few times to get a sense of how stable they are. Generally it’s pretty obvious when something breaks. If I have a fixed seed I find it hard to know if it’s working since I might have got lucky with that one seed.

Yes that’s better. In an earlier version I kept a counter called self.batch and did self.batch += bs. Then you could check self.batch<200 or similar.


Ah, yes, of course, it is then a local variable, so as long as the shadowed builtin is not needed in that scope it should not be a problem. Good point.


I’ve added some errata to the top post. I may have missed some - please add any more you’re aware of.


Fantastic list, Thomas!

What’s the correct way to stash away some data flowing through the layer without affecting that data? i.e. without messing up the input and output in forward/backward. Is the following correct?

    def __init__(self):
        super().__init__()
        self.stash = []
    def forward(self, x):
        if self.training:
            with torch.no_grad(): self.stash.append(x.clone().detach_())

or is there a more efficient way?

(This of course won’t be the real code, so we wouldn’t be buffering up all of the data - just, say, a few forward passes’ worth.)

You can’t calculate var with bs=1.

You cannot calculate the unbiased var of a single-element tensor (you would only get a biased one). But usually you have h > 1 and w > 1, so that isn’t a problem. Even for a tensor with a single element per channel, you can track (x**2).mean((0, 2, 3)).
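i.e. roughly something like this (a sketch with made-up names, not the notebook code):

import torch

# with bs = 1 and h = w = 1 there is only one element per channel, so a per-batch
# variance is useless, but running estimates of E[x] and E[x^2] can still be kept:
run_mean, run_sqmean, mom = torch.zeros(8), torch.zeros(8), 0.1
for _ in range(100):
    x = torch.randn(1, 8, 1, 1)
    run_mean.lerp_(x.mean((0, 2, 3)), mom)
    run_sqmean.lerp_((x ** 2).mean((0, 2, 3)), mom)
var = run_sqmean - run_mean ** 2   # (biased) per-channel variance: E[x^2] - E[x]^2
print(var)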

To be honest, I’m skeptical of BN when you only have a few features. “Traditional” BN is completely bogus with feature planes of 1 (because after normalizing, the input x will be 0); running BN will be a bit better, but will it be good?

If you look through this lesson’s nb7, it shows that layer norm (which is what you suggested) doesn’t work for classification - i.e. it is no longer a batch norm (especially if we practically don’t have a batch, i.e. bs=1 or 2).

That’s why the Running BN proposed by Jeremy tries to overcome this problem by interpolating the older data with the new.

I also wonder whether this should be done only when bs is detected to be small, falling back to standard BN if bs is large. Need to measure.

Actually that counter is still in the current version. Although it would need to be updated even when not calculating stats.


My thought is that just recalculating stats occasionally is the best of both worlds. Especially if you recalculate every (n) batches rather than every (n) steps.


Most of the time, you don’t need the clone but can just do x.detach() - that is unless you have something modifying x in place.
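A quick way to see the difference:

import torch

x = torch.randn(3)
stash = x.detach()   # no copy: shares storage with x
x.add_(1)            # an in-place change to x...
print(stash)         # ...shows up in stash too, hence the clone() if x may be mutated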

I must admit you lost me on what you’re trying to achieve and what the exact dimensions of your tensors are. Common BN will take averages over bs, h, w (per channel), while LN will average over channel, h, w.
So to me running BN has three elements over regular BN: 1) use an online var estimate instead of the mean of per-batch vars, 2) always use the aggregate stats, 3) scale the momentum with batch size.

So I must admit that I missed where the mom1 formula with sqrt(n-1) comes from; that seems to be the most problematic part for n=1.
I just checked the video now (without sound), and the mom1 = 1-(1-mom)**bs used there looked roughly “as might be expected”, probably because of sub-annual discount factors and default rates. (Even though, apparently, two years ago I was thinking more of a linear adjustment of the momentum.)


Why is “sudden large change” better than incrementally changing with a step size that depends on batch size? But I missed the change from the geometric momentum interpolation to the division by square root of n-1, too… (Is that an erratum as well?)

It was changed in this commit right after the lesson was filmed, but I don’t know why.
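For concreteness, the two variants being compared look roughly like this, as far as I can read them from the video and the later commit (treat the exact forms as my reading, not gospel):

import math

mom, bs = 0.1, 32
mom1_video  = 1 - (1 - mom) ** bs                 # geometric interpolation shown in the video
mom1_commit = 1 - (1 - mom) / math.sqrt(bs - 1)   # sqrt(n-1) version; blows up for bs=1
print(mom1_video, mom1_commit)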

@t-v, do you by chance have an insight on how to avoid using register_buffer and yet get the temp module variables to automatically follow the device when the whole model is moved to say a specific device?

Something that would tell pytorch this variable should follow the module’s device setting, so that it doesn’t have to be switched manually - manual switching won’t work in a generic module like BN.

I looked at the register_buffer implementation, but it doesn’t need to do anything special, since it adds to self._buffers, which gets automatically moved to the right device with the model.

To summarize - we have temp variables that need to be on the same device as the model but we don’t want to store them in the model’s state_dict, since they are only used during training.

The only workaround I can think of, is to do:

    def __init__(self, ...):
        [...]
        #self.register_buffer('sums', torch.zeros(1,nf,1,1))
        self.sums = torch.zeros(1,nf,1,1)
    def forward(self, x):
        self.sums.to(x.device)           # remains on cpu, while x is on cuda
        self.sums.lerp_(cuda_vars, ...)  # fails on the cuda-cpu mismatch

but it won’t cast it to cuda! self.sums remains on cpu despite the “casting”, even though x.device returns cuda:0. And at __init__ we don’t know the device yet.
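Though maybe the missing piece is just that Tensor.to() returns a new tensor rather than moving the existing one in place, so the result has to be assigned back - something like this (a sketch with made-up names, untested against the real BN code):

import torch
from torch import nn

class Sketch(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.sums = torch.zeros(1, nf, 1, 1)   # plain attribute, not a buffer

    def forward(self, x):
        # .to() is not in-place, so assign the result back; from the first
        # forward pass on, self.sums lives on whatever device x is on
        self.sums = self.sums.to(x.device)
        self.sums = self.sums + x.detach().sum((0, 2, 3), keepdim=True)
        return x

m = Sketch(4)
m(torch.randn(2, 4, 3, 3))   # works on cpu; with a cuda input, sums follows along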

Thank you!