Lesson 10 Discussion & Wiki (2019)

I was worried about falling behind, not having fully grasped the lesson 9 contents yet… Lesson 10 has been out for almost a week now. I thought “the hell, I’ll watch it even if I don’t get everything”, so as to keep up with the pace.
And then the video starts with a perfect “don’t worry”!

Thanks Jeremy for putting extraordinary content out there while still taking care of the not-so-extraordinary people like me and making sure they don’t despair. I believe making sure everybody makes it is the mark of awesome teachers, and Rachel and you certainly are!

3 Likes

To be honest, I don’t really see a scenario where you’ll be loading weights when you

  1. have instantiated the exact same model,
  2. have forgotten the hyperparameters you are using, and
  3. want the same hyperparameters.

What have you done with the LR in that situation?

When loading a model state dict, you already instantiated the model and are all set up.

But maybe I’m just blinded by how I use notebooks. Note that if you want to register eps as a buffer, you make it a (0d) tensor, and that has different semantics than a float (most apparent when you bring the JIT into the game). An alternative might be to tweak the state_dict methods (see the PyTorch BN source for an example).
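For what it’s worth, here is a minimal sketch of the buffer idea (a toy module made up for illustration, not the actual fastai/PyTorch BatchNorm): registering eps with register_buffer makes it a 0-d tensor that travels with the state_dict, at the cost of tensor rather than float semantics.

```python
import torch
from torch import nn

class NormWithSavedEps(nn.Module):
    """Toy normalization module: eps is a 0-d tensor buffer,
    so it is saved and restored with the model's state_dict."""
    def __init__(self, nf, eps=1e-5):
        super().__init__()
        self.mults = nn.Parameter(torch.ones(nf))
        self.adds  = nn.Parameter(torch.zeros(nf))
        # 0-d tensor buffer: part of state_dict, unlike a plain float attribute
        self.register_buffer('eps', torch.tensor(eps))

    def forward(self, x):
        m, v = x.mean(0), x.var(0)
        return (x - m) / (v + self.eps).sqrt() * self.mults + self.adds

m = NormWithSavedEps(4, eps=1e-3)
sd = m.state_dict()
print(sorted(sd.keys()))   # eps is saved alongside the learned tensors

m2 = NormWithSavedEps(4)   # default eps ...
m2.load_state_dict(sd)     # ... overwritten on load
print(m2.eps.item())       # ≈ 0.001
```

So with this approach the saved model carries its own eps, which speaks to the inference question above.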

Fun (well) fact: I did submit a patch to Python about finding the source for classes, but from my experience with that, it seems that that just isn’t for me.

Best regards

Thomas

This is perfect, @t-v - thank you! It’d be great to update the pytorch doc with this version (and preserving the vertical white-space as you wrote it - helps to group things better and make it more readable).

I also found a visual representation of the process. Perhaps linking to that paper with the image would help the readers of the unfold docs?


(The bottom line of the diagram)

From High Performance Convolutional Neural Networks for Document Processing.

So you basically take kernel-sized patches of the input, flatten them, and then do a matrix multiplication with the flattened kernel.
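As a sanity check of that description, unfold plus a single matmul does reproduce conv2d (a sketch with arbitrary shapes, no padding or stride, names made up):

```python
import torch
import torch.nn.functional as F

# im2col view of convolution: unfold extracts the kernel-sized patches,
# one matmul with the flattened kernels does the rest.
x = torch.randn(1, 3, 8, 8)            # (batch, in_channels, H, W)
w = torch.randn(5, 3, 3, 3)            # (out_channels, in_channels, kH, kW)

patches = F.unfold(x, kernel_size=3)   # (1, 3*3*3, n_patches) = (1, 27, 36)
out = w.view(5, -1) @ patches          # flattened kernels @ flattened patches
out = out.view(1, 5, 6, 6)             # restore the spatial layout

assert torch.allclose(out, F.conv2d(x, w), atol=1e-5)
```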

And I remember Jeremy also had a different visual representation in his slides.

1 Like

In your own work that shouldn’t be a problem. I was thinking more of: here is a saved, trained model (e.g. on ImageNet), you can use it for your problem - how, for example, would you configure the eps arg of nn.BatchNorm1d, which is part of that saved model, in such a case?

lr is a hyperparameter of the optimizer, not of the model, so LR is not part of the model. But if you have hyperparameters that are part of the model and you didn’t save them with it, how do you configure them for inference?

Is my example use case clearer?

But maybe I’m just blinded by how I use notebooks. Note that if you want to register eps as a buffer, you make it a (0d) tensor, and that has different semantics than a float (most apparent when you bring the JIT into the game). An alternative might be to tweak the state_dict methods (see the PyTorch BN source for an example).

I haven’t tried JIT yet, hoping Jeremy will have a chance to do some demos in the upcoming lessons. Thank you for the heads up on that distinction, @t-v!

Fun (well) fact: I did submit a patch to Python about finding the source for classes, but from my experience with that, it seems that that just isn’t for me.

Sorry you didn’t have a welcoming experience with the gatekeepers, Thomas :frowning:

In particular, I think those who have never used jupyter hardly have an idea of what kind of problem you’re trying to solve - since when they hear ipython, to them it’s a quick throwaway situation, whereas jupyter, while using ipython, is a very different beast.

Hey all, I was working on a side project to parse the top python repos on GitHub so I could see what the most used libraries and functions were to help people who are new to the language to learn more efficiently. When Jeremy said in lecture 10 “these are the dunder methods you need to know”, I thought it’d be fun to fact check him :joy: I hope you enjoy.

6 Likes

Hate to be picky here, but you’re right, the entropy notebook on the git repo is not the same as the one Jeremy showed. Not that it would be that hard to create by watching the video.

Ok, so I decided to create one, but couldn’t figure out how to make a gist from an Excel spreadsheet. So here’s a link to it on my Google Drive.

Why would you measure “dead” activations in terms of the activation level as opposed to, say, the gradient you get for them or whether they vary with the input?
Now I can see that activation level is easier to get (in particular compared to whether activations change with the input), but it would seem that “fixed” nonzero activations are uninformative. Do those not happen?
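One cheap way to probe this question is to record a layer’s outputs on a couple of batches with a forward hook and compare the level-based criterion against a does-it-vary criterion. A minimal sketch (the model and thresholds here are made up for illustration):

```python
import torch
from torch import nn

# Compare two notions of "dead": always zero (activation level)
# vs. never varying with the input (uninformative).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))

acts = []
hook = model[1].register_forward_hook(lambda mod, inp, out: acts.append(out.detach()))
model(torch.randn(64, 10))
model(torch.randn(64, 10))
hook.remove()

allacts = torch.cat(acts)            # (128, 20) ReLU outputs over two batches
dead  = (allacts == 0).all(0)        # the "activation level" criterion: always zero
fixed = allacts.std(0) < 1e-6        # the "doesn't vary with the input" criterion
print(f"always zero: {int(dead.sum())}/20, input-invariant: {int(fixed.sum())}/20")
```

For a ReLU, a unit that is zero on every input is also input-invariant, but a "fixed nonzero" unit would show up only in the second count.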

2 Likes

I had an idea to try to solve the small batch problem in BatchNorm by buffering up enough mini-batches and then calculating the stats and applying them, every N mini-batches. It didn’t work too well: it requires a small lr, and it requires more GPU RAM to buffer up inputs, which is a problem since users usually pick a small bs precisely when they are short on GPU RAM in the first place. But I thought I’d share it, in case someone has some creative ideas to improve upon my attempts:

# Based on BatchNorm implementation from 07_batchnorm.ipynb
# This version buffers up until enough mini-batches are gathered to get a good variance measurement.
# with high learning rate this doesn't work, since delaying normalization even by 2 passes often leads to explosion or vanishing of data.
class AccBatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom,self.eps = mom,eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('vars',  torch.ones(1,nf,1,1))
        self.register_buffer('means', torch.zeros(1,nf,1,1))        
        self.x = None
        self.bs_acc = 0
        self.bs_goal = 0

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m,v,ok = self.update_stats(x)
            if not ok: return x*self.mults + self.adds     
        else: m,v = self.means,self.vars
        x = (x-m) / (v+self.eps).sqrt()
        return x*self.mults + self.adds
    
    def update_stats(self, x):
        bs,nc,*_ = x.shape
        if not self.bs_goal:
            proportion = 4 # max 4 forward runs w/o stats updates (except small bs, where bs_min forces more, e.g. 8 runs at bs=1)
            bs_min,bs_max = 8,256
            bs_goal = bs*proportion
            if bs_goal > bs_max: bs_goal = bs_max 
            if bs_goal < bs_min: bs_goal = bs_min
            print(f"got bs={bs}, use bs_goal={bs_goal}")
            self.bs_goal = bs_goal
                     
        if bs < self.bs_goal:
            if self.x is None: self.x = x
            else:              self.x = torch.cat([self.x, x])               
            self.bs_acc += bs
            if self.bs_acc < self.bs_goal: return None, None, False
            m = self.x.mean((0,2,3), keepdim=True)
            v = self.x.var ((0,2,3), keepdim=True)
            # reset buffers
            self.bs_acc = 0
            self.x = None     
        else:
            m = x.mean((0,2,3), keepdim=True)
            v = x.var ((0,2,3), keepdim=True)
                     
        self.means.lerp_(m, self.mom)
        self.vars.lerp_ (v, self.mom)
        return m, v, True

I tried a bunch of different workarounds, so if you are inspired to make suggestions please consider trying them in action first :wink: Just put it in the 07th nb and run it, substituting BatchNorm with AccBatchNorm inside conv_rbn.

And if you don’t want to read through it, the TLDR version is:

    def update_stats(self, x):
        if not enough buffered up:
            if self.x is None: self.x = x
            else:              self.x = torch.cat([self.x, x])      
            return early
        else:
            as in BN-original

Another idea I had is to buffer mini-batches up until the variance is good enough (according to some threshold), but I think it’d have the same problem as this version: while data is being buffered up and BN is delayed, gradients tend to blow up or vanish within a few mini-batches of non-action. So a very low lr is required.

1 Like

Just finished watching lesson 10
Thanks for coming up with running BN and implementing it. Sounds like a great idea. In hindsight it seems like such an obvious thing to do. That’s a good sign.

I see a vague similarity with Adam in that we keep decaying averages of the first and second moments (and have to debias). Maybe there’s a common theme there.

I also wonder about whether this could somehow be applied to RNNs. Maybe each pass through the recurrent layer could contribute to the running BN?

This is a great question - I’d love to see if you or some of the other participants could have a go at answering it first. It’s an excellent opportunity to test your understanding of the last lesson.

clamp and bn momentum are both used to stabilize training, which isn’t something the loss function measures directly, so these might not work well as learnable params. Leak, however, can be learned, which creates a new activation function called PReLU (which is in the He et al. paper we’ve been discussing recently).
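For reference, a small sketch of the learnable leak via nn.PReLU (the values here are arbitrary):

```python
import torch
from torch import nn

# PReLU makes the negative slope a learnable parameter,
# so the optimizer can tune the leak during training.
act = nn.PReLU(init=0.25)          # one learnable slope, initialized at 0.25
x = torch.tensor([-2.0, -1.0, 0.0, 1.0])
print(act(x))                      # negative side scaled by the learnable slope

# equivalent by hand: max(0, x) + slope * min(0, x)
slope = act.weight
manual = x.clamp(min=0) + slope * x.clamp(max=0)
assert torch.allclose(act(x), manual)
```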

3 Likes

Yeah like @t-v says, not an error - but definitely a big problem! It should indeed be a buffer.

1 Like

I don’t think you’ll really see any impact until you have a problem to solve that actually needs all those params. For mnist we have a simple problem and probably more params than we need.

1 Like

Quick question about unfold (sorry if it’s already in the docs and I missed it): does it actually create those tensors in memory (which would be very inefficient for doing a conv, for instance) or does it create an efficient view using stride etc?

Same thing when we convert a “date field” into its parts (year, month, day, is_month_start …).

1 Like

As far as I know, it does create those tensors in memory and is very inefficient for doing a conv. (I did some experiments replacing the “average” in the conv with “medians” in the hope of getting a more robust thing, but it really runs out of memory fast, too.)
Once you have padding, you can’t do this as a view; I didn’t check whether you can if you don’t. Another thing to be aware of is that I would expect the gemm implementations to want contiguous inputs, so if you have some cleverly strided view you’ll still get a copy when calling the gemm. So a key memory trick for implementing your own would likely be to wrap your new conv in an autograd.Function and avoid keeping the instantiated matrix around between forward and backward. The contiguous copy for gemm might automatically be temporary in that sense.

Related anecdote: When I ported PyTorch to Android, the fallback THNN conv (that I think uses this at least “morally”) wanted to allocate so much memory in the Style Transfer example that Android refused (I think the limit is 0.5GB). I solved it by bringing NNPack back for CPU convolutions which also offered a ridiculously large speed increase. So you really want to have PyTorch built with MKLDNN (the default, I think) or NNPack when running on CPU.

Best regards

Thomas

2 Likes

I should emphasize that I don’t believe this is representative of the Python community as a whole, and possibly not even of the particular person. He might have been in a bad mood, or maybe my style of presenting it triggered something that didn’t sit well with him, or whatever.

The impact one could make by submitting patches to Python is probably far greater than what I usually do, so I don’t want to discourage others. I’m sure that anyone else will be much happier than me after submitting patches.

1 Like

My understanding is that batchnorm is helping with:

  • Stabilizing the input of the next layer (less change of variance and scale after each iteration and between batches)
  • Helping to shift/scale the activation without needing to change all the weights (using mults and adds)

So, I would say it seems much better to apply it after the ReLU. If you apply it before, the stats of the activations will be modified by the ReLU.
Is this a correct understanding?

2 Likes

Some thoughts about your code without trying it out:

  1. If you have to keep x, you need to detach it to avoid the memory problem.
  2. Incremental cat self.x = torch.cat([self.x, x]) is bad! It’s quadratic complexity where it should be linear, and that does show up frequently in real problems. Love the Python lists, use the Python lists.
  3. Much(!) more efficient in terms of memory would be to keep the mean and var (or the uncentered second moment, like Jeremy, if you want to avoid the effort of adjusting the saved var to a new mean) and combine those.
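The second point can be sketched like this (shapes made up): append chunks to a Python list and cat once at the end, instead of re-copying the whole buffer on every step.

```python
import torch

# Linear: O(1) per step, one O(n) copy at the end.
chunks = []
for _ in range(10):
    chunks.append(torch.randn(8, 3))
buffered = torch.cat(chunks)

# Quadratic: torch.cat inside the loop re-copies everything each time.
buf = torch.empty(0, 3)
for _ in range(10):
    buf = torch.cat([buf, torch.randn(8, 3)])

assert buffered.shape == buf.shape == (80, 3)
```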

Regarding the third point, when thinking about the computational cost of BN (and it is an expensive operation): x is a large tensor, expensive to operate on or keep, while the running stats are small.
This is also something to keep in mind here:

        x = (x-m) / (v+self.eps).sqrt()
        return x*self.mults + self.adds

It’s much more efficient to combine m and (v+self.eps).sqrt() with self.mults / self.adds first, unless you know exactly that it’s not going to have an impact. (Even broadcasted tensors can have a significant impact, e.g. this tries to mitigate the lack of “read broadcasted values only once”.)
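Concretely, the combine-first suggestion folds the stats into a single scale and shift computed on the small tensors (a sketch with made-up shapes and values):

```python
import torch

torch.manual_seed(0)
nf = 4
x     = torch.randn(16, nf, 8, 8)      # the big tensor
m     = torch.randn(1, nf, 1, 1)       # small stats
v     = torch.rand(1, nf, 1, 1)
mults = torch.randn(nf, 1, 1)
adds  = torch.randn(nf, 1, 1)
eps   = 1e-5

# naive: four broadcasted ops, each touching the full-sized x
naive = (x - m) / (v + eps).sqrt() * mults + adds

# fused: fold stats into one scale/shift on the small tensors, apply once
scale = mults / (v + eps).sqrt()
shift = adds - m * scale
fused = x * scale + shift

assert torch.allclose(naive, fused, atol=1e-5)
```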

Best regards

Thomas

1 Like

For the max accuracy in a single epoch, I was able to use a power function and get it to 99.1%.

1 Like