Lesson 10 Discussion & Wiki (2019)

During imports we use datasets from fastai, so we can’t just install pytorch-nightly without fastai.

1 Like

That is correct!

3 Likes

I guess you’re referring to changing it like so?:

I’ve updated the final notebook version with this one - thanks for that!

2 Likes

Yes, that’s what I meant and I think that is the most efficient way of writing it.

1 Like

The following 2 lines in the lesson 10 notebook can be replaced:

#This monkey-patch is there to be able to plot tensors.
torch.Tensor.ndim = property(lambda x: len(x.shape))

by having get_min return a NumPy array instead:

def get_min(h):
   h1 = torch.stack(h.stats[2]).t().float()
   return (h1[:2].sum(0)/h1.sum(0)).numpy()

because .numpy() is just a view on the underlying tensor storage, as shown here (using the original, tensor-returning get_min):

t = get_min(hooks[0])[:5]
n = t.numpy()
print(f"tensor:{t}"), print(f"numpy view on tensor:{n}")
n *= 0.0
print(f"tensor:{t}"), print(f"numpy view on tensor:{n}")

tensor:tensor([0.8043, 0.8071, 0.8080, 0.7945, 0.7856])
numpy view on tensor:[0.8043014 0.80710274 0.80797964 0.7945432 0.7856395 ]
tensor:tensor([0., 0., 0., 0., 0.])
numpy view on tensor:[0. 0. 0. 0. 0.]

So regarding batch norm before or after: I know that the original paper states that BN comes before the ReLU. That said, I think it makes more sense after the ReLU if we are trying to stabilize the inputs to the next layer. I can do some tests with a simple network to see how the results change, but I’d like to have a better theoretical answer as to why one approach is better than the other.
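
For anyone who wants to run that comparison, here is a minimal sketch of the two orderings (toy layer sizes, not from the notebook):

import torch.nn as nn

# BN before the activation, as in the original paper
bn_before = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1),
                          nn.BatchNorm2d(16),
                          nn.ReLU())

# BN after the activation, so the next layer sees normalized inputs
bn_after  = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1),
                          nn.ReLU(),
                          nn.BatchNorm2d(16))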

As I was digging for information, I found this video by Ian Goodfellow on BN, if anyone is interested. https://youtu.be/Xogn6veSyxA?t=325

1 Like

A couple of questions about the reshaping of our MNIST data and the related callbacks:

  • would reshaping all of the images to 28x28 as a preprocessing step also be a valid approach, or is there a reason we have to reshape and then flatten later?
  • is it important that the reshaping callback have _order=2, or could the order be lower?
  • this function totally warped my mind
def view_tfm(*size):
    def _inner(x): return x.view(*((-1,)+size))
    return _inner

I get that we’re applying tuple unpacking to size so it can be arbitrary, and that what we ultimately want to pass to view is the same shape we specified in size, prepended by a -1 - that’s what the (-1,) is doing. But it seems to me that what view receives is the unpacked resulting tuple, meaning “-1 1 28 28” in our case (this is also what I get when I add print(*((-1,) + size)) to view_tfm). However, this must not be the case, because view(-1 1 28 28) is not syntactically valid, while view_tfm works fine. So my two questions on this are:

  • what is view receiving as arguments, and how does that happen? (see the small sketch just after this list)
  • what is the “right” way to interrogate *((-1,) + size) or any similarly baffling expression further? Calling type on it or trying to assign it to a variable gives errors; I tried stepping through with the debugger (which I am horrible with, so that might be the problem), but since the thing I wanted to look at more closely doesn’t have a name, I had trouble knowing what to do. Suggestions?
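
Not from the notebook - just a quick sketch of what the * unpacking does here:

import torch

def view_tfm(*size):
    def _inner(x): return x.view(*((-1,) + size))
    return _inner

x = torch.randn(4, 784)                  # a fake batch of flattened MNIST images
mnist_view = view_tfm(1, 28, 28)
print(mnist_view(x).shape)               # torch.Size([4, 1, 28, 28])

# The * operator unpacks the tuple into separate positional arguments,
# so x.view(*((-1,) + size)) is exactly the call x.view(-1, 1, 28, 28):
args = (-1,) + (1, 28, 28)
print(x.view(*args).shape)               # same result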

Cheers!

2 Likes

Yes, but it’s much better to be able to use PyTorch with matplotlib, so you don’t have to convert just to plot.

Hmm, right - I updated the first post and removed the first option, which suggested fastai wasn’t needed for the part 2 lessons. Thank you.

You are so right, so I created this issue with PyTorch: https://github.com/pytorch/pytorch/issues/19119

Thank you for the improvement suggestions performance-wise, @t-v!

  1. If you have to keep x, you need to detach it to avoid the memory problem.

Even though this code is being run inside a with torch.no_grad(): block?

  2. Incremental cat self.x = torch.cat([self.x, x]) is bad! It’s quadratic complexity where it should be linear, and that does show up frequently in real problems. Love the Python lists, use the Python lists.

So you’re suggesting list append and then a single cat, correct?

l = []
for i in range(n): l.append(x) # simulate multiple forward calls
x = torch.cat(l)
  3. Much(!) more efficient in terms of memory would be to keep the mean and var (or the uncentered second moment, like Jeremy, if you want to avoid the effort of adjusting the saved var to a new mean) and combine those.

This doesn’t work, since this is exactly the problem we are trying to solve - the variance is often nan because there is not enough data to calculate it on, and my idea was to gather enough data to do so.

(edit: replaced mean with variance)

And I know my attempt was very memory inefficient. It was just an exercise of approaching it in a simple way rather than with finesse.

@jeremy, going from BatchNorm to RunningBatchNorm the with torch.no_grad(): was removed - is it no longer needed because you instead used detach on the large tensors inside update_stats?

@t-v, what about 0D variables inside RunningBatchNorm.update_stats - shouldn’t those be detached too?

And is it still the case that one needs to detach if a variable is not a buffer or parameter, but is, say, just doing:

__init__:
    self.counter = 0
forward:
    self.counter += 1

I thought normal variables won’t attach themselves to the graph, unless they were created with requires_grad=True or are used in a calculation of a variable that is already part of a graph.
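
A quick check with a toy module (purely illustrative) suggests that a plain Python attribute indeed never ends up in the graph:

import torch, torch.nn as nn

class Counted(nn.Module):
    def __init__(self):
        super().__init__()
        self.counter = 0                 # plain Python int, not a tensor
    def forward(self, x):
        self.counter += 1                # just Python arithmetic, no autograd involved
        return x * 2

m = Counted()
y = m(torch.randn(3, requires_grad=True))
print(type(m.counter), m.counter)        # <class 'int'> 1
print(y.requires_grad)                   # True - the output is in the graph, the counter is not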

torch.no_grad() will not connect new variables to the graph, but it won’t do anything to x (which you probably had before).

So you’re suggesting list append and then a single cat, correct?

Yes. The “obvious” pattern is correct here.
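
For concreteness, a small sketch of the two patterns being compared (toy sizes, not from the notebook):

import torch

xs = [torch.randn(4, 8) for _ in range(100)]   # pretend these are per-forward activations

# quadratic: each cat copies everything accumulated so far
acc = xs[0]
for x in xs[1:]:
    acc = torch.cat([acc, x])

# linear: append to a Python list, concatenate once at the end
buf = []
for x in xs:
    buf.append(x)
acc2 = torch.cat(buf)

print(acc.shape, acc2.shape, torch.equal(acc, acc2))   # both torch.Size([400, 8]), True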

This doesn’t work, since this is exactly the problem we are trying to solve - where mean is often nan because there is not enough data to calculate it on and my idea was to gather enough data to do it on.

mean should be non-nan once you have tensors with more than 0 elements, but if your input sizes vary, you’d probably want sum and keep track of numel.
var will not be defined with bs * w * h = 1, but then you’re probably doing something fundamentally wrong. Non-centered moments should work just as mean does.
Maybe I don’t quite understand what you’re trying to do, though.
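
To make that concrete, a rough sketch of the moment-tracking approach (names are illustrative, not from the notebook):

import torch

# Accumulate per-channel sums, sums of squares, and the element count across
# tiny batches, then derive mean and var from the non-centered moments.
nc = 3
sums, sqrs, count = torch.zeros(nc), torch.zeros(nc), 0.
for xb in [torch.randn(1, nc, 4, 4) for _ in range(8)]:   # simulate eight bs=1 batches
    sums  += xb.sum(dim=(0, 2, 3))
    sqrs  += (xb * xb).sum(dim=(0, 2, 3))
    count += xb.numel() / nc
mean = sums / count
var  = sqrs / count - mean * mean        # E[x^2] - E[x]^2, no nan even with bs=1
print(mean, var)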

Best regards

Thomas

1 Like

With any bit of luck, they’re not requiring grad, so no detaching needed. :slight_smile:

1 Like

I think I’m a little bit lost, and putting some context back will help. So if I want to save a copy of x inside a layer, so that I can refer to it in later forward passes, like so:

forward:
with torch.no_grad(): l.append(x)

I can’t detach x or it’d mess up the original x, no? So do I need to clone x instead and set it not to require grad?

In other words, what’s the correct way to stash away some data flowing through the layer without affecting it? i.e. don’t mess up input and output in forward/backward.

var will not be defined with bs * w * h = 1, but then you’re probably doing something fundamentally wrong. Non-centered moments should work just as mean does.
Maybe I don’t quite understand what you’re trying to do, though.

We are trying to solve the problem where there is not enough RAM and a user runs with bs=2 or bs=1. You can’t calculate var with bs=1. So I save that single input, do nothing in this forward pass, then concatenate it with a new bs=1 input from the following pass, and then I might be able to calculate variance (though more likely I need at least 4-8 data points, so I’d need to aggregate 4-8 mini-batches if bs=1 or 2). Is this helpful?

This doesn’t work, since this is exactly the problem we are trying to solve - where mean is often nan

Oh, I see I didn’t write what I intended. I meant to say variance instead of mean.

No. There are six or so cases:

  • x.detach_() change tensor to not require grad --> You don’t want this.
  • x.detach() new tensor with same memory(!) and no requires_grad, unconnected to the graph. --> this is what you want if you save x (which you should not) or for using that in calculating mean/std.
  • x.clone() new tensor and new memory but grad-connected if x requires grad
  • x.clone().detach_() new tensor, new memory, no requires grad, unconnected to the graph
  • x.detach().requires_grad_() new tensor, same memory, requires grad, but not connected (i.e. leaf)
  • x.clone().detach_().requires_grad_() oh well, you’re bored by now.

no_grad might be odd to use here.
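
A quick way to verify a couple of these cases is to compare storage pointers and graph attributes (purely illustrative):

import torch

x = torch.randn(3, requires_grad=True)

d = x.detach()
print(d.data_ptr() == x.data_ptr(), d.requires_grad, d.grad_fn)    # True False None

c = x.clone()
print(c.data_ptr() == x.data_ptr(), c.requires_grad, c.grad_fn)    # False True <CloneBackward ...>

cd = x.clone().detach_()
print(cd.data_ptr() == x.data_ptr(), cd.requires_grad, cd.grad_fn) # False False None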

You cannot calculate the unbiased var of a single-element tensor (at best you would get a biased one). But usually you have h > 1 and w > 1, so that isn’t a problem. Even for a tensor with a single element per channel, you can track (x**2).mean((0, 2, 3)).

To be honest, I’m skeptical of BN when you only have a few features; “traditional” BN is completely bogus with feature planes of 1 (because after normalizing x, the input will be 0). Running BN will be a bit better, but will it be good?

Best regards

Thomas

3 Likes

@jeremy, looking at the latest incarnation of RunningBatchNorm, why are we recalculating everything for inference? Here is a refactored version:

#export
class RunningBatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.mults = nn.Parameter(torch.ones (nf,1,1))
        self.adds  = nn.Parameter(torch.zeros(nf,1,1))
        self.register_buffer('sums', torch.zeros(1,nf,1,1))
        self.register_buffer('sqrs', torch.zeros(1,nf,1,1))
        self.register_buffer('count', tensor(0.))
        self.register_buffer('factor', tensor(0.))
        self.register_buffer('offset', tensor(0.))
        self.batch = 0
        
    def update_stats(self, x):
        bs,nc,*_ = x.shape
        self.sums.detach_()
        self.sqrs.detach_()
        dims = (0,2,3)
        s    = x    .sum(dims, keepdim=True)
        ss   = (x*x).sum(dims, keepdim=True)
        c    = s.new_tensor(x.numel()/nc)
        mom1 = s.new_tensor(1 - (1-self.mom)/math.sqrt(bs-1))
        self.sums .lerp_(s , mom1)
        self.sqrs .lerp_(ss, mom1)
        self.count.lerp_(c , mom1)
        self.batch += bs
        means = self.sums/self.count
        vars = (self.sqrs/self.count).sub_(means*means)
        if bool(self.batch < 20): vars.clamp_min_(0.01)
        self.factor = self.mults / (vars+self.eps).sqrt()
        self.offset = self.adds - means*self.factor
        
    def forward(self, x):
        if self.training: self.update_stats(x)
        return x*self.factor + self.offset
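
A quick sanity check of the refactored version, assuming the notebook’s usual imports (math, nn, and tensor = torch.tensor):

import torch

rbn = RunningBatchNorm(8)
for _ in range(3):                        # a few "training" batches update the stats
    rbn(torch.randn(16, 8, 4, 4))
rbn.eval()                                # inference just reuses factor/offset
out = rbn(torch.randn(2, 8, 4, 4))
print(out.shape)                          # torch.Size([2, 8, 4, 4])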

The only thing I can’t figure out is how to get rid of the first 3 buffers - they no longer need to be saved in the model and could be normal variables, but if I replace them with normal variables I get a device issue (CUDA vs. CPU), e.g. if I replace:

        #self.register_buffer('sums', torch.zeros(1,nf,1,1))
        self.sums = torch.zeros(1,nf,1,1)

I get:

---> 24         self.sums .lerp_(s , mom1)
     25         self.sqrs .lerp_(ss, mom1)
     26         self.count.lerp_(c , mom1)

RuntimeError: Expected tensor to have CPU Backend, but got tensor with CUDA Backend (while checking arguments for CPU_tensor_apply)

So I then have to do an explicit cuda() or to() when assigning a tensor to those variables, but I don’t know how to do it so that it’ll work transparently regardless of the user’s setup. It seems that register_buffer does the right thing.
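
Indeed, a registered buffer follows the module when it’s moved, while a plain tensor attribute stays where it was created - a tiny check with a toy module:

import torch, torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer('buf', torch.zeros(3))   # moves with .to()/.cuda()
        self.plain = torch.zeros(3)                   # ordinary attribute: stays on the CPU

m = Toy()
if torch.cuda.is_available():
    m.cuda()
    print(m.buf.device, m.plain.device)               # cuda:0 cpu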

p.s. RunningBatchNorm uses a variable named vars, which shadows a Python built-in, so that’s probably not a good idea :wink:

Yeah that’s the other reason to use buffers.

1 Like

TIL. Will change it.

1 Like

But it sounds like we are then using it just for its side effect - why store in the model something that is a temporary variable?

There must be a better way.