Lesson 10 Discussion & Wiki (2019)

t-v · April 8, 2019, 7:09pm

One way to look at it might be that having more outputs than inputs means diluting information.

If you think about this not for images and convolutions but in terms of linear layers, I think it is easier to understand why it isn’t that useful to have more outputs than inputs here: A linear layer with, say, 10d inputs and 100d outputs will map the 10d input space into a 10d subspace of the 100d output space. As such, you have an inefficient representation of a 10d space. (Of course, if you have a ReLU nonlinearity, you have +/- so you might reasonably expect to fill 20d with positive numbers, but 100d still is wasteful.)
But convolutions are, in particular, linear maps between patches (e.g. look at the torch.nn.Unfold example), i.e. you have (in-channels * w * h) inputs, with w and h the kernel size, and out-channels outputs, so the dimension counting works as you (and Jeremy) propose.
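
A small numerical check of the subspace argument, as a sketch only: the 10d/100d sizes come from the paragraph above, while the sample count, the seed and dropping the bias (so that the image is a linear rather than an affine subspace) are assumptions made for the illustration.

```python
import torch

torch.manual_seed(0)

# 10d inputs, 100d outputs as in the text; no bias, so the image is a linear subspace
lin = torch.nn.Linear(10, 100, bias=False).double()
x = torch.randn(200, 10, dtype=torch.double)   # 200 random 10d inputs (arbitrary count)

with torch.no_grad():
    y = lin(x)                                 # shape (200, 100)

# The outputs are 100d vectors, but they span no more directions than the input has.
rank = torch.linalg.matrix_rank(y).item()
print(y.shape, rank)
```

The rank of the output matrix is bounded by the input dimension, which is the "10d subspace of the 100d output space" in numbers.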

Now we ask for out-channels < in-channels * w * h if we want the output to be a somewhat compressed representation (an extreme case for linear layers is word vectors or other dense representations of categorical features). However, in the up part of a Unet, I would think that we are not really compressing information any more.
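
The dimension counting for a single convolution can be written out as plain arithmetic. This is only a sketch: `conv_io_dims` is a hypothetical helper, and the layer sizes are made-up examples rather than values taken from the lesson notebooks.

```python
def conv_io_dims(in_channels, kernel_h, kernel_w, out_channels):
    "Return (inputs per patch, outputs per patch) for one conv layer."
    n_in = in_channels * kernel_h * kernel_w   # values the kernel sees at one position
    n_out = out_channels                       # values it produces at that position
    return n_in, n_out

# Hypothetical layers: (in_channels, kernel_h, kernel_w, out_channels)
layers = {
    "1 channel, 3x3 kernel, 32 filters": (1, 3, 3, 32),
    "1 channel, 5x5 kernel, 8 filters": (1, 5, 5, 8),
    "64 channels, 3x3 kernel, 128 filters": (64, 3, 3, 128),
}
for name, args in layers.items():
    n_in, n_out = conv_io_dims(*args)
    kind = "compressing" if n_out < n_in else "not compressing"
    print(f"{name}: {n_in} inputs -> {n_out} outputs ({kind})")
```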

You could probably reason about what dimension the space of “reasonable images” has (much lower than pixel-count) and that you’d need to come from a representation that actually resolves this.
It’d be very interesting to see if one can find a pattern of how much is a good ratio and where.

At some point, things like stride vs. correlation of adjacent outputs and pooling probably also play a role.

Best regards

Thomas

jeremy · April 8, 2019, 7:31pm

Wow that’s fantastic!

Why not?

t-v · April 8, 2019, 8:05pm

Thanks!

I would think you would not have compression in the up sampling part of UNet because the dimension of “reasonable” outputs is typically lower than the number of pixels. For segmentation maps that certainly seems to be the case. For natural images my mental model is wavelet compression, which - albeit lossy - seems to hint at a limited output space. As such, if we have an efficient representation of what’s going on, my intuition would be that we have some sort of decompression because the output isn’t as efficient.

jeremy · April 8, 2019, 8:13pm

The key difference in the upsampling path is that you are decreasing the number of filters each layer, vs increasing in the downsampling path. So that’s one reason that compression may not be wanted/needed - you already have some compression in the filter dimension.
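
To put rough numbers on this, here is a sketch that counts activations per stage. It assumes the common pattern where every stage changes the resolution by a factor of 2 and the filter count by a factor of 2, and it ignores the skip connections a real U-Net concatenates; `stage_sizes` and the starting sizes are hypothetical.

```python
def stage_sizes(channels, size, n_stages, direction):
    "List (channels, size, total activations) for each stage of a hypothetical path."
    out = []
    for _ in range(n_stages + 1):
        out.append((channels, size, channels * size * size))
        if direction == "down":
            channels, size = channels * 2, size // 2   # more filters, smaller grid
        else:
            channels, size = channels // 2, size * 2   # fewer filters, larger grid
    return out

down = stage_sizes(channels=16, size=64, n_stages=3, direction="down")
up = stage_sizes(channels=down[-1][0], size=down[-1][1], n_stages=3, direction="up")

for name, path in (("down", down), ("up", up)):
    for channels, size, total in path:
        print(f"{name}: {channels:4d} filters at {size}x{size} -> {total} activations")
```

Under these assumptions the filter count shrinks along the up path while the total number of activations per stage grows, since the grid grows faster than the filters shrink.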

DrHB · April 8, 2019, 8:23pm

def append_stats(hook, mod, inp, outp):
    if not hasattr(hook,'stats'): hook.stats = ([],[],[])
    means,stds,hists = hook.stats
    means.append(outp.data.mean().cpu())
    stds .append(outp.data.std().cpu())
    hists.append(outp.data.cpu().histc(40,0,10))

Please correct me if my understanding is wrong. Here, in hists.append, we take the activations from one batch and bin them into 40 bins ranging from 0 to 10 (in total we do this 108 times, once for each batch).
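
To see what that `histc` call does in isolation, here is a toy example. The activation values are invented for illustration; the bin count and range are the ones used in `append_stats`.

```python
import torch

acts = torch.tensor([-1.0, 0.1, 0.2, 0.3, 5.1, 9.9, 12.0])  # made-up activations
hist = acts.histc(40, 0, 10)   # 40 equal-width bins over [0, 10], each 0.25 wide

print(hist.shape)   # one count per bin
print(hist[:2])     # counts for [0, 0.25) and [0.25, 0.5)
print(hist.sum())   # values outside [0, 10] (-1.0 and 12.0 here) are not counted
```

Note that `histc` ignores values outside the given range, so negative activations would not show up in a histogram built this way.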

and later:

def get_min(h):
    h1 = torch.stack(h.stats[2]).t().float()
    return h1[:2].sum(0)/h1.sum(0)

Is there any specific reason why we use the first two bins, h1[:2], and not just the first one, h1[:1]?

jeremy · April 8, 2019, 8:27pm

Not really carefully - I just figured that small activations are mainly dead activations.
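
As a sketch of what `get_min` computes: after stacking and transposing, row i of `h1` holds bin i across all batches, so with 40 bins over [0, 10] the slice `h1[:2]` covers activations in [0, 0.5), while `h1[:1]` would cover only [0, 0.25). The histograms below are fabricated toy data with 4 bins instead of 40.

```python
import torch

# Three fake per-batch histograms with 4 bins each (toy stand-in for hook.stats[2])
hists = [torch.tensor([6., 2., 1., 1.]),
         torch.tensor([5., 0., 3., 2.]),
         torch.tensor([1., 1., 4., 4.])]

h1 = torch.stack(hists).t().float()          # shape (bins, batches) = (4, 3)
frac_first_two = h1[:2].sum(0) / h1.sum(0)   # share of activations in the two lowest bins
frac_first_one = h1[:1].sum(0) / h1.sum(0)   # share in the lowest bin only

print(frac_first_two)
print(frac_first_one)
```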

DrHB · April 8, 2019, 8:27pm

got it! thanks!

stas · April 8, 2019, 9:03pm

@t-v, I’d like to understand the conv as unfold+@+view - would you kindly annotate the 2nd half of the Unfold example? those numbers make no sense to the uninitiated - 4,5? 7,8?

The first part is mostly clearly annotated and I could follow why the numbers are the way they are. The second one I couldn’t.

Thanks.

Also it might be helpful to use a square image and square kernel in the example, so it’s easier to follow

axelstram · April 9, 2019, 12:20am

What would be the difference between RunningBatchNorm and Batch Renormalization?

jeremy · April 9, 2019, 12:23am

I’ve been asked that once before, but I don’t see how they’re similar. Am I missing something? (I mean, they’re trying to solve a similar problem, but I don’t follow how they’re similar other than that?..)

axelstram · April 9, 2019, 12:41am

Maybe they are not! I just found that paper almost by accident and I didn’t have time yet to read it in depth, but saw that they were addressing the same issue as RunningBatchNorm and was curious if someone had already done the comparison and found what the similarities/differences were. I will try to implement it later, seems like a nice exercise at least.

stas · April 9, 2019, 12:46am

If you’re trying to develop your own BN version you may find these handy to work with a tiny balanced subset of the data (for 07_batchnorm.ipynb) and then you could do some print()-style debug from within your BN version:

x_train,y_train,x_valid,y_valid = get_data()

x_train,x_valid = normalize_to(x_train,x_valid)

def get_subset(x, y, n_classes):
    "extract only entries that are in n_classes (e.g. n_classes=2 for int classes: 0, 1)"    
    return list(zip(*[(x[i],y[i]) for i in range(len(y)) if y[i] < n_classes]))

def get_sample(x, y, n_classes, sample_size): 
    "extract only a sample size from each class"    
    cnt = torch.zeros(n_classes)
    return list(zip(*[(x[i],y[i]) for i in range(len(y)) if cnt[y[i]].add_(1) <= sample_size]))

sample_size = 10 # set to 0 to go full dataset
if sample_size:
    
    n_classes = 2
    
    # stage 1 - get only n-classes
    x5_train,y5_train = get_subset(x_train, y_train, n_classes)
    x5_valid,y5_valid = get_subset(x_valid, y_valid, n_classes)
    # stage 2 - get a sub-sample
    x6_train,y6_train = get_sample(x5_train, y5_train, n_classes, sample_size)
    x6_valid,y6_valid = get_sample(x5_valid, y5_valid, n_classes, sample_size)

    train_ds,valid_ds = Dataset(x6_train, y6_train),Dataset(x6_valid, y6_valid)
    c = n_classes
else:
    train_ds,valid_ds = Dataset(x_train, y_train),Dataset(x_valid, y_valid)
    c = y_train.max().item()+1

nh = 50
bs = 2
nfs = [8] # fewer layers
#nfs = [8,16,32,64,64]
loss_func = F.cross_entropy
data = DataBunch(*get_dls(train_ds, valid_ds, bs), c)
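
For trying the idea outside the notebook, here is a self-contained variant on toy tensors. It is a sketch, not the code above: `balanced_sample` is a hypothetical name, it merges the two stages into one pass, and it counts with a `collections.Counter` instead of an in-place tensor counter.

```python
import torch
from collections import Counter

def balanced_sample(x, y, n_classes, sample_size):
    "Keep at most `sample_size` items for each class in range(n_classes)."
    seen = Counter()
    keep = []
    for i, label in enumerate(y.tolist()):
        if label < n_classes and seen[label] < sample_size:
            seen[label] += 1
            keep.append(i)
    idx = torch.tensor(keep, dtype=torch.long)
    return x[idx], y[idx]

# Toy data: 10 items with 2 features each, labels from 3 classes
x = torch.arange(20.).view(10, 2)
y = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1, 0, 1])

xs, ys = balanced_sample(x, y, n_classes=2, sample_size=2)
print(xs.shape, ys)
```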

tanyaroosta · April 9, 2019, 4:16am

Batch norm questions:

What are the pros and cons of running batch norm before or after the ReLU step?
In RunningBatchNorm, the moving average weight (mom) is set to 0.9. Is it possible to make this a learnable parameter during training instead of fixing it at a certain value?

Generalized ReLU question:

Can we learn the max value for clamp and the leak during training, instead of fixing the values ahead of time as is done in the notebooks?
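
Mechanically, nothing prevents making these values trainable: wrapping them in `nn.Parameter` lets gradients reach them. The following is a sketch only. `LearnableGeneralRelu` is a hypothetical name, the initial values are arbitrary, and whether learning them actually helps training is not something this thread establishes. (For the leak alone, PyTorch already ships `nn.PReLU`.)

```python
import torch
from torch import nn

class LearnableGeneralRelu(nn.Module):   # hypothetical, not part of fastai
    def __init__(self, leak=0.1, sub=0.4, maxv=6.0):
        super().__init__()
        # Parameters instead of plain floats, so the optimizer can update them
        self.leak = nn.Parameter(torch.tensor(float(leak)))
        self.sub = nn.Parameter(torch.tensor(float(sub)))
        self.maxv = nn.Parameter(torch.tensor(float(maxv)))

    def forward(self, x):
        x = torch.where(x > 0, x, x * self.leak)   # leaky part, slope is learnable
        x = x - self.sub                           # learnable shift
        return torch.minimum(x, self.maxv)         # learnable upper clamp

act = LearnableGeneralRelu()
x = torch.tensor([-2.0, -0.5, 0.5, 3.0, 10.0])
out = act(x)
out.sum().backward()   # all three parameters now have a gradient
print(out, act.leak.grad, act.sub.grad, act.maxv.grad)
```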

stas · April 9, 2019, 5:05am

What’s the approach that should be used for module __init__'s arguments that are going to be used during inference time?

For example, in RunningBatchNorm the mom arg is only used during training so it is of no concern.

On the other hand we need eps to be defined during inference, so is it not a must for eps to be a registered buffer? If it’s not stored, won’t we have an error during inference after current fastai’s export / load_learner sequence since forward won’t have self.eps defined as the model won’t save it?

t-v · April 9, 2019, 6:58am

Maybe it’s easier when we name all shape constants (I chose them all different in order to at least let you pattern match, but 10 months in, I should have named them):

import torch

# Convolution is equivalent to Unfold + Matrix Multiplication + Fold (or view to output shape)

# some input batch
bs, in_channels, in_h, in_w = 1, 3, 10, 12 # take different sizes for w, h to see which is which
inp = torch.randn(bs, in_channels, in_h, in_w)
out_channels, kernel_h, kernel_w = 2, 4, 5

# output shape, see "shape" section in https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d
# but it's much simplified here as stride and dilation are 1
out_h = in_h - (kernel_h - 1) 
out_w = in_w - (kernel_w - 1)
w = torch.randn(out_channels, in_channels, kernel_h, kernel_w) # the conv weight

# so now, after unfolding
inp_unf = torch.nn.functional.unfold(inp, (kernel_h, kernel_w)) # unfold the input
assert inp_unf.shape == (bs, in_channels * kernel_h * kernel_w, out_h * out_w)
# and reshaping the conv weight
w_linear = w.view(out_channels, in_channels * kernel_h * kernel_w)

# this is linear:
out_unf = inp_unf.transpose(1, 2).matmul(w_linear.t()).transpose(1, 2)   # apply linear transformation on "dimension 1"

# check output shape
assert out_unf.shape == (bs, out_channels, out_h * out_w)

# and now we just need to get back into shape
out = torch.nn.functional.fold(out_unf, (out_h, out_w), (1, 1))  # the (1, 1) stride is fixed here because we didn't calculate stuff we don't need
# or equivalently (and avoiding a copy),
# out = out_unf.view(bs, out_channels, out_h, out_w)

(torch.nn.functional.conv2d(inp, w) - out).abs().max() # compare
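
Since a square image and a square kernel were asked for, here is a condensed square variant of the same check. It is a sketch with arbitrarily chosen sizes (8x8 image, 3x3 kernel), using stride 1 and no padding as above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
bs, c_in, size, c_out, k = 1, 3, 8, 2, 3        # arbitrary square sizes
inp = torch.randn(bs, c_in, size, size)
w = torch.randn(c_out, c_in, k, k)
out_size = size - (k - 1)                       # stride 1, no padding

# Each column of `patches` is one flattened k x k patch across all input channels
patches = F.unfold(inp, k)                      # (bs, c_in*k*k, out_size*out_size)

# The conv weight as a plain matrix, applied to every patch at once
out = (w.view(c_out, -1) @ patches).view(bs, c_out, out_size, out_size)

max_diff = (F.conv2d(inp, w) - out).abs().max().item()
print(patches.shape, out.shape, max_diff)
```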

Best regards

Thomas

t-v · April 9, 2019, 7:24am

If you save the state_dict and eps is not included, you’ll get back the default eps when you take a fresh instance (with default eps if you don’t specify it) and load the state_dict.
If you save the module, eps will be saved as well, though that is usually not recommended because it may act funny when code changes between save and load (see in particular the last sentence).

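import torch

a = torch.nn.BatchNorm1d(10, eps=0.01)
torch.save(a, "/tmp/x.pt")
# note: recent PyTorch releases default to weights_only=True here and need
# torch.load("/tmp/x.pt", weights_only=False) to unpickle a whole module
b = torch.load("/tmp/x.pt")
print("loaded module eps", b.eps)   # eps travels with the pickled module
c = torch.nn.BatchNorm1d(10)
c.load_state_dict(a.state_dict())
print("loaded state dict eps", c.eps)   # eps is not in the state_dict, so c keeps its default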

Kaspar · April 9, 2019, 8:44am

It’s all about mapping the input to a representation where the classes (output) are easier to separate, so whether adding dimensions is a good idea must depend on the data.
For example, quaternions are a 4-dimensional representation of 3-dimensional Euler orientations. In quaternion space, manipulation of angles becomes simple additions/subtractions, which is way more efficient for computers.

t-v · April 9, 2019, 9:09am

Good point! To me, the absence of singularities in the parametrisation seems to be even more important than the computational aspects (where the log/exp/sin/arcsin/… will also play a rôle). Note, though, that the log(!)-unit-quaternions are a decidedly non-linear embedding. The non-linearity might be the much more important part and the dimension only a necessity to achieve that.

Kernel machines, too, are an example of going into high dimensions to disentangle representation, and they’re an ML classic. However:

If you’re applying ReLU next, you’ll not benefit much from having (many more) dimensions.
For kernel machines (and maybe your rotation-as-quaternions example, too), you crucially need the interaction of several inputs to reap benefits, which we usually don’t have per se in NNs.
Conventional wisdom these days says that depth beats width.
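
A tiny illustration of the interaction point, assuming XOR on +/-1 inputs is a fair stand-in: no linear rule on the two raw inputs separates it, but one extra feature that multiplies the inputs does. The brute-force grid over weights is only there for illustration.

```python
from itertools import product

# XOR on +/-1 inputs: label is +1 when the signs differ, -1 when they agree
points = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]

def separates(w1, w2, b, feats):
    "True if sign(w1*f1 + w2*f2 + b) matches the label on every point."
    return all(label * (w1 * f1 + w2 * f2 + b) > 0 for (f1, f2), label in feats)

grid = [i / 2 for i in range(-6, 7)]   # weights and bias in {-3, -2.5, ..., 3}

# Raw inputs (x1, x2): no setting on the grid works
raw_ok = any(separates(w1, w2, b, points) for w1, w2, b in product(grid, repeat=3))

# Replace x2 by the interaction x1*x2: a single weight on that feature suffices
lifted = [((x1, x1 * x2), label) for (x1, x2), label in points]
lifted_ok = separates(0, -1, 0, lifted)

print(raw_ok, lifted_ok)
```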

Best regards

Thomas

tomsthom · April 9, 2019, 10:22am

After doing some tests on the ratio of “dead” activations and ReLU customization, it seems the most important setting is subtracting ~0.5 using the Generalized ReLU.

With this parameter alone, we can get the ratios shown in lesson 10 of less than 20% of “dead” activations.
The leaky ReLU or clamping gave better-looking histograms (fewer “reboots” visible on the histograms) but the ratio of “dead” activations is still very high (more than 80%).

But when I look at the performance of the model, I have not found a big difference between a standard ReLU and the customized ReLU (subtracting ~0.5 and/or with leak).
I would have guessed that using all activations would give much better results than the standard ReLU with a lot of dead activations. But it’s very close.
I will do more testing to try to find out an explanation. Do you get the same results?

stas · April 9, 2019, 4:59pm

Thank you for confirming that eps should be a registered buffer, @t-v. fastai currently saves only the model params via state_dict.

Using a default would be a bug since that would be different from what the model was trained with (if the default was overridden).
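
A minimal sketch of the registered-buffer approach, to show that a non-default eps then survives a state_dict round trip. `WithEps` is a hypothetical toy module, not RunningBatchNorm, and its forward is only a placeholder.

```python
import torch
from torch import nn

class WithEps(nn.Module):   # hypothetical toy module
    def __init__(self, eps=1e-5):
        super().__init__()
        # A buffer is saved in the state_dict but is not a trainable parameter
        self.register_buffer('eps', torch.tensor(eps))

    def forward(self, x):
        return x / (x.std() + self.eps)   # placeholder use of eps

a = WithEps(eps=0.01)        # "trained" with a non-default eps
b = WithEps()                # fresh instance with the default eps
b.load_state_dict(a.state_dict())

print(list(a.state_dict().keys()), b.eps)
```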

Given that we often have modules defined in a Jupyter notebook, I don’t think the defaults will even work, since PyTorch won’t be able to find the source.