Part 2 Lesson 14 Wiki

This is a wiki thread. Please add links/tips/etc.

<<< Wiki: Lesson 13

Links

Papers

What is your seq2seq_reg doing?

def seq2seq_reg(output, xtra, loss, alpha=0, beta=0):
    hs,dropped_hs = xtra
    if alpha:  # Activation Regularization
        loss = loss + sum(alpha * dropped_hs[-1].pow(2).mean())
    if beta:   # Temporal Activation Regularization (slowness)
        h = hs[-1]
        if len(h)>1: loss = loss + sum(beta * (h[1:] - h[:-1]).pow(2).mean())
    return loss

The alpha part is relatively easy to see: it’s L2 regularization of the last hidden layer’s activations, right? (Not sure about the “dropped” part.)

Beta part is beyond me.

Read paper, still confused.
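For anyone else stuck on the two terms, here is a minimal sketch of what they compute, assuming the last-layer RNN outputs have shape (seq_len, batch, hidden). The function name `ar_tar_penalty` and the toy tensors are made up for illustration and are not part of fastai. AR is an L2 penalty on the activations after dropout (the "dropped" ones); TAR penalizes how much the hidden state changes from one time step to the next, which is where `h[1:] - h[:-1]` comes from.

```python
import torch

def ar_tar_penalty(raw_h, dropped_h, alpha, beta):
    """Hypothetical helper. raw_h / dropped_h: last-layer RNN outputs
    before / after dropout, shape (seq_len, batch, hidden)."""
    # AR: mean squared activation (after dropout), scaled by alpha
    ar = alpha * dropped_h.pow(2).mean()
    # TAR: mean squared difference between consecutive time steps, scaled by beta
    if len(raw_h) > 1:
        tar = beta * (raw_h[1:] - raw_h[:-1]).pow(2).mean()
    else:
        tar = raw_h.new_zeros(())
    return ar, tar

# A hidden state that never changes over time has no TAR penalty.
steady = torch.ones(5, 2, 3)
ar_steady, tar_steady = ar_tar_penalty(steady, steady, alpha=2.0, beta=1.0)

# A hidden state that grows by 1 at every step is penalized by TAR.
ramp = torch.arange(5.).view(5, 1, 1).expand(5, 2, 3)
ar_ramp, tar_ramp = ar_tar_penalty(ramp, ramp, alpha=2.0, beta=1.0)

print(ar_steady.item(), tar_steady.item())
print(ar_ramp.item(), tar_ramp.item())
```

So beta is the "slowness" knob: the larger it is, the more the model is pushed toward hidden states that change gradually along the sequence.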

Maybe a dumb question, but why do you need a ReLU at all? Could you just have two back-to-back convs there, since the ReLU is also changing things?

Re: BatchNorm: parallel processing on multiple GPUs. Any tips for doing this with the current fastai codebase?

Could you do gradient clipping or lower learning rates at the beginning? And why is res scaling different than reducing the learning rate? Just curious if he tried other more normal tricks before going to this strange res_scaling thing.
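To make the comparison concrete, here is a rough sketch of residual scaling, assuming it means multiplying the residual branch by a small constant before adding it back to the input. `ScaledResBlock` is a hypothetical block written for this post, not the lesson's exact code. The point it illustrates is that the scaling acts in the forward pass, so it shrinks the activations the block produces, whereas a lower learning rate or gradient clipping only changes the size of the weight updates.

```python
import torch
import torch.nn as nn

class ScaledResBlock(nn.Module):
    # Hypothetical block for illustration only.
    def __init__(self, channels, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.res_scale = res_scale

    def forward(self, x):
        # The residual branch is shrunk before being added to the identity path.
        return x + self.body(x) * self.res_scale

torch.manual_seed(0)
x = torch.randn(1, 8, 16, 16)
block = ScaledResBlock(8, res_scale=0.1)

# How far the block moves its input, with and without scaling (same weights).
delta_scaled = (block(x) - x).norm()
block.res_scale = 1.0
delta_full = (block(x) - x).norm()
print(delta_scaled.item(), delta_full.item())
```

With the same weights, the scaled block changes its input by a tenth as much, and that effect compounds when many blocks are stacked. Changing the learning rate would leave both numbers above untouched.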

If you want to try LARS, it’s very easy to implement as an optimizer in pytorch (did it in this gist).

Isn’t that what the NVIDIA demo is doing?

Are we using VGG16 in the model? SrResnet seems to build a model from scratch?

What is a “learnable convolution” and what is an example of a convolution that isn’t learnable?
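One way to picture the difference, as a sketch rather than what was meant in the lecture: a learnable convolution has its kernel stored as parameters that the optimizer updates, while a non-learnable one uses a fixed, hand-written kernel. The Sobel edge filter and the toy image below are my own choice of example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Learnable: the 3x3 kernel is an nn.Parameter and gets gradients during training.
learnable = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)

# Not learnable: a fixed Sobel kernel that responds to vertical edges.
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).view(1, 1, 3, 3)

# Toy image: dark on the left, bright in the two rightmost columns.
img = torch.zeros(1, 1, 5, 5)
img[..., 3:] = 1.0

edges = F.conv2d(img, sobel_x, padding=1)   # fixed filter, nothing to train
learned = learnable(img)                    # output depends on trainable weights

print(learnable.weight.requires_grad, sobel_x.requires_grad)
```

Both are the same operation; the only difference is whether the kernel values are trained or fixed ahead of time.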

Curious about your context: isn’t that what the NVIDIA thing is doing?

Why are we using these little 3x3 squares of every color, instead of using noise in the new pixels?

I understand why we don’t just leave them blank, and maybe why we don’t copy the nearest-neighbors. But why not noise?

Does this mean I can replace

m = nn.DataParallel(m, [0,2])

with something, to get rid of the error below?

RuntimeError: cuda runtime error (10) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu:204

Because then the sequential layers would functionally just be one layer, I think.

Yeah, you probably want to change the [0,2] to only contain numbers that actually correspond to GPUs on your computer. Like, maybe [0,1]?

Can he explain progressive resizing again? I don’t understand how to use it

Thanks … but yeah, I had tried [0,0] and it didn’t help; [0,1] didn’t either.
I wonder how to find out what the correct values would be!
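A quick way to check which device ids PyTorch can actually see is to ask `torch.cuda` directly. This is only a sketch; `available_device_ids` is a helper name made up here. Note that if `CUDA_VISIBLE_DEVICES` is set, the visible GPUs are renumbered starting from 0, so the ids may not match what `nvidia-smi` shows.

```python
import torch

def available_device_ids():
    # Hypothetical helper: ids are 0 .. device_count-1 as seen by this process.
    return list(range(torch.cuda.device_count()))

ids = available_device_ids()
print("CUDA available:", torch.cuda.is_available())
print("Valid device ids:", ids)
for i in ids:
    print(i, torch.cuda.get_device_name(i))

# Assuming `m` is your model, only ids from this list are safe to pass:
# m = nn.DataParallel(m, device_ids=ids)
```

If the list comes back as `[0]`, there is only one visible GPU and `nn.DataParallel` has nothing to split across.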

Huh… I wonder if using load_state_dict(strict=False) would work as a quick way to load weights from a pretrained model. Say, a pretrained Keras/TensorFlow RetinaNet, if you more or less match the architecture in PyTorch.
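Here is a small sketch of what `strict=False` does and does not do, using two toy `nn.Sequential` models invented for the example. Keys are matched by name; with `strict=False`, keys that exist on only one side are skipped instead of raising. A tensor whose name matches but whose shape differs still raises an error, and weights from another framework would first need their names (and often their layouts) converted to match.

```python
import torch
import torch.nn as nn

# Stand-in for a "pretrained" model.
src = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# Same first layers plus a new head that has no pretrained weights.
dst = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2), nn.Linear(2, 3))

# strict=False: load every key that matches by name, skip the rest.
result = dst.load_state_dict(src.state_dict(), strict=False)

print("missing in checkpoint:", result.missing_keys)
print("unexpected in checkpoint:", result.unexpected_keys)
```

In recent PyTorch versions the return value lists which keys were skipped, which is worth checking: if the names don't line up at all, everything ends up in `missing_keys` and no weights are loaded, without any error.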

Also, can we use progressive resizing to match the idea of backbone + head?

Is that a checkerboard pattern on the bluejay?
