Part 2 Lesson 14 Wiki


(Rachel Thomas) #1

This is a wiki thread. Please add links/tips/etc.

<<< Wiki: Lesson 13

Links

Papers


(YangLu) #2

What is your seq2seq_reg doing?

def seq2seq_reg(output, xtra, loss, alpha=0, beta=0):
    hs,dropped_hs = xtra
    if alpha:  # Activation Regularization
        loss = loss + sum(alpha * dropped_hs[-1].pow(2).mean())
    if beta:   # Temporal Activation Regularization (slowness)
        h = hs[-1]
        if len(h)>1: loss = loss + sum(beta * (h[1:] - h[:-1]).pow(2).mean())
    return loss

The alpha part is relatively easy to see: it’s L2 regularization of the last hidden layer, right? (Not sure about the dropped part.)

The beta part is beyond me.

I read the paper and I’m still confused.
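In case a toy example helps, here’s a minimal sketch of what the two penalties compute, on made-up tensors (my stand-ins for what fastai passes in as `xtra`, not the real RNN outputs):

```python
import torch

# hs / dropped_hs in the snippet above are lists of hidden states, one per
# layer, each of shape (seq_len, batch, hidden). Toy stand-ins:
seq_len, bs, nh = 5, 2, 4
h = torch.randn(seq_len, bs, nh)                      # hs[-1]
dropped_h = h * (torch.rand_like(h) > 0.3).float()    # crude dropout stand-in

alpha, beta = 2.0, 1.0

# AR (alpha): plain L2 penalty on the *dropped* activations of the last
# layer - it pushes the surviving activations toward zero.
ar = alpha * dropped_h.pow(2).mean()

# TAR (beta): L2 penalty on the *difference between consecutive time steps*
# of the non-dropped activations - it punishes the hidden state for changing
# quickly from step to step, hence "slowness".
tar = beta * (h[1:] - h[:-1]).pow(2).mean()

print(ar.item(), tar.item())
```

So beta doesn’t regularize the values themselves, only how fast they move over time: `h[1:] - h[:-1]` is the step-to-step delta of the hidden state.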


(Kevin Bird) #12

Maybe a dumb question, but why do you need a ReLU at all? Could you just have two back-to-back convs there? The ReLU is also changing things, isn’t it?


(adrian) #14

Re: BatchNorm: parallel processing on multiple GPUs — any tips for doing this with the current fastai codebase?


(blake west) #15

Could you do gradient clipping or lower learning rates at the beginning? And why is res scaling different from reducing the learning rate? Just curious whether he tried other, more normal tricks before going to this strange res_scaling thing.


#16

If you want to try LARS, it’s very easy to implement as an optimizer in pytorch (did it in this gist).
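For anyone curious what that looks like, here’s my own minimal sketch of LARS (layer-wise adaptive rate scaling, You et al. 2017) as a PyTorch optimizer — this is a hypothetical, untuned version, not the gist linked above:

```python
import torch

class LARS(torch.optim.Optimizer):
    """Sketch of LARS: SGD with momentum, where each layer's update is
    rescaled by a trust ratio ||w|| / (||g|| + wd * ||w||)."""
    def __init__(self, params, lr=0.1, momentum=0.9, weight_decay=0.0,
                 trust_coef=0.001):
        defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay,
                        trust_coef=trust_coef)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                g, wd = p.grad, group['weight_decay']
                w_norm, g_norm = p.norm(), g.norm()
                # Layer-wise trust ratio: big weights / small grads -> take
                # a bigger step for this layer, and vice versa.
                if w_norm > 0 and g_norm > 0:
                    local_lr = group['trust_coef'] * w_norm / (g_norm + wd * w_norm)
                else:
                    local_lr = 1.0
                d_p = g + wd * p
                buf = self.state[p].setdefault('momentum_buffer',
                                               torch.zeros_like(p))
                buf.mul_(group['momentum']).add_(d_p, alpha=float(local_lr) * group['lr'])
                p.add_(-buf)
```

Usage is the same as any other optimizer: `opt = LARS(model.parameters(), lr=0.1)` and then the usual `loss.backward(); opt.step()`.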


(Gerardo Garcia) #17

Isn’t that what the NVIDIA demo is doing?


(Ramesh Sampath) #18

Are we using VGG16 in the model? SrResnet seems to build a model from scratch?


(KRO) #19

What is a “learnable convolution” and what is an example of a convolution that isn’t learnable?


(KRO) #20

Curious about your context: isn’t that what the NVIDIA thing is doing?


(Gerardo Garcia) #21

(KRO) #22

Why are we using these little 3x3 squares of every color, instead of using noise in the new pixels?

I understand why we don’t just leave them blank, and maybe why we don’t copy the nearest-neighbors. But why not noise?
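For context on those repeated squares: starting the sub-pixel conv from noise tends to produce the checkerboard artifact, because each of the scale² sub-pixels of an output pixel comes from a different random filter. The ICNR trick (Aitken et al. 2017) initializes all scale² sub-channels identically, so the upsample starts out like nearest-neighbour. A hedged sketch (my own minimal version, not fastai’s `icnr`):

```python
import torch

def icnr_(weight, scale=2):
    """Initialise a conv feeding nn.PixelShuffle(scale) so that every
    group of scale**2 output channels starts identical - the initial
    upsample is then checkerboard-free."""
    out_ch, in_ch, kh, kw = weight.shape
    assert out_ch % (scale ** 2) == 0
    # one random filter per *final* channel...
    sub = torch.randn(out_ch // (scale ** 2), in_ch, kh, kw) * 0.1
    with torch.no_grad():
        # ...repeated scale**2 times, matching PixelShuffle's channel layout
        weight.copy_(sub.repeat_interleave(scale ** 2, dim=0))

conv = torch.nn.Conv2d(3, 3 * 2 ** 2, 3, padding=1)
icnr_(conv.weight, scale=2)
up = torch.nn.PixelShuffle(2)   # 2x2 sub-pixels now share a filter at init
```

So it’s not that noise is impossible, just that identical-at-init sub-filters remove the checkerboard starting point while training is still free to differentiate them.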


(KRO) #23

Does this mean I can replace

m = nn.DataParallel(m, [0,2])

with something, to get rid of the error below?

RuntimeError: cuda runtime error (10) : invalid device ordinal at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/THCTensorCopy.cu:204


(KRO) #24

Because then the sequential layers would functionally just be one layer, I think.
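That’s easy to check numerically: two stacked convs with no nonlinearity between them (bias off, to keep it clean) compose into a single linear map, so the pair adds parameters but no expressive power. A quick sketch:

```python
import torch

torch.manual_seed(0)
c1 = torch.nn.Conv2d(3, 8, 3, padding=1, bias=False)
c2 = torch.nn.Conv2d(8, 3, 3, padding=1, bias=False)
f = lambda x: c2(c1(x))                      # conv -> conv, no ReLU

x1 = torch.randn(1, 3, 16, 16)
x2 = torch.randn(1, 3, 16, 16)

# Linearity check: f(x1 + x2) == f(x1) + f(x2), up to float error.
print(torch.allclose(f(x1 + x2), f(x1) + f(x2), atol=1e-5))   # True

# With a ReLU in between, the same check fails - that's the extra
# expressive power the nonlinearity buys.
g = lambda x: c2(torch.relu(c1(x)))
print(torch.allclose(g(x1 + x2), g(x1) + g(x2), atol=1e-5))   # False
```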


(blake west) #25

Yeah, you probably want to change the [0,2] to only contain numbers that actually correspond to GPUs on your machine. Like, maybe [0,1]?


(nkiruka chuka-obah) #26

Can he explain progressive resizing again? I don’t understand how to use it.
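Until someone posts the lesson-exact recipe, here’s my rough sketch of the idea: train the same model at a small image size first, then rebuild the dataloaders at a bigger size and keep training with the same weights. `make_loader` and the sizes are placeholders I made up — in fastai you’d swap the data on the learner instead:

```python
import torch

# A model that works at any input size (the adaptive pool is what makes
# progressive resizing possible without changing the head).
model = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(3, 2),
)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def make_loader(size, n=8):
    # stand-in for re-creating your dataset/transforms at `size`
    xs = torch.randn(n, 3, size, size)
    ys = torch.randint(0, 2, (n,))
    return list(zip(xs.split(2), ys.split(2)))

for size in [64, 128, 256]:            # grow the images, keep the weights
    for xb, yb in make_loader(size):
        loss = torch.nn.functional.cross_entropy(model(xb), yb)
        opt.zero_grad(); loss.backward(); opt.step()
```

The early small-image epochs are cheap and act a bit like data augmentation; the later large-image epochs fine-tune the same weights at full resolution.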


(KRO) #27

Thanks … but yeah, I had tried [0,0] and it didn’t help; [0,1] didn’t either.
I wonder how to find out what the correct values would be!
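One way to find out: the ordinals passed to `DataParallel` must all be below `torch.cuda.device_count()` (and distinct — repeating 0 as in [0,0] doesn’t help). A quick check:

```python
import torch

# List the device ordinals that are actually valid on this machine.
n = torch.cuda.device_count()
valid_ids = list(range(n))
print(n, valid_ids)
for i in valid_ids:
    print(i, torch.cuda.get_device_name(i))

# e.g. nn.DataParallel(m, device_ids=valid_ids[:2]) is then safe,
# provided you have at least two GPUs.
```

If this prints 1, then only `[0]` is valid and the "invalid device ordinal" error from `[0,2]` is expected.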


(Wayne Nixalo) #28

Huh… I wonder if using load_state_dict(strict=False) would work as a quick way to load weights from a pretrained model. Say, a pretrained Keras/TensorFlow RetinaNet, if you more or less match the architecture in PyTorch.
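It can work for the overlapping layers — a toy sketch of the mechanics (note `strict=False` skips missing/extra keys, but it still errors on a shape mismatch for a key that does match, so you may need to drop those yourself):

```python
import torch

src = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 2))
dst = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 3))  # head differs

state = src.state_dict()
# Drop the mismatched head so strict=False doesn't hit a shape error.
state = {k: v for k, v in state.items() if not k.startswith('1.')}

result = dst.load_state_dict(state, strict=False)
print(result.missing_keys)        # -> the new head's keys, left untouched
assert torch.equal(dst[0].weight, src[0].weight)   # body weights copied
```

The bigger hurdle with a Keras/TensorFlow checkpoint is that the key names and weight layouts won’t line up at all, so you’d first have to rename/transpose tensors into the PyTorch model’s `state_dict` convention — `strict=False` only helps once the names agree.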


(Sneha Nagpaul) #29

Also, can we use progressive resizing to match the idea of backbone + head?


(Kevin Bird) #30

Is that a checkerboard pattern on the bluejay?