def seq2seq_reg(output, xtra, loss, alpha=0, beta=0):
    hs, dropped_hs = xtra
    if alpha:  # Activation Regularization
        loss = loss + sum(alpha * dropped_hs[-1].pow(2).mean())
    if beta:   # Temporal Activation Regularization (slowness)
        h = hs[-1]
        if len(h) > 1: loss = loss + sum(beta * (h[1:] - h[:-1]).pow(2).mean())
    return loss
The alpha part is relatively easy to see: it’s L2 regularization of the last hidden layer, right? (Not sure about the dropped part.)
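The way I read the code, the alpha term is an L2 penalty on the post-dropout activations of the last RNN layer (dropped_hs[-1]), while the beta term penalizes how much the raw activations (hs[-1]) change between consecutive timesteps. A minimal sketch of the two terms, assuming hidden states shaped (seq_len, batch, n_hidden); the names raw_h and dropped_h are mine:

import torch

seq_len, bs, nh = 10, 4, 8
raw_h = torch.randn(seq_len, bs, nh)                        # stands in for hs[-1]
dropped_h = raw_h * (torch.rand_like(raw_h) > 0.5).float()  # stands in for dropped_hs[-1]

alpha, beta = 2.0, 1.0
ar  = alpha * dropped_h.pow(2).mean()                # Activation Regularization: L2 on the dropped activations
tar = beta * (raw_h[1:] - raw_h[:-1]).pow(2).mean()  # Temporal AR: penalize change across timesteps
print(ar.item(), tar.item())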
Maybe a dumb question, but why do you need a ReLU at all? Could you just have two back-to-back convs there, since the ReLU is also changing things, isn’t it?
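Not a dumb question; my understanding is that the ReLU is exactly what stops the two convs from collapsing into a single linear op. A quick sketch with linear layers (convolutions are linear too, so the same argument applies):

import torch
import torch.nn as nn

torch.manual_seed(0)
a = nn.Linear(8, 8, bias=False)
b = nn.Linear(8, 8, bias=False)
x = torch.randn(3, 8)

stacked = b(a(x))                        # two back-to-back layers, no ReLU
merged  = x @ (b.weight @ a.weight).t()  # a single equivalent linear layer
print(torch.allclose(stacked, merged, atol=1e-6))    # True: the second layer adds no capacity
with_relu = b(torch.relu(a(x)))          # the ReLU breaks the equivalence
print(torch.allclose(with_relu, merged, atol=1e-6))  # False in general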
Could you do gradient clipping or lower learning rates at the beginning? And why is res scaling different from reducing the learning rate? Just curious whether he tried other, more standard tricks before going to this strange res_scaling thing.
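For what it’s worth, the difference as I understand it: lowering the learning rate or clipping gradients shrinks every update globally, whereas residual scaling only damps the output of each residual branch in the forward pass, so the identity path still carries the signal at full strength. A rough sketch of the two knobs (ScaledResidual and res_scale are my names, not from the paper):

import torch
import torch.nn as nn

class ScaledResidual(nn.Module):
    def __init__(self, block, res_scale=0.1):
        super().__init__()
        self.block, self.res_scale = block, res_scale

    def forward(self, x):
        # only the branch output is damped; the identity path is untouched
        return x + self.res_scale * self.block(x)

block = ScaledResidual(nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, 16, 3, padding=1)))
x = torch.randn(2, 16, 8, 8)
block(x).pow(2).mean().backward()

# gradient clipping, by contrast, rescales the whole gradient vector after backward
torch.nn.utils.clip_grad_norm_(block.parameters(), max_norm=1.0)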
Huh… I wonder if using load_state_dict(strict=False) would work as a quick way to load weights from a pretrained model. Say a pretrained Keras/TensorFlow RetinaNet, if you more or less match the architecture in PyTorch.
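It should work as long as the parameter names and shapes line up: strict=False just skips missing/unexpected keys (as far as I know it still errors on shape mismatches), and the returned object tells you what was skipped. You would also have to convert the Keras/TensorFlow checkpoint into a PyTorch state_dict first, and watch the NHWC vs NCHW weight layouts. A toy sketch with two made-up models:

import torch
import torch.nn as nn

class Pretrained(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)
        self.old_head = nn.Linear(16, 10)

class Mine(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)  # name + shape match -> gets loaded
        self.new_head = nn.Linear(16, 5)                # no match -> stays at init

src, dst = Pretrained(), Mine()
result = dst.load_state_dict(src.state_dict(), strict=False)
print("missing:", result.missing_keys)        # ['new_head.weight', 'new_head.bias']
print("unexpected:", result.unexpected_keys)  # ['old_head.weight', 'old_head.bias']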