Correct way of incorporating L1 regularisation

(Sachin) #1

This is more of a mathematical question than a coding one. After reading fastai’s blog post on AdamW, I was wondering what the correct way of incorporating an L1 penalty is. The two callbacks (ways) of doing it are listed below:

class L1Loss(LearnerCallback):
    def __init__(self, learn, beta=1e-2):
        super().__init__(learn)
        self.beta = beta
    def on_backward_end(self, **kwargs):
        # decoupled L1: shrink each weight towards zero after the backward pass,
        # outside of the gradient / optimizer machinery
        optimizer = self.learn.opt.opt
        for group in optimizer.param_groups:
            for param in group['params']:
                sign = param.data / (torch.abs(param.data) + 1e-9)
                param.data = param.data - self.beta * sign

and second one is to just add it to the loss:

class L1Loss(LearnerCallback):
    def __init__(self, learn, beta=0.2):
        super().__init__(learn)
        self.beta = beta
    def on_backward_begin(self, **kwargs):
        # add beta * sum(|w|) to the loss before the backward pass
        weights = [torch.abs(v).sum() for k, v in self.learn.model.named_parameters()
                   if 'bias' not in k]
        last_loss = kwargs['last_loss'] + self.beta * sum(weights)
        return {'last_loss': last_loss}

I ran the experiments in a Colab notebook. It seems the ‘proper’ (decoupled) way pushes more weights close to zero.

The question I have is: given enough time, would both ways end up at the same answer? Or does raw Adam interact with any kind of weight penalty in a way that pushes weights in certain directions?
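To make the question concrete (my reading of the AdamW post, standard Adam notation): with the penalty in the loss, the `beta * sign(w)` term becomes part of the gradient and gets divided by Adam’s second-moment estimate, so the effective shrinkage differs per weight; the decoupled update applies it uniformly:

```
Penalty in the loss:   g_t = \nabla L(w_t) + \beta \,\mathrm{sign}(w_t)
                       w_{t+1} = w_t - \eta \,\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

Decoupled:             g_t = \nabla L(w_t)
                       w_{t+1} = w_t - \eta \,\hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon) - \beta \,\mathrm{sign}(w_t)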


(Akash Palrecha) #2

I think you mean L1 Regularization and not L1 loss?


(Sachin) #3

whoops. Yep. Corrected it. Thanks!