This is more of a mathematical question than a coding one. After reading fastai's blog post on AdamW, I was wondering what the correct way of incorporating an L1 penalty is. The two callbacks (ways) of doing it are listed below:
class L1Loss(LearnerCallback):
    "Shrink the weights directly after the backward pass (decoupled from the gradients)."
    def __init__(self, learn, beta=1e-2):
        super().__init__(learn)
        self.beta = beta

    def on_backward_end(self, **kwargs):
        optimizer = self.learn.opt.opt
        for group in optimizer.param_groups:
            for param in group['params']:
                # Smoothed sign(param); the 1e-9 avoids division by zero.
                sign = param.data / (torch.abs(param.data) + 1e-9)
                # Fixed step of beta toward zero, never seen by Adam's moments.
                param.data = param.data - self.beta * sign
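Written out (ignoring the 1e-9 smoothing term), the update this first callback applies to every weight is

    w ← w − beta · sign(w)

i.e. a fixed shrinkage toward zero that bypasses the gradients and Adam's moment estimates entirely, which I read as the L1 analogue of AdamW's decoupled weight decay.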
The second way is to just add the penalty to the loss:
class L1Loss(LearnerCallback):
    "Add an L1 penalty on all non-bias parameters to the loss before the backward pass."
    def __init__(self, learn, beta=0.2):
        super().__init__(learn)
        self.beta = beta

    def on_backward_begin(self, **kwargs):
        # Sum of absolute values of every non-bias parameter.
        weights = [torch.abs(v).sum() for k, v in self.learn.model.named_parameters()
                   if 'bias' not in k]
        last_loss = kwargs['last_loss'] + self.beta * sum(weights)
        return {'last_loss': last_loss}
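For reference, the gradient of beta * sum(|w|) that this second version feeds into Adam is just beta * sign(w), which Adam then rescales through its moment estimates instead of applying as a fixed step. A quick standalone check (plain PyTorch, not part of the callback):

    import torch

    beta = 0.2
    w = torch.randn(5, requires_grad=True)
    penalty = beta * w.abs().sum()   # the term the second callback adds to the loss
    penalty.backward()
    print(w.grad)                    # equals beta * torch.sign(w)
    print(beta * torch.sign(w))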
I ran the experiments in a Colab notebook. It seems the 'proper' way (the first callback, which modifies the weights directly) pushes more weights close to zero.
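For anyone who wants to poke at this, here is a minimal plain-PyTorch sketch of the comparison (not the actual Colab notebook; the toy data, beta, learning rate and the 1e-3 "near zero" threshold are all arbitrary choices for illustration):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(1024, 50)
    true_w = torch.zeros(50)
    true_w[:5] = 1.0                                  # only 5 informative features
    y = X @ true_w + 0.1 * torch.randn(1024)

    def train(mode, beta=1e-3, steps=2000):
        model = nn.Linear(50, 1, bias=False)
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(steps):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(X).squeeze(-1), y)
            if mode == "loss_penalty":                # second callback: L1 added to the loss
                loss = loss + beta * sum(p.abs().sum() for p in model.parameters())
            loss.backward()
            if mode == "decoupled":                   # first callback: shrink weights directly,
                with torch.no_grad():                 # after backward and before the Adam step
                    for p in model.parameters():
                        p -= beta * torch.sign(p)
            opt.step()
        w = model.weight.detach()
        return (w.abs() < 1e-3).float().mean().item() # fraction of near-zero weights

    print("decoupled   :", train("decoupled"))
    print("loss penalty:", train("loss_penalty"))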
The question I have is: given enough time, would both approaches end up at the same answer? Or does raw Adam with any kind of weight penalty push the weights in certain directions?