 # Correct way of incorporating L1 regularisation

(Sachin) #1

This is more of a mathematical question than a coding one. After reading fastai’s blog post on AdamW, I was wondering what the correct way of incorporating L1 regularisation is. The two candidate callbacks (two ways of doing it) are listed below:

```python
import torch
from fastai.basic_train import LearnerCallback

class L1Loss(LearnerCallback):
    def __init__(self, learn, beta=1e-2):
        super().__init__(learn)
        self.beta = beta

    def on_backward_end(self, **kwargs):
        # Decoupled update: after the backward pass, shrink every weight
        # towards zero by beta * sign(w), bypassing Adam's adaptive scaling
        optimizer = self.learn.opt.opt
        for group in optimizer.param_groups:
            for param in group['params']:
                sign = param.data / (torch.abs(param.data) + 1e-9)
                param.data = param.data - self.beta * sign
```

and the second one is to just add it to the loss:

```python
class L1Loss(LearnerCallback):
    def __init__(self, learn, beta=0.2):
        super().__init__(learn)
        self.beta = beta

    def on_backward_begin(self, **kwargs):
        # Penalised loss: add beta * sum(|w|) over non-bias parameters,
        # so the L1 gradient flows through the optimizer with the task loss
        weights = [torch.abs(v).sum() for k, v in self.learn.model.named_parameters()
                   if 'bias' not in k]
        last_loss = kwargs['last_loss'] + self.beta * sum(weights)
        return {'last_loss': last_loss}
```

I ran the experiments in a Colab notebook. It seems the ‘proper’ way (the first callback, which shrinks the weights directly after the backward pass) pushes more weights close to zero.

The question I have is: given enough time, would both ways end up at the same answer? Or does raw Adam with any kind of weight penalty push the weights in certain directions?
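For what it’s worth, the difference can be sketched in plain PyTorch on a toy problem (a minimal illustration with a made-up quadratic loss, not my actual training setup): the decoupled update shrinks every weight by a fixed `beta * sign(w)` no matter what Adam does, while the penalised-loss version sends the L1 gradient through Adam’s per-parameter adaptive scaling, so the effective shrinkage differs per weight.

```python
import torch

torch.manual_seed(0)
beta = 1e-2

# Two copies of the same initial parameters
w1 = torch.randn(5, requires_grad=True)              # decoupled L1
w2 = w1.detach().clone().requires_grad_(True)        # L1 added to the loss
opt1 = torch.optim.Adam([w1], lr=1e-2)
opt2 = torch.optim.Adam([w2], lr=1e-2)
init_mean = w1.detach().abs().mean().item()

for _ in range(100):
    # Decoupled: ordinary loss step, then shrink by beta * sign(w)
    loss1 = (w1 ** 2).sum()                          # stand-in task loss
    opt1.zero_grad(); loss1.backward(); opt1.step()
    with torch.no_grad():
        w1 -= beta * torch.sign(w1)

    # Penalised loss: the L1 term goes through Adam's adaptive scaling
    loss2 = (w2 ** 2).sum() + beta * w2.abs().sum()
    opt2.zero_grad(); loss2.backward(); opt2.step()

print("decoupled |w|:", w1.abs().mean().item())
print("penalised |w|:", w2.abs().mean().item())
```

Both runs shrink the weights, but the decoupled version applies the same absolute pull on every weight, which is (roughly) why it produces more weights near zero in my experiments.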


(Akash Palrecha) #2

I think you mean L1 Regularization and not L1 loss?


(Sachin) #3

whoops. Yep. Corrected it. Thanks!
