This is more of a mathematical question than a coding one. After reading fastai's blog post on AdamW, I was wondering what the correct way of incorporating an L1 penalty is. The two callbacks (ways) of doing it are listed below:
class L1Loss(LearnerCallback):
    "Shrink the weights directly after the backward pass (decoupled from the gradients)."
    def __init__(self, learn, beta=1e-2):
        super().__init__(learn)
        self.beta = beta

    def on_backward_end(self, **kwargs):
        optimizer = self.learn.opt.opt
        for group in optimizer.param_groups:
            for param in group['params']:
                # Smoothed sign(param); the 1e-9 avoids division by zero.
                sign = param.data / (torch.abs(param.data) + 1e-9)
                # Fixed step of beta toward zero, never seen by Adam's moments.
                param.data = param.data - self.beta * sign
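Written out (ignoring the 1e-9 smoothing term), the update this first callback applies to every weight is

    w ← w − beta · sign(w)

i.e. a fixed shrinkage toward zero that bypasses the gradients and Adam's moment estimates entirely, which I read as the L1 analogue of AdamW's decoupled weight decay.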
The second way is to just add the penalty to the loss:
class L1Loss(LearnerCallback):
    "Add an L1 penalty on all non-bias parameters to the loss before the backward pass."
    def __init__(self, learn, beta=0.2):
        super().__init__(learn)
        self.beta = beta

    def on_backward_begin(self, **kwargs):
        # Sum of absolute values of every non-bias parameter.
        weights = [torch.abs(v).sum() for k, v in self.learn.model.named_parameters()
                   if 'bias' not in k]
        last_loss = kwargs['last_loss'] + self.beta * sum(weights)
        return {'last_loss': last_loss}
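For reference, the gradient of beta * sum(|w|) that this second version feeds into Adam is just beta * sign(w), which Adam then rescales through its moment estimates instead of applying as a fixed step. A quick standalone check (plain PyTorch, not part of the callback):

    import torch

    beta = 0.2
    w = torch.randn(5, requires_grad=True)
    penalty = beta * w.abs().sum()   # the term the second callback adds to the loss
    penalty.backward()
    print(w.grad)                    # equals beta * torch.sign(w)
    print(beta * torch.sign(w))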
I ran the experiments in a Colab notebook. It seems the 'proper' way (the first callback, which modifies the weights directly) pushes more weights close to zero.
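For anyone who wants to poke at this, here is a minimal plain-PyTorch sketch of the comparison (not the actual Colab notebook; the toy data, beta, learning rate and the 1e-3 "near zero" threshold are all arbitrary choices for illustration):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    X = torch.randn(1024, 50)
    true_w = torch.zeros(50)
    true_w[:5] = 1.0                                  # only 5 informative features
    y = X @ true_w + 0.1 * torch.randn(1024)

    def train(mode, beta=1e-3, steps=2000):
        model = nn.Linear(50, 1, bias=False)
        opt = torch.optim.Adam(model.parameters(), lr=1e-2)
        for _ in range(steps):
            opt.zero_grad()
            loss = nn.functional.mse_loss(model(X).squeeze(-1), y)
            if mode == "loss_penalty":                # second callback: L1 added to the loss
                loss = loss + beta * sum(p.abs().sum() for p in model.parameters())
            loss.backward()
            if mode == "decoupled":                   # first callback: shrink weights directly,
                with torch.no_grad():                 # after backward and before the Adam step
                    for p in model.parameters():
                        p -= beta * torch.sign(p)
            opt.step()
        w = model.weight.detach()
        return (w.abs() < 1e-3).float().mean().item() # fraction of near-zero weights

    print("decoupled   :", train("decoupled"))
    print("loss penalty:", train("loss_penalty"))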
The question I have is: given enough time, would both approaches end up at the same answer? Or does raw Adam with any kind of weight penalty push the weights in certain directions?