Hi all, I've been trying to see if I can apply weight decay to the SGD function from lesson 2, but I've noticed that the results are actually slightly worse when weight decay is applied. What I'm doing is generating mini-batches of 100 samples each and then applying them in a loop for SGD, printing the loss every 10 mini-batches.
def update_normal(x, y, lr=1e-1):
    y_hat = x@a                          # predictions with the current weights
    loss = mse(y, y_hat)
    if t % 10 == 0: print(loss)          # print the loss every 10 mini-batches
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)              # plain SGD step
        a.grad.zero_()
    return loss.item()
def update_wd(x, y, lr=1e-1):
    wd = 1e-3
    y_hat = x@a
    w2 = (a**2).sum()                    # L2 penalty on the weights
    loss = mse(y, y_hat) + wd * w2       # add the penalty term to the loss
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * a.grad)
        a.grad.zero_()
    return loss.item()
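For reference, I believe weight decay is often applied directly in the update step rather than as a penalty in the loss. Here is a rough sketch of what I think that form would look like, using the same globals (a, t, mse) as above; note the gradient of wd * (a**2).sum() is 2 * wd * a, so this version with wd should roughly match the loss-based version with wd / 2.

def update_wd_grad(x, y, lr=1e-1, wd=1e-3):
    y_hat = x@a
    loss = mse(y, y_hat)                     # plain MSE, no penalty term in the loss
    if t % 10 == 0: print(loss)
    loss.backward()
    with torch.no_grad():
        a.sub_(lr * (a.grad + wd * a))       # weight decay folded into the update
        a.grad.zero_()
    return loss.item()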
n = 100
x_new = torch.ones(n, 100, 2)        # 100 mini-batches of 100 samples, 2 features each
x_new[:, :, 0].uniform_(-1, 1)       # first feature uniform in [-1, 1], second stays at 1
a = torch.tensor([3., 2.])           # the "true" parameters used to generate y
y = x_new@a + torch.rand(n)          # targets with added noise
### WITHOUT WEIGHT DECAY
a = torch.tensor([-1., 1.])
a = nn.Parameter(a); a
without_losses = []
for idx, x_vals in enumerate(x_new):
    t = idx
    loss_val = update_normal(x_vals, y[idx])
    without_losses.append(loss_val)
Above are the modifications I've made; I've separated them into two functions just to make the difference clear. The only other modification (as mentioned) is making 100 batches with 100 samples each. I haven't looped through the parameters like in the lesson 5 example because this isn't using a PyTorch model! (A rough sketch of what that loop would look like is just below.) After that are my results with weight decay added.
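For completeness, here's a rough sketch of what I think the lesson 5 style parameter loop would look like if I were using a PyTorch model (assuming a hypothetical nn.Module called model and the same mse as above):

def update_model_wd(x, y, model, lr=1e-1, wd=1e-3):
    w2 = 0.
    for p in model.parameters():
        w2 += (p**2).sum()               # accumulate the L2 penalty over all parameters
    loss = mse(y, model(x)) + wd * w2
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.sub_(lr * p.grad)
            p.grad.zero_()
    return loss.item()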
# WITH WEIGHT DECAY
a = torch.tensor([-1., 1.])
a = nn.Parameter(a); a
wd_losses = []
for idx, x_vals in enumerate(x_new):
    t = idx
    loss_val = update_wd(x_vals, y[idx])
    wd_losses.append(loss_val)
My output:
tensor(7.4193, grad_fn=<AddBackward0>)
tensor(1.4599, grad_fn=<AddBackward0>)
tensor(0.4672, grad_fn=<AddBackward0>)
tensor(0.1465, grad_fn=<AddBackward0>)
tensor(0.1061, grad_fn=<AddBackward0>)
tensor(0.0980, grad_fn=<AddBackward0>)
tensor(0.0918, grad_fn=<AddBackward0>)
tensor(0.0877, grad_fn=<AddBackward0>)
tensor(0.0898, grad_fn=<AddBackward0>)
tensor(0.0916, grad_fn=<AddBackward0>)
And without weight decay:
### WITHOUT
a = torch.tensor([-1., 1.])
a = nn.Parameter(a); a
without_losses = []
for idx, x_vals in enumerate(x_new):
    t = idx
    loss_val = update_normal(x_vals, y[idx])
    without_losses.append(loss_val)
My output when not using weight decay:
tensor(7.4173, grad_fn=<MeanBackward0>)
tensor(1.4525, grad_fn=<MeanBackward0>)
tensor(0.4554, grad_fn=<MeanBackward0>)
tensor(0.1328, grad_fn=<MeanBackward0>)
tensor(0.0916, grad_fn=<MeanBackward0>)
tensor(0.0830, grad_fn=<MeanBackward0>)
tensor(0.0767, grad_fn=<MeanBackward0>)
tensor(0.0733, grad_fn=<MeanBackward0>)
tensor(0.0749, grad_fn=<MeanBackward0>)
tensor(0.0760, grad_fn=<MeanBackward0>)
Why do I get (albeit slightly) worse results when weight decay is applied? Is this because the model in question (linear) is not complex at all, so weight decay isn't helping? Or have I not applied weight decay correctly?