S: Intuitively, the difference between L1 norm and mean squared error (MSE) is that the latter will penalize bigger mistakes more heavily than the former (and be more lenient with small mistakes).

I want to try to intuitvely understand this concept. I figured having smaller differences between my tensors would lead to my l2 being more lenient, ie, smaller. But I’m unable to repro the same. Could someone help me understand how I could pick a better example to illustrate this concept?

Your example does somehow reflect the concept. One problem is that as the inputs gets smaller, the losses alos become smaller and the effect of the MSE squaring is less pronounced.

If you take:
a = tensor([.1,.2])
b = tensor([.4,.7])
F.l1_loss(a,b) = 0.4
F.mse_loss(a,b).sqrt() = 0.4123
MSE/L1 = 0.4123/0.4 = 1.03

let’s say we change the target tensor to have a bigger loss:

So as you can see, the MSE loss increased 1.28 times the L1 loss for the same change in target. This illustrates that MSE (due to the squaring factor) increases faster than L1, and is more sensitive to outliers.