I’m working on a blog post going over the various PyTorch loss functions, and I noticed that both TensorFlow and PyTorch include only Mean Squared Error (MSE) in their built-in libraries of loss functions, not Root Mean Squared Error (RMSE).

It is my understanding that MSE has the benefit of being easier to compute (sqrt can be an expensive operation) and of weighting outliers more heavily (which may not actually be a benefit). RMSE, on the other hand, has the benefit of being on the same scale as the errors themselves (the same units as the target) while still weighting larger errors more heavily. Furthermore, I think MSE is more likely to have numerical issues if the difference between the predicted and actual values is very large. I also don’t see a major difference in calculating the gradient of either function.
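
To make the comparison concrete, here is a minimal pure-Python sketch of both losses and their gradients with respect to the predictions. It is not how either library implements anything; the function names (`mse`, `rmse`, `mse_grad`, `rmse_grad`) and the sample values are made up for illustration. By the chain rule, the RMSE gradient is the MSE gradient divided by `2 * RMSE`, so it is undefined when the error is exactly zero.

```python
import math


def mse(pred, target):
    """Mean squared error over two equal-length sequences."""
    n = len(pred)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / n


def rmse(pred, target):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(pred, target))


def mse_grad(pred, target):
    """d(MSE)/d(pred_i) = 2 * (pred_i - target_i) / n."""
    n = len(pred)
    return [2.0 * (p - t) / n for p, t in zip(pred, target)]


def rmse_grad(pred, target):
    """Chain rule: d(RMSE)/d(pred_i) = d(MSE)/d(pred_i) / (2 * RMSE)."""
    r = rmse(pred, target)
    if r == 0.0:
        # sqrt is not differentiable at 0, so there is no gradient here.
        raise ZeroDivisionError("RMSE gradient is undefined at zero error")
    return [g / (2.0 * r) for g in mse_grad(pred, target)]


# Hypothetical sample values, chosen only for illustration.
pred = [2.0, 4.0, 7.0]
target = [1.0, 4.0, 3.0]

print("MSE:", mse(pred, target))
print("RMSE:", rmse(pred, target))
print("MSE gradient:", mse_grad(pred, target))
print("RMSE gradient:", rmse_grad(pred, target))
```

In PyTorch itself, RMSE can be obtained by wrapping the built-in loss, e.g. `torch.sqrt(torch.nn.MSELoss()(pred, target))`, usually with a small epsilon added inside the square root to sidestep the zero-error case.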

So what am I missing? Why include the L1 norm (MAE) but not the L2 norm (RMSE), when it seems to have quite a few advantages over MSE?