We need to define first what we mean by “best.” We define this precisely by choosing a

loss function, which will return a value based on a prediction and a target, where lower values of the function correspond to “better” predictions. For continuous data, it’s common to usemean squared error:

The above quote is from chapter 4 in the ‘An End-to-End SGD Example’ section.

Why is the mean squared error better for continuous data than the mean absolute difference (L1 norm)? Is it because we don’t expect there to be large mistakes, only small ones, and the mean squared error is more lenient with smaller errors?