So that’s actually the strangest part.

I start with MSE_loss for a super-convergence cycle of about 20 epochs. Then, on my held-out test sample, I evaluate both the total MSE across all 16 outputs and the individual MSE for each of the 16 outputs separately. I actually only care about one of them; I’m using the rest as a sort of regularization technique, since all 16 have a known relationship to one another. BTW, predicting 16 outputs instead of 1 has given much better results than focusing on just the 1 output I want. It seems to force the network to learn the full signal of the relationship between all 16 outputs.
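
For anyone who wants to reproduce the evaluation, here is a rough sketch of the total and per-output MSE computation. It uses NumPy with random stand-in data rather than real predictions, and `per_output_mse` and `TARGET_IDX` are made-up names for illustration only.

```python
import numpy as np

def per_output_mse(y_true, y_pred):
    """Return (total MSE over all outputs, array of MSE per output column)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sq_err = (y_true - y_pred) ** 2
    per_output = sq_err.mean(axis=0)  # one value per output, shape (n_outputs,)
    total = sq_err.mean()             # single value over every sample and output
    return total, per_output

# Stand-in data: 100 test samples, 16 outputs (purely synthetic).
rng = np.random.default_rng(0)
y_true = rng.normal(size=(100, 16))
y_pred = y_true + rng.normal(scale=0.1, size=(100, 16))

total, per_output = per_output_mse(y_true, y_pred)

TARGET_IDX = 0  # hypothetical index of the one output that matters
print("total MSE:", total)
print("MSE for target output:", per_output[TARGET_IDX])
```

Because every output column has the same number of samples, the total MSE is just the mean of the 16 per-output values, which makes a handy sanity check.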

So anyway, the metric I’m tracking is the MSE for the 1 out of 16 outputs I actually care about. My benchmark model just uses the current value of the datapoint as the prediction for the future. Training with MSE_loss gets me to 80% of the benchmark’s loss (measured as MSE), so a 20% improvement on the one output I care about. Then, if I change the criterion to L1_loss and run another 20 epochs of super-convergence, the loss drops to 70% of the benchmark (again measured as MSE, not L1), so another 10 percentage points of improvement versus the benchmark. I’ve rerun this several times, BTW: with different lengths of super-convergence, with multiple cycles of super-convergence, and with no super-convergence at all (using SGD with restart mults instead), and the phenomenon holds.
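
To make the "percent of benchmark" number concrete, here is a minimal sketch of how that ratio could be computed against a "current value as prediction" baseline. The arrays are toy numbers, not real results, and `loss_vs_benchmark` is a hypothetical helper name.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def loss_vs_benchmark(y_future, y_current, y_model):
    """Model MSE as a fraction of the benchmark's MSE.

    The benchmark predicts that the future value equals the current value.
    A result of 0.8 means the model reaches 80% of the benchmark's loss.
    """
    benchmark_mse = mse(y_future, y_current)
    model_mse = mse(y_future, y_model)
    return model_mse / benchmark_mse

# Toy numbers, for illustration only.
y_current = np.array([1.0, 2.0, 3.0, 4.0])  # value at time t
y_future = np.array([2.0, 3.0, 4.0, 5.0])   # true value at time t + 1
y_model = np.array([1.5, 2.5, 3.5, 4.5])    # hypothetical model predictions

ratio = loss_vs_benchmark(y_future, y_current, y_model)
print(f"model loss is {ratio:.0%} of the benchmark")
```

The ratio is always computed with MSE here, regardless of which criterion the model was trained with, which is what keeps the MSE-trained and L1-trained runs comparable.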

My best guess as to what is going on is some sort of interaction between the distribution of my data and the distribution of the desired outputs. (To be clear, my data is still scaled using StandardScaler().) Since MSE_loss punishes outliers far more than L1_loss does, starting with MSE handles the consequences of outliers interacting with the randomized initial weights much more easily than L1_loss can. But once MSE has gotten the weights into the neighborhood of the right answer, it reaches an equilibrium between the benefit of focusing on outliers and the cost of such a heavy outlier focus. So when I switch to L1_loss, which handles a wide range of values much more gracefully than MSE_loss, it can “fine-tune” the weights without worrying about the underlying broad distribution of the targets.
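
The "MSE punishes outliers more" part is easy to see from the per-sample gradients with respect to the prediction: for squared error the gradient grows with the size of the residual, while for absolute error it has constant magnitude. The sketch below only illustrates that standard property with toy residuals; it does not prove the explanation above, and the function names are made up.

```python
import numpy as np

def mse_grad(pred, target):
    # Derivative of (pred - target)^2 with respect to pred.
    return 2.0 * (pred - target)

def l1_grad(pred, target):
    # Derivative of |pred - target| with respect to pred
    # (np.sign returns 0 at an exact match, a valid subgradient).
    return np.sign(pred - target)

target = np.zeros(3)
pred = np.array([0.1, 1.0, 10.0])  # small, medium, and outlier-sized errors

print("MSE gradients:", mse_grad(pred, target))
print("L1 gradients: ", l1_grad(pred, target))
```

With squared error, the sample that is off by 10 pulls on the weights 100 times harder than the one that is off by 0.1; with absolute error, all three pull equally.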

Hopefully that makes sense. Curious to hear what others make of this.