Changing Criterion During Training Provides Good Results

Hey guys, just wanted to share a phenomenon I’ve noticed on a project I’m working on. It’s a multi-output regression model with mixed inputs. While exploring different loss functions, I accidentally discovered that if I train with F.mse_loss for the first superconvergence cycle and then change the criterion to F.l1_loss for the second cycle, I get much better results than from one or two otherwise identical F.mse_loss cycles. By better results I mean much better generalization: my out-of-sample loss improves by about 10-12% using this method. Additionally, if I try to train from scratch using F.l1_loss, the model doesn’t want to learn at all and the results are quite poor.
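A minimal sketch of the idea of switching the criterion between training phases, written in plain Python so it runs without any dependencies. Everything here is an assumption made for illustration (the toy 1-D linear model, the synthetic data, the learning rates, and the helper names); it is not the model from this post, and it makes no claim about which loss ends up better. In PyTorch the equivalent step would be swapping F.mse_loss for F.l1_loss between cycles.

```python
import random

def mse_grad(residual):
    # Gradient of (pred - y)^2 with respect to pred.
    return 2.0 * residual

def l1_grad(residual):
    # Subgradient of |pred - y| with respect to pred: -1, 0 or 1.
    return (residual > 0) - (residual < 0)

def train_phase(w, b, data, grad_fn, epochs, lr):
    """Run plain SGD on y = w*x + b using the given loss gradient."""
    for _ in range(epochs):
        for x, y in data:
            g = grad_fn(w * x + b - y)
            w -= lr * g * x
            b -= lr * g
    return w, b

def mse(w, b, data):
    # Evaluation metric: always MSE, whichever loss was used to train.
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

random.seed(0)
# Hypothetical data: y = 3x + 1 plus a little Gaussian noise.
data = [(x / 10.0, 3.0 * (x / 10.0) + 1.0 + random.gauss(0, 0.1))
        for x in range(-20, 21)]

w, b = 0.0, 0.0
start = mse(w, b, data)

# Phase 1: train with the MSE gradient.
w, b = train_phase(w, b, data, mse_grad, epochs=20, lr=0.05)
after_mse = mse(w, b, data)

# Phase 2: keep the weights, switch the criterion to L1.
w, b = train_phase(w, b, data, l1_grad, epochs=20, lr=0.005)
after_l1 = mse(w, b, data)

print(start, after_mse, after_l1)
```

The point of the structure is that the weights carry over between phases and only the gradient function changes, while the evaluation metric stays fixed so the two phases are comparable.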

It’s a fairly simple 2 dense layer model, although there are a ton of embeddings as well.

Has anyone else had this experience? Any ideas what’s going on here? Perhaps just a distributional issue of my output data?


That’s an interesting finding. I remember in the course Jeremy mentioned using L1 a few times instead of mse and how it performs better. It’s interesting that it won’t converge on L1 to start.

When you say out of sample loss, I’m assuming you’re talking about L1 loss in both cases measured through a metric?

So that’s actually the strangest part.

I start with MSE_loss for a super-convergence cycle of like 20 epochs and then evaluate both the total MSE for all 16 outputs on my held-out test sample and the individual MSE for each of the 16 outputs separately. I actually only care about one of them; I’m using the rest of the outputs as a sort of regularization technique, since all 16 have a known relationship to one another. BTW, predicting 16 outputs instead of 1 has given much better results than focusing on just the 1 output I want. It somehow forces the network to learn the full signal of the relationship between all 16 outputs.

So anyway, the metric I’m tracking is the MSE loss for the 1 out of 16 outputs I actually care about. My benchmark model just uses the current value of the datapoint as the prediction for the future. Relative to that benchmark, training with MSE_loss gets me to 80% of the benchmark’s loss measured through MSE, so a 20% improvement on the one output I care about. Then if I change the criterion to L1_loss and run another 20 epochs of super-convergence, the loss drops to 70% of the benchmark (again measured through MSE loss, not L1 loss), so an improvement of another 10% versus the benchmark. I’ve re-run this several times, btw, with different lengths of super-convergence, with multiple cycles of super-convergence, and with no super-convergence at all (using SGD with restart mults instead), and this phenomenon holds.
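To make the "percent of benchmark" numbers concrete, here is a small sketch of how such a ratio could be computed for the single output of interest. The function names and the toy numbers are hypothetical; the only thing taken from the description above is that the benchmark predicts the future value with the current value and that both are scored with MSE.

```python
def mse(preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def relative_mse(model_preds, current_values, future_values):
    """Model MSE divided by the MSE of a naive 'no change' forecast.

    A value of 0.8 would mean the model reaches 80% of the benchmark loss.
    """
    baseline = mse(current_values, future_values)
    return mse(model_preds, future_values) / baseline

# Made-up values for the one target column that matters.
current = [1.0, 2.0, 3.0, 4.0]
future = [2.0, 3.0, 4.0, 5.0]
preds = [1.5, 2.5, 3.5, 4.5]

ratio = relative_mse(preds, current, future)
print(ratio)  # 0.25 for these toy numbers
```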

My best guess as to what is going on is some sort of interaction between the distribution of my data and the distribution of the desired outputs. (To be clear, my data is still scaled using StandardScaler().) Since MSE_loss punishes outliers far more than L1_loss, using MSE at first handles the consequences of outliers interacting with the randomized weights much more easily than L1_loss can. But then, once MSE has gotten the weights into the neighborhood of the right answer, it reaches an equilibrium between the benefit of focusing on outliers and the cost of such a high outlier focus. So when I switch to L1_loss, which can handle such a wide range of values much more gracefully than MSE_loss can, it can “fine-tune” the loss without worrying about the underlying broad distribution of the targets.

Hopefully that makes sense, curious to hear what others make of this.
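The "MSE punishes outliers far more than L1" part of this guess can be seen directly in the per-sample gradients: the MSE gradient grows linearly with the residual, while the L1 gradient has the same magnitude for any nonzero residual. A small sketch (the function names are just for illustration):

```python
def mse_grad(pred, target):
    # d/dpred of (pred - target)^2: proportional to the residual.
    return 2.0 * (pred - target)

def l1_grad(pred, target):
    # Subgradient of |pred - target|: only the sign of the residual.
    r = pred - target
    return float((r > 0) - (r < 0))

for residual in (0.1, 1.0, 10.0):
    print(residual, mse_grad(residual, 0.0), l1_grad(residual, 0.0))
```

So a sample that is off by 10 pulls 100 times harder than one that is off by 0.1 under MSE, but exactly as hard under L1. This shows the mechanism only; whether it explains the results in this thread is still a guess.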


That’s an interesting finding. I’d played around a little with changing loss functions when language modelling, but I’m surprised the effect is so pronounced here. It’s very strange to me that L1 loss wouldn’t converge but MSE would. Might be worth doing further data explorations.

Agreed. I’m diving deeper into the distributions of each of my series now. At this point I’ve scaled each series by a constant (chosen to be relevant to that series) and then used StandardScaler() on all series.

This is very interesting. Thank you. I have not seen papers that switch losses the way you are, as a kind of loss training schedule.

I am going to explore it for segmentation.


I observed a similar phenomenon in dsb2018 image segmentation, as did many others: train at first with cross-entropy and later with dice loss. Got much better results.
In my case, starting with dice, the model got pretty much nowhere, but after a cross-entropy cycle, dice improved the results quite a bit.
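For readers who have not used the two segmentation losses mentioned here, this is a rough sketch of binary cross-entropy, a soft dice loss, and a schedule that picks between them by epoch. The exact dice formulation (including the `smooth` term) and the `switch_epoch` parameter are assumptions for illustration, not what was used in that competition.

```python
import math

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

def soft_dice_loss(probs, targets, smooth=1.0):
    """1 - soft dice coefficient; `smooth` avoids division by zero."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(targets) + smooth)

def loss_for_epoch(epoch, switch_epoch):
    # Cross-entropy first, dice once `switch_epoch` is reached.
    return bce_loss if epoch < switch_epoch else soft_dice_loss

probs = [0.9, 0.8, 0.1]
targets = [1, 1, 0]
for epoch in (0, 10):
    loss_fn = loss_for_epoch(epoch, switch_epoch=10)
    print(epoch, loss_fn.__name__, loss_fn(probs, targets))
```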


Have you tried Huber loss? It seems to be a good choice for not punishing outliers too much.


I have not, but this seems like a good use case for Huber. I’ll try it on its own and I’ll try an MSE->Huber variant and report back.

So PyTorch doesn’t look like it has Huber loss as a default, and I haven’t coded a custom loss function for it yet.

In the meantime I’ve been thinking about ways to address the distributional issues on the pre-processing side of things. I posted another thread on the topic here, but I also wanted to ask you guys if you had any thoughts on pre-processing. I would like to be able to take the log of some of my features but can’t with others because they contain many negatives. So I’m wondering if there are any other ideas for how to process these features? Can I take the log of just some but not others? Am I asking for trouble by scaling and normalizing in all different ways?
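One option sometimes used for features that contain negatives is a sign-preserving log, sign(x) * log(1 + |x|), or the closely related inverse hyperbolic sine. The sketch below is only an illustration of those two transforms; whether either one preserves the relationships between these particular series is an open question, and the function names are made up.

```python
import math

def signed_log1p(x):
    """Log-like compression that keeps the sign and maps 0 to 0."""
    return math.copysign(math.log1p(abs(x)), x)

def signed_expm1(y):
    """Inverse of signed_log1p, to map predictions back."""
    return math.copysign(math.expm1(abs(y)), y)

for x in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    # math.asinh behaves similarly: roughly linear near 0, log-like far out.
    print(x, signed_log1p(x), math.asinh(x))
```

Both transforms are odd functions, so a value and its negative stay symmetric around zero after the transform.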

So it turns out Huber loss is implemented in PyTorch, but it’s named F.smooth_l1_loss.
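For reference, here is the per-sample Huber loss next to the smooth L1 form, in plain Python. With a threshold of 1 the two are the same function; for other thresholds they differ by a constant factor. The `delta` and `beta` parameter names follow common usage and are an assumption here, so check the PyTorch docs for your version before relying on them.

```python
def huber(r, delta=1.0):
    """Quadratic for small residuals, linear beyond `delta`."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

def smooth_l1(r, beta=1.0):
    """Smooth L1: the same shape as Huber, scaled by 1/beta."""
    a = abs(r)
    return 0.5 * r * r / beta if a < beta else a - 0.5 * beta

for r in (-3.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(r, huber(r), smooth_l1(r), r * r, abs(r))
```

Near zero the loss behaves like half the squared error, and for large residuals it grows linearly like L1, which fits the observation in this thread that it lands close to the MSE-then-MAE schedule.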

In terms of the results, after 2 cycles it basically matches my approach of going 1 cycle with MSE and 1 cycle with MAE.

Right now I’m focusing on finding solutions to my distributional issues that maintain the relationship between the features I’m trying to model.

Hmmm… now you mention it, IIRC @brendan and I also did something similar for the Planet Kaggle comp. I think we found it pretty useful.