Idea: Loss (Function) Finder?

Hey folks. Not sure if this is crazy or who might have already tried it, but here’s an idea that occurred to me…

Purpose

For well-studied problems (e.g., image classification), the choice of loss function may be obvious or mundane. But for new problems, it may not always be obvious which loss function (or combination of loss functions) would work the best. Given automated precedents such as grid search to try out different hyperparameters, Neural Architecture Search (NAS) to try out different architectures and activations, and FastAI’s LRFinder that suggests values for the learning rate, can we create a recommendation engine for which (combination of) loss function(s) are best suited for a given task? For example, Christian Steinmetz’s micro-TCN work recently blew away my SignalTrain model’s results, not only because of architecture changes but also due to a different (better) choice of loss function. If loss-suggestion were to be automated…who knows?

Basic Idea

Given a list of loss functions (e.g., losses=[mse, mae, log_cosh, delta_stft, wasserstein, ...]), see which one decreases by the greatest percentage over a given number of epochs during training.

I can conceive of this operating in one of two different modes.

1. “Static Setup” Mode

This would be the first, simplest thing to try. Similar to FastAI’s LRFinder: starting from the same initial state each time, we loop through a list of loss functions (i.e., for loss in losses:) and run a short training loop for each, making note of the percentage change in the value of the loss function.

(Question: Would it even be meaningful to try to compare log-loss type values with non-logarithmic ones in this way? I don’t know.)

Then the “winning” loss function is recommended to the user, who then starts their “real” training loop using only that loss function for computing gradients. (They may use another function for monitoring.)
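
Here’s a rough sketch of what I mean, in plain PyTorch rather than fastai — make_model, losses, and train_dl are just placeholder names for illustration, and the ranking is simply the fractional decrease of each candidate’s own loss over the short run:

```python
import copy
import torch

def find_loss(make_model, losses, train_dl, n_epochs=2, lr=1e-3):
    """Train briefly with each candidate loss from the same initial weights
    and rank candidates by the fractional decrease in their own loss value."""
    init_state = copy.deepcopy(make_model().state_dict())   # one fixed starting point
    results = {}
    for loss_fn in losses:
        model = make_model()
        model.load_state_dict(copy.deepcopy(init_state))    # identical initial state each run
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        first, last = None, None
        for _ in range(n_epochs):
            for xb, yb in train_dl:
                opt.zero_grad()
                loss = loss_fn(model(xb), yb)
                loss.backward()
                opt.step()
                if first is None:
                    first = loss.item()
                last = loss.item()
        name = getattr(loss_fn, '__name__', str(loss_fn))
        results[name] = (first - last) / first               # fractional decrease
    winner = max(results, key=results.get)                   # biggest decrease "wins"
    return winner, results
```

Whether fractional decrease is even comparable across losses on very different scales is exactly the question raised above — this sketch just assumes it is.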

2. “Dynamic Hydra” Mode

All loss functions L_j are evaluated concurrently at epoch t (thus denoted as L_j(t)) and their results combined into a total loss function L_T via a weighted sum in which each weight \lambda_j is given by the fractional change over some number \Delta t of recent epochs, something like:

$$ L_T(t) = \sum_j \lambda_j L_j(t) = \sum_j \left( 1 - \frac{L_j(t)}{L_j(t-\Delta t)}\right) L_j(t) \qquad (1) $$

where time t is measured in epochs, and previous values L_j(t-\Delta t) are treated as constant numerical vectors, detached from network graphs and autograd calculations.

This could be run at all epochs during extended training, and might even dynamically adjust which loss function is dominant at different points during training. It also would likely be slow as molasses and may exceed GPU VRAM. To make backpropagation faster, perhaps \lambda_j's that have values smaller than some threshold could be set to zero.

Note that the expression for \lambda_j in equation (1) is simplistic, and one might do better to use some kind of “(average) decay rate” expression for \lambda_j. Same basic idea though.
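
A rough sketch of how equation (1) might look inside a training step, again in plain PyTorch with placeholder names — prev_losses would hold the detached values L_j(t - Δt) saved some epochs earlier:

```python
import torch

def hydra_loss(preds, target, losses, prev_losses=None, thresh=0.05):
    """Weighted sum of all candidate losses (equation (1)): each weight is the
    fractional decrease of that loss since Delta-t epochs ago, detached from autograd."""
    total = 0.0
    current = []
    for j, loss_fn in enumerate(losses):
        L_j = loss_fn(preds, target)
        current.append(L_j.detach())                 # saved to serve as L_j(t - Δt) later
        if prev_losses is None:
            lam = 1.0 / len(losses)                  # no history yet: equal weights
        else:
            # lambda_j = 1 - L_j(t) / L_j(t - Δt), clamped at zero
            lam = max(0.0, 1.0 - (L_j.detach() / prev_losses[j]).item())
            if lam < thresh:
                lam = 0.0                            # weak weights zeroed (skipping the term
                                                     # entirely would actually save backprop)
        total = total + lam * L_j
    return total, current
```

In the training loop you’d call this every batch and refresh prev_losses from the returned current values every \Delta t epochs.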

More Questions

  1. Who has already tried something like this (‘this’ being the general idea, not just my ideas for Mode 1 or 2)? Given 30+ years of people doing ML,…? Answer: Maybe this NIPS 2018 paper…ish…kinda? Otherwise, Google’s not helping me find anything similar.

  2. If no one, is that because it’s a bad idea?

  3. Who might have the inclination (besides me) and ability (not me, but I’m learning) to maybe try to implement this, e.g. with fastai v2? (maybe @muellerzr?) I can work on it, just…I’m not fluent with fastai v2.

  4. If it’s new and turns out to be useful,…where to go next?

  5. Some loss functions kind of “plateau” at first before they really “take off” decreasing. Would Mode 1 fail to recommend them? A: Probably, because some other loss function that starts decreasing immediately would likely win out, and might be better anyway.

  6. You realize that equation (1) is just the sum of all the loss functions, minus some weirdly-weighted sum of squares of all the loss functions? A: Yea, is that…bad?
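
(For question 6, spelling out the substitution of \lambda_j from equation (1):)

$$ L_T(t) = \sum_j \left( 1 - \frac{L_j(t)}{L_j(t-\Delta t)}\right) L_j(t) = \sum_j L_j(t) - \sum_j \frac{L_j(t)^2}{L_j(t-\Delta t)} $$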

Thanks for reading. I welcome your thoughts.

The idea is certainly intriguing and as you said, given the success of NAS, searching for other hyperparameters could greatly boost performance too. I did find this paper that does exactly what you’re proposing (albeit differently) and it improves the score of already-existing models on various benchmarks.

However, like NAS, it does seem expensive and generally unstable, so not sure whether it’s practical or not. Maybe you could implement it and let us know how it goes?

Have a nice day/evening!

Oooo! Thanks for sharing that, @BobMcDear! I will check that out and see if I can get something to work!

The likelihood loss would be computed as (0.6)(0.6)(0.9)(0.9) = 0.2916. Since the model outputs probabilities for TRUE (or 1) only, when the ground truth label is 0 we take (1 − p) as the probability. In other words, we multiply the model’s outputted probabilities together for the actual outcomes.
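
A tiny sketch of that computation — the probabilities and labels here are hypothetical values chosen only to match the numbers above:

```python
# Hypothetical predicted probabilities P(y = 1) and ground-truth labels
probs  = [0.6, 0.4, 0.9, 0.9]
labels = [1,   0,   1,   1]

likelihood = 1.0
for p, y in zip(probs, labels):
    likelihood *= p if y == 1 else (1 - p)   # use (1 - p) when the true label is 0

print(likelihood)                            # 0.6 * 0.6 * 0.9 * 0.9 ≈ 0.2916
```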

No problem! Funnily enough, this just came out and seems like the most comprehensive and promising of the bunch. Having skimmed through it, I believe certain basic mathematical functions (addition, pooling, etc.) are combined through AutoML to give a robust loss function. It is more generic and can be applied to object detection, segmentation, and more.

The results seem encouraging and it crushes traditional loss functions like cross entropy, dice loss, etc. in many cases, but is on par with other loss functions derived via grid search (unlike other methods though, it’s more generic).

Given its recency, I unfortunately wasn’t able to find an implementation. Might do it myself in the (near) future.

Cheers!

Wow, that is awesome! Thank you for sharing that.
I’m so encouraged to learn that it wasn’t an obviously-stupid idea, and also that someone else has already done a lot of the hard work! :wink:
Amazing that that came out the day after I posted this question!

Yes, I was pleasantly surprised when I saw the paper published the day after your question!

And it’s by no means a stupid idea: Although currently, in my modest opinion, AutoML hasn’t been able to live up to its hype, I do believe most choices regarding neural network and hyperparameter design are going to be automated or at the very least strongly augmented through searching (whether it’s reinforcement learning, genetic algorithms, or something else) in the near future.

Good luck!

This works really well for us, thank you!