Be careful when using loss function value to compare models

I recently stumbled upon some not-so-obvious behavior of the Keras loss value while trying to find optimal model parameters. The model I was optimizing (Dense CNN) used log loss (binary crossentropy) as its loss function. While running experiments I noticed that the test set log loss calculated from the model predictions was very different from (lower, i.e. better than) the value reported by Keras model.evaluate and during training.
After a short investigation, I discovered that the L2 regularization penalty on the model weights was being added to the loss value (as it should be). This made the reported loss look worse than that of a model without L2 regularization, even though the regularized model performed better on the pure log loss metric!
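To make the arithmetic concrete, here is a minimal pure-Python sketch of how the reported value is put together: pure log loss plus an L2 term of the form lambda * sum(w^2). All numbers (predictions, weights, lambda) are made up for illustration and are not taken from my actual model.

```python
import math

def log_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy (the 'pure' log loss)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def l2_penalty(weights, lam):
    """L2 regularization term: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

# Hypothetical labels and predictions from two models.
y_true = [1, 0, 1, 1]
preds_plain = [0.80, 0.30, 0.70, 0.60]  # model without regularization
preds_reg = [0.85, 0.20, 0.80, 0.70]    # model with L2 regularization

# Hypothetical weights and regularization strength of the second model.
weights_reg = [0.5, -0.4, 0.3]
lam = 0.3

pure_plain = log_loss(y_true, preds_plain)
pure_reg = log_loss(y_true, preds_reg)

# What a framework would report as "loss" for each model.
reported_plain = pure_plain  # no penalty term
reported_reg = pure_reg + l2_penalty(weights_reg, lam)

print(f"plain: pure={pure_plain:.4f} reported={reported_plain:.4f}")
print(f"reg:   pure={pure_reg:.4f} reported={reported_reg:.4f}")
```

In this toy case the regularized model wins on pure log loss but loses on the reported loss, which is exactly the trap described above.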
I couldn’t find a way to make Keras report pure log loss on the test data, which is what you need to compare models with different levels of L2 regularization, so I added a TensorFlow function as a custom metric (metrics=[tf.losses.log_loss]) and used it to visually compare model results.
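A rough sketch of the same idea for a current tf.keras install, which may differ from the setup I used. It assumes TensorFlow 2.x and swaps in the built-in tf.keras.metrics.BinaryCrossentropy metric instead of tf.losses.log_loss; the tiny model, the random data and the metric name pure_logloss are placeholders, not my real experiment.

```python
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 256 samples, 8 features, binary target.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8)).astype("float32")
y = (x[:, 0] > 0).astype("float32").reshape(-1, 1)

# Placeholder model with L2 regularization on one layer.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(
        16,
        activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(0.01),
    ),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    # Same quantity as the loss, but tracked as a metric,
    # so the regularization penalty is not added to it.
    metrics=[tf.keras.metrics.BinaryCrossentropy(name="pure_logloss")],
)
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

# Evaluate in a single batch so the averages line up exactly.
results = model.evaluate(
    x, y, batch_size=len(x), verbose=0, return_dict=True
)

# model.losses holds the regularization terms added to the loss.
penalty = sum(float(term) for term in model.losses)
gap = results["loss"] - results["pure_logloss"]

print("reported loss:", results["loss"])
print("pure log loss:", results["pure_logloss"])
print("L2 penalty:   ", penalty)
print("loss - metric:", gap)
```

The gap between the reported loss and the metric should match the L2 penalty up to float rounding, so the metric column is the one to compare across models with different regularization strengths.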
I’m writing this because I lost several days’ worth of compute to this. Whenever I saw that the model loss stayed above some threshold (0.2 vs my best pure model loss of 0.16), I interrupted the kernel, changed the model params and started all over.
Hope this helps someone.

But this affects only the log loss on the train set, right?

I mean, for the validation set the default metrics do not take L1/L2 regularization or dropout into account; it is only the train set, isn’t it?

That’s what I thought as well. But no. Dropout is not applied at test time, but the L1/L2 penalties are still added to the validation loss at test time. Which is correct and kind of obvious now. I just wish there was an easier way to display pure logloss, as logloss is not available as a metric.