Validation Loss vs. Accuracy

I thought validation loss had a direct relationship with accuracy, i.e. that lower validation loss always means higher accuracy. But while training a model, I saw this:
[screenshot of training output: validation loss decreasing while accuracy also decreases]

How is it possible? Why do we have lower validation loss but also lower accuracy?


It relates to the loss function. If we use mean squared error (MSE) as the loss, we optimise by reducing the average squared distance between our predictions and the true values, not by minimising misclassification (I’m assuming this is classification). You may get intuition about this from drawing decision boundaries between classes in something like the iris data set (http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html). You may also see how you could move a decision boundary and still have the same accuracy but a wider margin between classes, which is how loss can improve while accuracy stays the same; SVM and boosting examples often illustrate the max-margin idea. Playing with logistic regression in 2D, with and without outliers, may also help.
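A tiny sketch of that point (pure Python, made-up numbers): both prediction sets below classify every example correctly at a 0.5 threshold, so accuracy is identical, but the more confident set has a much lower MSE — loss improves while accuracy stays flat.

```python
# Two sets of probability-like predictions for the same binary labels.
labels  = [1, 0, 1, 0]
preds_a = [0.6, 0.4, 0.7, 0.3]    # barely on the right side of 0.5
preds_b = [0.9, 0.1, 0.95, 0.05]  # same decisions, wider margin

def accuracy(labels, preds, threshold=0.5):
    return sum((p > threshold) == bool(y) for y, p in zip(labels, preds)) / len(labels)

def mse(labels, preds):
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)

print(accuracy(labels, preds_a), accuracy(labels, preds_b))  # 1.0 and 1.0
print(mse(labels, preds_a), mse(labels, preds_b))            # 0.125 vs 0.00625
```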

Jeremy mentioned an F2 metric (I think that’s what was used; it was related to minimising false positives). There’s a set of metrics along those lines that focus on classification accuracy measures.

[edit] - was looking for a better visualization. Didn’t find one, but this post speaks to the topic a bit (https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/). Looking into the loss functions also gives some intuition about why we typically want relatively balanced class examples in training.

[edit 2] - re: Jeremy’s reply


Nice explanation. One nit:

No that’s a metric, not a loss function. We use cross entropy loss.


So should we be happy if the accuracy goes up even though the loss also rises?

epoch      trn_loss   val_loss   accuracy                      
    0      0.712461   1.174837   0.692503  
    1      0.606297   1.178657   0.694088                      
    2      0.528136   1.249662   0.700428  

Empirically, accuracy seems like quite a limited measure of the quality of predictions. To predict whether an example belongs to some class, our model outputs a number between 0 and 1 (whatever we put through sigmoid or softmax).

To calculate accuracy, we take some arbitrary threshold (0.5 by default): every prediction above it means the example belongs to the class, and every prediction below it means it doesn’t. This threshold of 0.5 gets dicey really fast if we don’t have perfectly balanced classes (50% positive and 50% negative examples) or if we have multiple classes.
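As a sketch of that thresholding step (the numbers and the helper name are made up), note how the same raw outputs give different hard predictions once you move the threshold:

```python
def predict_labels(probs, threshold=0.5):
    """Turn sigmoid-style outputs into hard 0/1 predictions."""
    return [1 if p > threshold else 0 for p in probs]

probs = [0.91, 0.52, 0.49, 0.03]
print(predict_labels(probs))                 # [1, 1, 0, 0]
print(predict_labels(probs, threshold=0.6))  # [1, 0, 0, 0] -- different threshold, different accuracy
```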

What happens when we have 90% of negative examples and 10% of positive examples? Is 91% accuracy good or bad?
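One way to see why 91% is not obviously good there: a degenerate model that always predicts the majority (negative) class, without ever looking at the input, already scores 90% accuracy. A sketch with made-up counts:

```python
# 90 negative examples, 10 positive -- heavily imbalanced.
labels = [0] * 90 + [1] * 10

# "Model" that ignores the input and always predicts negative.
always_negative = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(always_negative, labels)) / len(labels)
print(accuracy)  # 0.9 -- looks decent, yet the model finds zero positives
```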

The best interpretation of “accuracy goes up and loss goes up”, imho, is: “our model is getting better at doing well on accuracy with whatever threshold we set”.
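A toy illustration of accuracy and loss rising together (all numbers invented for the example): the second set of predictions gets one more example right at the 0.5 threshold, but a single confidently wrong prediction blows up the average binary cross-entropy.

```python
import math

def accuracy(labels, preds, threshold=0.5):
    return sum((p > threshold) == bool(y) for y, p in zip(labels, preds)) / len(labels)

def bce(labels, preds):
    """Binary cross-entropy, averaged over examples."""
    return -sum(math.log(p) if y else math.log(1 - p)
                for y, p in zip(labels, preds)) / len(labels)

labels = [1, 1, 1, 1, 0]
epoch1 = [0.55, 0.55, 0.45, 0.45, 0.45]   # hesitant predictions, 3/5 correct
epoch2 = [0.90, 0.90, 0.90, 0.90, 0.999]  # 4/5 correct, but confidently wrong on the last one

print(accuracy(labels, epoch1), bce(labels, epoch1))  # 0.6, ~0.68
print(accuracy(labels, epoch2), bce(labels, epoch2))  # 0.8, ~1.47
```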

There are other metrics that take the performance of our classifier at different thresholds into consideration, for example area under the ROC curve or mean average precision.
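For instance, ROC AUC needs no threshold at all; it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. A minimal rank-based sketch (function name and scores are made up; the pairwise formula is the Mann–Whitney U relationship):

```python
def roc_auc(labels, scores):
    """Threshold-free ranking metric: probability that a random positive
    example is scored above a random negative one (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

For real work you would use sklearn.metrics.roc_auc_score, which computes the same quantity efficiently from the ROC curve.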

Validation loss is nice in the sense that it measures how far our predictions are from what they should be before we put them through the threshold.


I’m getting a really weird sequence of numbers for validation loss vs. accuracy. While the training loss keeps getting smaller, the validation loss fluctuates a lot. At the same time, the quantity reported as “accuracy” (which I still don’t know the exact definition of) fluctuates in a small range. I’m training on a set of news articles; below is the output of the fastai built-in training loop, followed by the final classification results on the train, validation and test sets computed by the well-known scikit-learn classification_report function. As you can see, the accuracy at the very last epoch of training is reported as 0.567500, yet I get good precision and recall on all three sets.

Building the text classifier
cycles with big learning rate
epoch  train_loss  valid_loss  accuracy
1      0.601414    1.372827    0.560000                                                              
epoch  train_loss  valid_loss  accuracy
1      0.572330    2.343134    0.550000                                                              
epoch  train_loss  valid_loss  accuracy
1      0.625675    15.646263   0.587500                                                              
cycles with mid learning rate
epoch  train_loss  valid_loss  accuracy
1      0.565834    60.196205   0.600000                                                              
epoch  train_loss  valid_loss  accuracy
1      0.548722    58.771461   0.620000                                                              
epoch  train_loss  valid_loss  accuracy
1      0.551044    79.224930   0.583750                                                              
freeze to -2
epoch  train_loss  valid_loss  accuracy
1      0.542036    113.556290  0.587500                                                              
epoch  train_loss  valid_loss  accuracy
1      0.496392    78.955574   0.623750                                                              
epoch  train_loss  valid_loss  accuracy
1      0.469386    111.237091  0.611250                                                              
unfreeze and sliced learning rate
epoch  train_loss  valid_loss  accuracy
1      0.427161    86.719170   0.610000                                                              
epoch  train_loss  valid_loss  accuracy
1      0.435571    162.631317  0.628750                                                              
epoch  train_loss  valid_loss  accuracy
1      0.388422    103.925232  0.631250                                                              
freeze and final cycles for fine-tuning
epoch  train_loss  valid_loss  accuracy
1      0.390428    154.438812  0.610000                                                               
2      0.400485    4.032323    0.663750                                                               
3      0.406729    55.436405   0.607500                                                               
4      0.391934    12.530686   0.576250                                                               
5      0.337642    72.138474   0.606250                                                               
6      0.363133    50.401752   0.611250                                                               
7      0.400870    1.035045    0.577500                                                               
8      0.386779    17.284279   0.607500                                                               
9      0.404209    1.226067    0.648750                                                               
10     0.373572    8.231665    0.567500                                                                
saving the classifier...

Results on training data:
              precision    recall  f1-score   support                        

           0       0.88      0.92      0.90      1200
           1       0.92      0.88      0.90      1200

   micro avg       0.90      0.90      0.90      2400
   macro avg       0.90      0.90      0.90      2400
weighted avg       0.90      0.90      0.90      2400

Results on validation data:
              precision    recall  f1-score   support                        

           0       0.80      0.82      0.81       400
           1       0.82      0.80      0.81       400

   micro avg       0.81      0.81      0.81       800
   macro avg       0.81      0.81      0.81       800
weighted avg       0.81      0.81      0.81       800

Results on test data:
              precision    recall  f1-score   support                        

           0       0.78      0.84      0.81       400
           1       0.82      0.77      0.79       400

   micro avg       0.80      0.80      0.80       800
   macro avg       0.80      0.80      0.80       800
weighted avg       0.80      0.80      0.80       800

I found this article quite helpful.


Seems like your learning rate is very high. Try reducing it by a factor of 10 (or 100) and see if that helps.