Validation loss lower than training loss

I am redoing some experiments with the cats & dogs (redux) data, and I’ve been observing something a bit weird: my validation loss is often lower than my training loss (and, correspondingly, the validation accuracy is higher than the training accuracy). For example, here is a training run with five epochs at each learning rate, and validation stays ahead of training every step of the way:

[details=Click to see the output]
>>>>> regime at lr=0.009999999776482582
Epoch 1/5
718/718 [==============================] - 167s - loss: 0.5972 - acc: 0.9525 - val_loss: 0.3264 - val_acc: 0.9743
Epoch 2/5
718/718 [==============================] - 167s - loss: 0.5927 - acc: 0.9583 - val_loss: 0.3138 - val_acc: 0.9787
Epoch 3/5
718/718 [==============================] - 172s - loss: 0.5325 - acc: 0.9636 - val_loss: 0.3742 - val_acc: 0.9731
Epoch 4/5
718/718 [==============================] - 177s - loss: 0.5416 - acc: 0.9637 - val_loss: 0.3526 - val_acc: 0.9766
Epoch 5/5
718/718 [==============================] - 177s - loss: 0.5292 - acc: 0.9651 - val_loss: 0.3817 - val_acc: 0.9746
>>>>> regime at lr=0.0010000000474974513
Epoch 1/5
718/718 [==============================] - 177s - loss: 0.5069 - acc: 0.9662 - val_loss: 0.2572 - val_acc: 0.9822
Epoch 2/5
718/718 [==============================] - 178s - loss: 0.4951 - acc: 0.9675 - val_loss: 0.3179 - val_acc: 0.9776
Epoch 3/5
718/718 [==============================] - 178s - loss: 0.4664 - acc: 0.9687 - val_loss: 0.3260 - val_acc: 0.9773
Epoch 4/5
718/718 [==============================] - 178s - loss: 0.4775 - acc: 0.9685 - val_loss: 0.3465 - val_acc: 0.9771
Epoch 5/5
718/718 [==============================] - 175s - loss: 0.4629 - acc: 0.9691 - val_loss: 0.3090 - val_acc: 0.9787
>>>>> regime at lr=9.999999747378752e-05
Epoch 1/5
718/718 [==============================] - 175s - loss: 0.4386 - acc: 0.9706 - val_loss: 0.3539 - val_acc: 0.9766
Epoch 2/5
718/718 [==============================] - 175s - loss: 0.4599 - acc: 0.9695 - val_loss: 0.2984 - val_acc: 0.9807
Epoch 3/5
718/718 [==============================] - 174s - loss: 0.4480 - acc: 0.9701 - val_loss: 0.3146 - val_acc: 0.9787
Epoch 4/5
718/718 [==============================] - 171s - loss: 0.4522 - acc: 0.9697 - val_loss: 0.3461 - val_acc: 0.9776
Epoch 5/5
718/718 [==============================] - 175s - loss: 0.4581 - acc: 0.9694 - val_loss: 0.3307 - val_acc: 0.9783
>>>>> regime at lr=9.999999747378752e-06
Epoch 1/5
718/718 [==============================] - 175s - loss: 0.4445 - acc: 0.9706 - val_loss: 0.3168 - val_acc: 0.9792
Epoch 2/5
718/718 [==============================] - 174s - loss: 0.4496 - acc: 0.9696 - val_loss: 0.3562 - val_acc: 0.9766
Epoch 3/5
718/718 [==============================] - 165s - loss: 0.4329 - acc: 0.9710 - val_loss: 0.3510 - val_acc: 0.9766
Epoch 4/5
718/718 [==============================] - 165s - loss: 0.4505 - acc: 0.9700 - val_loss: 0.3160 - val_acc: 0.9792
Epoch 5/5
718/718 [==============================] - 165s - loss: 0.4451 - acc: 0.9703 - val_loss: 0.2921 - val_acc: 0.9807
>>>>> regime at lr=9.999999974752427e-07
Epoch 1/5
718/718 [==============================] - 165s - loss: 0.4405 - acc: 0.9699 - val_loss: 0.3076 - val_acc: 0.9792
Epoch 2/5
718/718 [==============================] - 165s - loss: 0.4344 - acc: 0.9706 - val_loss: 0.3304 - val_acc: 0.9778
Epoch 3/5
718/718 [==============================] - 165s - loss: 0.4531 - acc: 0.9693 - val_loss: 0.3190 - val_acc: 0.9787
Epoch 4/5
718/718 [==============================] - 165s - loss: 0.4383 - acc: 0.9706 - val_loss: 0.3365 - val_acc: 0.9776
Epoch 5/5
718/718 [==============================] - 165s - loss: 0.4646 - acc: 0.9690 - val_loss: 0.3964 - val_acc: 0.9736
[/details]

How should I interpret this? Does it just mean that the (randomly selected) validation set happened to end up with ‘easy’ examples (or at least easier-than-average ones)?

You probably have one or more dropout layers in your top-level network definition. Dropout layers are applied only during the training phase, not during validation. Consequently, the training loss can often be higher than the validation loss during the first epochs.
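
Your logs look like Keras 1, where this switch happens through the learning phase, but the behavior is the same in tf.keras; here is a minimal sketch (not your actual top model) showing that Dropout only acts when `training=True`:

```python
import numpy as np
import tensorflow as tf

x = np.ones((1, 8), dtype="float32")
dropout = tf.keras.layers.Dropout(0.5)

# Training phase: roughly half the activations are zeroed and the survivors are
# scaled by 1 / (1 - 0.5), so losses computed during fit() are "handicapped".
print(dropout(x, training=True).numpy())

# Validation/test phase: Dropout is an identity op, the full network is used.
print(dropout(x, training=False).numpy())
```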

From the Keras website:

Why is the training loss much higher than the testing loss?

A Keras model has two modes: training and testing. Regularization mechanisms, such as Dropout and L1/L2 weight regularization, are turned off at testing time.

Besides, the training loss is the average of the losses over each batch of training data. Because your model is changing over time, the loss over the first batches of an epoch is generally higher than over the last batches. On the other hand, the testing loss for an epoch is computed using the model as it is at the end of the epoch, resulting in a lower loss.
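
Both effects can be removed by re-scoring the training data at the end of each epoch, so that the training number, like val_loss, uses the end-of-epoch weights in inference mode. A minimal tf.keras sketch, assuming the model is compiled with a single accuracy metric; `train_data` and `val_data` are hypothetical stand-ins for your training and validation generators/datasets:

```python
import tensorflow as tf

class TrainSetEval(tf.keras.callbacks.Callback):
    """Report training loss/accuracy the same way val_loss/val_acc are reported:
    end-of-epoch weights, with dropout and other regularization in inference mode."""

    def __init__(self, train_data):
        super().__init__()
        self.train_data = train_data

    def on_epoch_end(self, epoch, logs=None):
        loss, acc = self.model.evaluate(self.train_data, verbose=0)
        print(f" - inference-mode train loss: {loss:.4f} - acc: {acc:.4f}")

# Hypothetical usage:
# model.fit(train_data, validation_data=val_data, epochs=5,
#           callbacks=[TrainSetEval(train_data)])
```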

Thanks @alexandrecc and @simoneva! Here is the relevant link to the Keras doc.

I guess it’s mainly due to Dropout being turned off at validation time, then, since I still observe the difference even once the metrics have stabilized.
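
A quick way to confirm that, and to rule out an unusually easy validation split, is to score both sets once with the same final weights, so dropout is off for both. A sketch under the same assumptions as above (`model`, `train_gen`, and `val_gen` are hypothetical names for the trained model and the two generators):

```python
# Same weights, same inference mode for both sets: any gap that remains
# reflects the data split itself, not dropout or the moving-average effect.
train_loss, train_acc = model.evaluate(train_gen, verbose=0)
val_loss, val_acc = model.evaluate(val_gen, verbose=0)
print(f"train: loss={train_loss:.4f}, acc={train_acc:.4f}")
print(f"  val: loss={val_loss:.4f}, acc={val_acc:.4f}")
```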