Doubts regarding training loss, validation loss and number of epochs

Hello everyone,

I’ve just started with the course.
After watching lesson 2, I was trying to come up with a model for classifying any given image as Volleyball, Beach Volleyball or Tennis based on which sport is being played in the image.

I tried out a few things with the learning rate, the number of epochs, and the model being used, and made a few observations.
Here are my questions:

  1. I ran learn.fit_one_cycle(4) and after completion of 4 epochs I observed that the training loss was still much higher than the validation loss, and the error_rate decreased monotonically across all 4 epochs. Does this always imply that I didn’t train my model enough, and should I increase the number of epochs?
    (As in the image below, with my assumption)

  2. I observed spikes in the validation loss and error_rate during training. (Image below)
    What does this imply? It can’t simply be that the model came across a particularly difficult batch of data during that epoch, since even in SGD we scan all the data points in every epoch. So what could be the reason for the spike?
    Is it the case that optimization across all the epochs does not always ensure improvement within any particular sequence of epochs?

  3. In general, changes in validation loss and error_rate during training should go hand in hand, right? I mean, if error_rate is decreasing between two epochs then validation loss should also decrease, and vice versa. However, I encountered cases where validation loss was increasing while error_rate was decreasing, and where validation loss was decreasing while error_rate was increasing.
    If the validation loss is increasing in subsequent epochs while the error_rate is decreasing, what does that imply? (Images below)
    In the case where validation loss increases while error_rate decreases, does it mean that our model is making fewer mistakes now, but the mistakes it does make are worse than the ones it was making previously? That is, are we going from a large number of tiny mistakes to a small number of blunders? If yes, is it a matter of concern, and what should be the way forward?

  4. In most of the training runs, I observed that the initial training loss is much higher than the validation loss. Since we normalize both losses by the number of data points in the training and validation sets respectively, and the validation set is chosen randomly, shouldn’t the initial losses be roughly the same for both?

  5. If the error rate stabilizes but the training loss is still much higher than the validation loss, what does that imply? Does it indicate a lack of training data?



Your dataset appears to be on the smaller side with 10 batches per epoch, so I wouldn’t be surprised if there was more noise in validation loss and metrics than a dataset with double the images (or more). That being said, your questions are still good questions to be asking.

In general, a larger training loss than validation loss means your model is underfitting the data, which can be resolved by training longer or picking a higher learning rate. Note that the opposite isn’t true: if the training loss is lower than the validation loss, but both are decreasing, the model is not overfitting.
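To see why "train longer" helps with underfitting, here's a toy sketch (plain gradient descent on a made-up one-parameter problem, not fastai code): stopping after a couple of epochs leaves the training loss high, and more epochs keep driving it down.

```python
# Toy illustration: fitting y = 2x with plain gradient descent.
# When a model is underfitting, more epochs keep lowering the training loss.

data = [(x, 2.0 * x) for x in range(1, 6)]  # ground truth: w = 2

def train(epochs, lr=0.01):
    w = 0.0  # start far from the true weight -> underfitting at first
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    # mean squared training loss at the end of training
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

loss_short = train(epochs=2)
loss_long = train(epochs=20)
print(loss_short, loss_long)  # training longer -> much lower training loss
```

The same reasoning doesn't transfer to overfitting, which is why the opposite direction (training loss below validation loss) needs a different diagnosis.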

Temporary spikes in validation loss are normal during training. It means the current local minimum doesn’t generalize as well as the local minimum at the end of the prior epoch. If the validation loss continues to increase, then you are probably overfitting the training data. But a temporary spike in validation loss that disappears with more training is normal.

In general, yes, validation loss and metrics will go hand in hand, but it depends on the loss vs. the metric. You are using error rate, which measures a binary right/wrong prediction, while Cross Entropy Loss (the default classification loss in fastai) measures the difference between the predicted probabilities and the true label. So you could be looking at a case where all the predicted probabilities moved closer to the true labels, but more images ended up barely wrong in probability terms, so the binary error rate got worse.
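Here's a tiny hand-made illustration of that divergence (the numbers are invented, not from your training run): between "before" and "after", the mean cross-entropy improves while the error rate gets worse, because one confident blunder is traded for two barely-wrong predictions.

```python
import math

def cross_entropy(probs):
    # mean negative log-likelihood of the true class
    return -sum(math.log(p) for p in probs) / len(probs)

def error_rate(probs):
    # fraction of examples where the predicted class (p >= 0.5 -> class 1) is wrong
    return sum(1 for p in probs if p < 0.5) / len(probs)

# predicted probability of the true class for 4 examples
before = [0.90, 0.90, 0.90, 0.05]  # one confident blunder
after  = [0.55, 0.55, 0.45, 0.45]  # two "barely wrong" predictions

print(cross_entropy(before), error_rate(before))  # ~0.828, 0.25
print(cross_entropy(after),  error_rate(after))   # ~0.698, 0.50
# loss went DOWN while error rate went UP
```

The reverse case (loss up, error rate down) can be constructed the same way, which is why the two don't have to move together epoch to epoch.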

For your other questions, it would depend on the metric you are looking at and why you are monitoring that metric. For example, you might have an imbalanced dataset where accuracy is increasing but the model is just learning to guess one class, so the recall score is low. That case would be a concern.
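A quick sketch of that imbalanced-dataset trap (hypothetical numbers): a degenerate model that always predicts the majority class scores high accuracy while its recall is zero.

```python
# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5
preds  = [0] * 100  # degenerate model: always guesses the majority class

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)

true_pos   = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
actual_pos = sum(labels)
recall = true_pos / actual_pos  # how many real positives we actually caught

print(accuracy, recall)  # 0.95, 0.0 -- accuracy looks great; recall exposes the problem
```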

No, the training set has transformations applied to it while the validation set does not. Assuming a representative validation set, there’s a larger difference between images in the training set with transformations applied than the validation set.

Rachel Thomas has a good article on choosing a validation set which I would recommend reading.

Same answer as before: in general, a larger training loss than validation loss means your model is underfitting the data, which can be resolved by training longer or picking a higher learning rate.

Another case could be that the transformations used are making the training dataset considerably harder than the validation dataset, but given you are probably using close to the defaults, I doubt that is the case here.

Hopefully that helps. Welcome to the forums. Keep asking questions if you have them.


Thanks for your time and explanations. :slight_smile:
I have a few more questions:

Could you please explain why the noise in the validation loss increases if we have a smaller dataset?

Apart from adjusting the learning rate, is this also a sign that I should review my model architecture and try increasing the number of parameters in my model, e.g. shifting from ResNet34 to ResNet50?

Okay, so depending upon the transformations applied, it might be the case that the validation set and training set images are very different, and so the initial losses need not be the same. But it could be the other way around too, right? I mean, the initial training loss could be much less than the initial validation loss in other cases. In general, a transformation can increase, decrease, or leave the training loss unchanged compared to no transformation.

Also, to be fair, shouldn’t we apply the same set of transformations to our validation and test sets that we applied to our training set before feeding them to our model?

Another question: if I have a batch size of 10 data points, a training dataset of 300 data points, and 5 epochs, then I would be calculating the training loss (300 / 10) × 5 = 150 times and the validation loss 5 times over the whole course of training, right?


It could increase the noise, but that doesn’t always happen. A smaller dataset could lead to more noise because a mislabeled example has more effect on training or validation, it might be more difficult to construct a representative validation set, or perhaps there’s not enough data for the model to generalize well.

I don’t think so. That’s usually something you experimentally explore to see if the increased training time of a larger model is worth the potential increase in accuracy.

Transformations, or data augmentation, are a good form of regularization for the model. In vision we apply transformations to an image by warping it, increasing or lowering the brightness, cutting out holes in the image, etc. Some of these transformations make an image harder to classify, which leads to increased loss due to the higher rate of misclassification on the transformed images.
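As a minimal sketch of the "training-only" part (a crude stand-in for fastai's brightness augmentation, with made-up numbers, and a 1-D pixel list standing in for an image): the random jitter is applied to the training item while the validation item passes through untouched.

```python
import random

def brightness_jitter(img, rng, max_change=0.4):
    # randomly brighten or darken every pixel, clipped to [0, 1]
    # (a toy stand-in for a real brightness transform)
    scale = 1.0 + rng.uniform(-max_change, max_change)
    return [min(1.0, max(0.0, px * scale)) for px in img]

rng = random.Random(0)
train_img = [0.2, 0.5, 0.8]  # a tiny fake "image" of pixel intensities
valid_img = [0.2, 0.5, 0.8]

train_item = brightness_jitter(train_img, rng)  # augmentation: training only
valid_item = valid_img                          # validation: left untouched

print(train_item, valid_item)  # the training item differs; the validation item doesn't
```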

Applying transformations to test or validation images is called test time augmentation and fastai does support it. But it’s used for prediction not during training.

Yes. The training loss is calculated for each batch and is used to update the model weights via gradient descent, while the validation loss by default is only calculated at the end of each epoch. Note that fastai records the smoothed training loss.
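Your arithmetic can be sketched directly, along with the kind of exponential smoothing fastai applies to the reported training loss (the beta=0.98 default is my assumption here, not verified against the library):

```python
dataset_size, batch_size, epochs = 300, 10, 5

steps_per_epoch = dataset_size // batch_size      # 30 training-loss calculations per epoch
training_loss_updates = steps_per_epoch * epochs  # 150 over the whole run
validation_loss_updates = epochs                  # once at the end of each epoch

print(training_loss_updates, validation_loss_updates)  # 150, 5

# Exponentially smoothed training loss, similar in spirit to what fastai reports
# (beta=0.98 is assumed, not checked against the source).
def smoothed_losses(raw_losses, beta=0.98):
    avg, out = 0.0, []
    for i, loss in enumerate(raw_losses, start=1):
        avg = beta * avg + (1 - beta) * loss
        out.append(avg / (1 - beta ** i))  # bias-correct the running average
    return out

raw = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]  # noisy per-batch losses
print(smoothed_losses(raw)[-1])       # hovers near the mean (~2) instead of jumping around
```

The smoothing is why the printed training loss looks steadier than the per-batch losses actually are.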


Okay. Thank you so much. :slight_smile:

Shifting from ResNet34 to ResNet50 could worsen the problem, as it could lead to overfitting, i.e. trying to fit more parameters with less data. In my opinion, appropriate augmentation and ResNet18 should be tried instead.


Thank you! :pray:t4: