Learn.validate() mismatching my last epoch loss


I’ve trained a network (with bad results, but that’s beside the point here :wink: ). As a consistency check while debugging, I ran predict() on the training set and was surprised to get not-so-good results. To analyze further, I ran learn.validate() on the training set and discovered that my last epoch’s training loss was nowhere near the value learn.validate() returns on that same training set:

my last epoch loss value:

epoch  train_loss  valid_loss  time
239    0.003875    0.672104    02:29

just after the fit_one_cycle() call, I saved the model:
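(The original snippet was lost from the post; a minimal sketch of what the save call presumably looked like — the name 'after-fit' is a placeholder, not from the original:)

```python
# save the weights right after training; the file name is a placeholder
learn.save('after-fit')
```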


now loading it back (without changing the data at all):
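(Again the snippet was lost; presumably something along these lines, with the placeholder name from above and fastai v1’s validate, which takes an optional dataloader:)

```python
# reload the saved weights into the same Learner,
# then run the validation loop over the *training* dataloader
learn = learn.load('after-fit')
learn.validate(dl=learn.data.train_dl)
```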


returns 0.13376646, which is very far from 0.003875.
I understand that during training the training set is shuffled and the weights keep changing through backpropagation, but the difference seems quite huge?

With the validset it seems OK though:
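(Snippet lost here too; by default validate runs over the validation dataloader, so this was presumably just:)

```python
# no dl argument: defaults to the validation dataloader
learn.validate()
```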


which returns 0.6721038, quite close to the reported valid_loss.

Maybe something is different after learn.load(), but I don’t see why. And I wouldn’t expect such a huge difference.

Can someone shed some light on why I get a much worse loss on the training set?

During training you also have weight decay, dropout, batchnorm and such. There are quite a few things that behave differently when you run in inference mode.

For this experiment I don’t have dropout, and batch_size = 1.

Can you reproduce it with less data? I am also very interested in why this might be happening so I bookmarked this :slight_smile:

The recorder stores the smoothed loss for the training dataset, not the average loss over the entire epoch, which is what validate returns. Running validate on the validation dataset will return the same loss and metrics as the last epoch.
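For reference, here is a sketch of the kind of smoothed loss a recorder typically reports: a debiased exponential moving average over batches (fastai v1 uses something like this; the beta = 0.98 constant here is an assumption). Because late batches dominate the average, the displayed train_loss can sit far below the plain epoch mean when the loss is still falling.

```python
def smoothed_losses(losses, beta=0.98):
    """Debiased exponential moving average of per-batch losses.
    Recent batches dominate, so the value shown at the end of an
    epoch is NOT the plain mean over that epoch."""
    mov_avg, out = 0.0, []
    for i, loss in enumerate(losses, start=1):
        mov_avg = beta * mov_avg + (1 - beta) * loss
        out.append(mov_avg / (1 - beta ** i))  # debias the early steps
    return out

# a loss that decays during the epoch: the smoothed value tracks the
# most recent batches, while the plain epoch mean sits much higher
losses = [1.0 / i for i in range(1, 201)]
print(smoothed_losses(losses)[-1])   # small: tracks the last ~50 batches
print(sum(losses) / len(losses))     # plain epoch mean, roughly 0.029
```

This is why a train_loss of 0.003875 at the end of an epoch is not directly comparable to the epoch-wide average that validate computes.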

If you run learn.validate(dl=learn.data.train_dl) twice, you will get two different results for loss and metrics. I think part of the difference is that train_dl applies the random transforms you trained with during validate, unless you set them to be the same as the validation transforms [1]:

learn.data.train_ds.tfms = learn.data.valid_ds.tfms

This reduced the variance in validate on the training dataset when I tested it today with imagenette, but didn’t eliminate it. I haven’t figured out why this is happening yet.

  1. If you used presize on images, you might need to remove it too.


In this case I don’t have any transforms either.

I’ll take a look at the recorder, but I still have a hard time understanding what could make the loss almost two (!!??) orders of magnitude different, not just a few percent off.

I don’t have any fancy stuff: no data augmentation, no dropout, just BN, but with bs=1.
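BN with bs=1 alone could plausibly explain a large gap. In train mode, batchnorm normalizes each activation with the current batch’s statistics; with a single sample per batch, the batch mean equals the sample itself, so the normalized value collapses to 0 and the output is just the learned shift. In eval mode, the running statistics are used instead. A pure-Python sketch of the one-feature, batch-size-1 case (eps is the usual PyTorch default; the running-stat values are made up for illustration):

```python
import math

def bn_train_bs1(x, gamma=1.0, beta=0.0, eps=1e-5):
    # batch of one sample: batch mean == x, batch variance == 0,
    # so the normalized value is 0 and the output is just beta
    mean, var = x, 0.0
    return gamma * (x - mean) / math.sqrt(var + eps) + beta

def bn_eval(x, running_mean, running_var, gamma=1.0, beta=0.0, eps=1e-5):
    # eval mode normalizes with the accumulated running statistics
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta

x = 3.0
print(bn_train_bs1(x))                                # 0.0, whatever x is
print(bn_eval(x, running_mean=1.0, running_var=4.0))  # roughly 1.0
```

So the network saw very differently normalized activations during training than validate feeds it afterwards, which could account for part of the discrepancy.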

If you can reproduce it with a single image in the training dataset and a single image in valid, that would make it much easier to debug.