Why is training accuracy per epoch so much higher than the validation accuracy?

Hi,
I wanted to investigate how the average training loss displayed by the fit function relates to accuracy and loss per epoch. For this comparison I used the seedlings dataset, with 80% of the data in the train folder and 20% in a valid folder.

To calculate the training accuracy on this I first ran 5 epochs of learn.fit() with the training and validation set as

arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), trn_name="train", val_name="valid")
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 5)

The output from fit, with the validation set containing 20% of the original data, is

epoch      trn_loss   val_loss   accuracy                                      
    0      1.496924   1.013778   0.642708  
    1      1.051771   0.805401   0.727083                                      
    2      0.831304   0.726356   0.747917                                      
    3      0.729012   0.672901   0.76875                                       
    4      0.644722   0.680163   0.758333 

I then set val_name="train" so that both the training and validation sets were the same, to investigate the training loss and accuracy at the end of each epoch, as

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), trn_name="train", val_name="train")
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 5)

The output from fit when setting the validation set to be the same as the training set is

epoch      trn_loss   val_loss   accuracy                                      
    0      1.534358   0.783512   0.757999  
    1      1.082118   0.563303   0.833817                                      
    2      0.866195   0.453154   0.870945                                      
    3      0.73888    0.387739   0.877195                                      
    4      0.655201   0.339189   0.901414 

I think I have an error in my understanding, because the accuracy reported when val_name="train" is much higher (and the loss much lower) than on the real validation set for every corresponding epoch.

I know the results at epoch 0 for the two runs will differ somewhat, but it looks like the training accuracy at epoch 0 is ~76% while the validation accuracy is ~64%. Am I missing something?

I alternatively tried running a single epoch as

arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), trn_name="train", val_name="valid")
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 1)

with the following output

epoch      trn_loss   val_loss   accuracy                                      
    0      1.504691   1.008132   0.65  

and then tried to calculate the accuracy on the training set as

log_preds, y = predict_with_targs(learn.model, learn.data.trn_dl)  # log-probabilities and targets for the training set
probs = np.exp(log_preds)          # convert log-probabilities to probabilities
preds = np.argmax(probs, axis=1)   # predicted class per sample
sum(preds == y) / y.size           # fraction correct

again resulting in ~76%.
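For reference, the argmax-and-compare accuracy calculation above can be sanity-checked on toy arrays (the numbers below are made up purely for illustration):

```python
import numpy as np

# Made-up log-probabilities for 4 samples over 3 classes
log_preds = np.log(np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]))
y = np.array([0, 1, 1, 0])         # true labels

probs = np.exp(log_preds)          # log-probabilities back to probabilities
preds = np.argmax(probs, axis=1)   # highest-probability class per sample
acc = (preds == y).mean()          # fraction predicted correctly
print(acc)  # 0.75: the third sample predicts class 2 but its label is 1
```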

I think I have an error in either my understanding of the training and validation loss and/or how they are calculated. Can anyone point me in the right direction?

Thank you


It might have to do with dropout being applied during training but not during validation.

Normally, dropout is not applied to the validation set.
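As a rough illustration of why this matters, here is a plain numpy sketch of inverted dropout (the scheme most frameworks use; this is not fastai code): at train time each unit is zeroed with probability p and the survivors are scaled by 1/(1-p), so activations are noisy even though their expectation matches eval mode.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p, training):
    """Inverted dropout: at train time, zero each unit with probability p
    and scale the survivors by 1/(1-p); at eval time, pass through unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
train_out = dropout_forward(x, p=0.5, training=True)
eval_out = dropout_forward(x, p=0.5, training=False)

print(train_out.mean())        # ≈ 1.0 in expectation, but noisy
print((train_out == 0).mean()) # roughly half the units were zeroed
print(eval_out.mean())         # exactly 1.0, deterministic
```

Because roughly half the units are zeroed on any given training pass, the per-batch losses that feed into trn_loss are computed on a noisier network than the one used for validation, which is consistent with the gap you measured.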

Thanks Dave, that makes perfect sense. Do you know if there is an easy way to calculate predictions without removing dropout?

Not sure, you would probably have to dig through the code to see if there is a way.
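For what it's worth, here is a toy numpy sketch (a fake one-layer "model", not the fastai API) of what predicting with dropout left on looks like: a single stochastic pass is noisy, but averaging many such passes approximately recovers the deterministic eval-mode prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_predict(x, w, p=0.5):
    """One forward pass of a toy linear 'model' with inverted dropout
    left ON at prediction time (a stand-in for a real network)."""
    mask = rng.random(x.shape) >= p
    return (x * mask / (1 - p)) @ w

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])

# Averaging many stochastic passes approximately recovers the
# deterministic eval-mode prediction x @ w
runs = np.array([noisy_predict(x, w) for _ in range(5_000)])
print(x @ w)        # deterministic prediction: 0.4
print(runs.mean())  # ≈ 0.4, with residual noise
```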