Why is training accuracy per epoch so much higher than the validation accuracy?

Hi,
I wanted to investigate how the average training loss displayed by the fit function relates to accuracy and loss per epoch. For this comparison I used the seedlings dataset, with 80% of the data in the train folder and 20% in a valid folder.

To calculate the training accuracy on this I first ran 5 epochs of learn.fit() with the training and validation set as

arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), trn_name="train", val_name="valid")
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 5)

The output from fit, with the validation set containing 20% of the original data, is

epoch      trn_loss   val_loss   accuracy                                      
    0      1.496924   1.013778   0.642708  
    1      1.051771   0.805401   0.727083                                      
    2      0.831304   0.726356   0.747917                                      
    3      0.729012   0.672901   0.76875                                       
    4      0.644722   0.680163   0.758333 

I then set val_name="train" so that both the training and validation sets were the same, to investigate the training loss and accuracy at the end of each epoch, as

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), trn_name="train", val_name="train")
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 5)

The output from fit when setting the validation set to be the same as the training set is

epoch      trn_loss   val_loss   accuracy                                      
    0      1.534358   0.783512   0.757999  
    1      1.082118   0.563303   0.833817                                      
    2      0.866195   0.453154   0.870945                                      
    3      0.73888    0.387739   0.877195                                      
    4      0.655201   0.339189   0.901414 

I think I have an error in my understanding, because the accuracy reported when val_name="train" is much higher (and the loss much lower) than on the real validation set for every corresponding epoch.

I know the results at epoch 0 for the two runs will differ somewhat, but it looks like the training accuracy at epoch 0 is ~76% while the validation accuracy is ~64%. Am I missing something?

I alternatively tried running a single epoch as

arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz), trn_name="train", val_name="valid")
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 1)

with the following output

epoch      trn_loss   val_loss   accuracy                                      
    0      1.504691   1.008132   0.65  

and then tried to calculate the accuracy on the training set as

log_preds, y = predict_with_targs(learn.model, learn.data.trn_dl)  # log-probabilities and targets for the training set
probs = np.exp(log_preds)          # convert log-probabilities to probabilities
preds = np.argmax(probs, axis=1)   # predicted class per sample
sum(preds == y) / y.size           # fraction correct

again resulting in ~76%.
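For reference, the argmax-and-compare accuracy calculation above can be sanity-checked on toy arrays (the numbers below are made up purely for illustration):

```python
import numpy as np

# Made-up log-probabilities for 4 samples over 3 classes
log_preds = np.log(np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
]))
y = np.array([0, 1, 1, 0])         # true labels

probs = np.exp(log_preds)          # log-probabilities back to probabilities
preds = np.argmax(probs, axis=1)   # highest-probability class per sample
acc = (preds == y).mean()          # fraction predicted correctly
print(acc)  # 0.75: the third sample predicts class 2 but its label is 1
```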

I think I have an error in either my understanding of the training and validation loss and/or how they are calculated. Can anyone point me in the right direction?

Thank you


It might have to do with dropout being applied during training but not during validation.

Normally, dropout is not applied to the validation set.
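As a rough illustration of why this matters, here is a plain numpy sketch of inverted dropout (the scheme most frameworks use; this is not fastai code): at train time each unit is zeroed with probability p and the survivors are scaled by 1/(1-p), so activations are noisy even though their expectation matches eval mode.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p, training):
    """Inverted dropout: at train time, zero each unit with probability p
    and scale the survivors by 1/(1-p); at eval time, pass through unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones(10_000)
train_out = dropout_forward(x, p=0.5, training=True)
eval_out = dropout_forward(x, p=0.5, training=False)

print(train_out.mean())        # ≈ 1.0 in expectation, but noisy
print((train_out == 0).mean()) # roughly half the units were zeroed
print(eval_out.mean())         # exactly 1.0, deterministic
```

Because roughly half the units are zeroed on any given training pass, the per-batch losses that feed into trn_loss are computed on a noisier network than the one used for validation, which is consistent with the gap you measured.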

Thanks Dave, that makes perfect sense. Do you know if there is an easy way to calculate predictions without removing dropout?

Not sure, you would probably have to dig through the code to see if there is a way.
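For what it's worth, here is a toy numpy sketch (a fake one-layer "model", not the fastai API) of what predicting with dropout left on looks like: a single stochastic pass is noisy, but averaging many such passes approximately recovers the deterministic eval-mode prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_predict(x, w, p=0.5):
    """One forward pass of a toy linear 'model' with inverted dropout
    left ON at prediction time (a stand-in for a real network)."""
    mask = rng.random(x.shape) >= p
    return (x * mask / (1 - p)) @ w

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])

# Averaging many stochastic passes approximately recovers the
# deterministic eval-mode prediction x @ w
runs = np.array([noisy_predict(x, w) for _ in range(5_000)])
print(x @ w)        # deterministic prediction: 0.4
print(runs.mean())  # ≈ 0.4, with residual noise
```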