Validation error lower than training error

I noticed that in Lesson 1 the validation losses were generally lower than the training losses. Wouldn’t we expect the opposite? Shouldn’t the model fit the training data (which it has seen) better than the validation data (which it hasn’t)?

Thanks


Is the training error calculated after each iteration or after the epoch? If it’s after each iteration, what you observed happens almost always at the beginning of training: the validation error is evaluated only at the end of the epoch, whereas the training error is calculated during the epoch, while the model is still improving.
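
A toy illustration with made-up numbers: the reported training loss is an average over batches taken while the model is still learning, whereas the validation loss is measured once with the latest (end-of-epoch) weights, so early on it can easily come out lower.

# Hypothetical per-batch training losses within one epoch: the loss drops
# quickly as the model learns, so the epoch average is pulled up by the
# early batches.
batch_train_losses = [2.0, 1.5, 1.1, 0.8, 0.6]
reported_train_loss = sum(batch_train_losses) / len(batch_train_losses)   # 1.2

# The validation loss is computed only at the end of the epoch, with the
# latest weights, so it can be lower than the reported training loss.
end_of_epoch_val_loss = 0.7                                               # hypothetical

print(reported_train_loss, end_of_epoch_val_loss)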

Absolutely, we would typically expect the training loss to be higher than the validation loss. The reason it is different here is that the validation set does not use dropout. Diving into the code in model.py, we can look at fit (where all the action happens): model_stepper.reset(True) puts the model in training mode (i.e. dropout is enabled). However, when we get round to the validation, stepper.reset(False) inside validate puts the model in eval mode and disables dropout:
def validate(stepper, dl, metrics):
    batch_cnts,loss,res = [],[],[]
    stepper.reset(False)
    with no_grad_context():
        for (*x,y) in iter(dl):
            preds, l = stepper.evaluate(VV(x), VV(y))
            if isinstance(x,list): batch_cnts.append(len(x[0]))
            else: batch_cnts.append(len(x))
            loss.append(to_np(l))
            res.append([f(preds.data, y) for f in metrics])
    return [np.average(loss, 0, weights=batch_cnts)] + list(np.average(np.stack(res), 0, weights=batch_cnts))
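
To see the dropout effect in isolation, here is a rough standalone sketch in plain PyTorch (not the fastai code above): the same model and batch give different losses in the two modes, because dropout only zeroes activations in training mode, and on a trained network that usually pushes the train-mode loss higher.

import torch
import torch.nn as nn

torch.manual_seed(0)
# A small throwaway model with a dropout layer and a random batch of data.
model = nn.Sequential(nn.Linear(20, 100), nn.ReLU(), nn.Dropout(0.5), nn.Linear(100, 1))
x, y = torch.randn(64, 20), torch.randn(64, 1)
loss_fn = nn.MSELoss()

model.train()                 # training mode: dropout active (what the training loss sees)
with torch.no_grad():
    train_mode_loss = loss_fn(model(x), y).item()

model.eval()                  # eval mode: dropout disabled (what validate() sees)
with torch.no_grad():
    eval_mode_loss = loss_fn(model(x), y).item()

print(train_mode_loss, eval_mode_loss)   # the two numbers differ because of dropout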

The training error is a rolling average:

        loss = model_stepper.step(V(x),V(y), epoch)
        avg_loss = avg_loss * avg_mom + loss * (1-avg_mom)
        debias_loss = avg_loss / (1 - avg_mom**batch_num)
        t.set_postfix(loss=debias_loss)

avg_mom is a fixed constant of 0.98.
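
To make the smoothing concrete, here is a tiny standalone sketch with made-up per-batch losses, mirroring the variable names in the snippet above:

avg_mom, avg_loss = 0.98, 0.0
batch_losses = [2.0, 1.8, 1.7, 1.5, 1.4]            # hypothetical per-batch losses

for batch_num, loss in enumerate(batch_losses, start=1):
    # Exponentially weighted moving average of the batch losses...
    avg_loss = avg_loss * avg_mom + loss * (1 - avg_mom)
    # ...with a bias correction for the zero initialisation of avg_loss.
    debias_loss = avg_loss / (1 - avg_mom ** batch_num)
    print(batch_num, round(debias_loss, 4))          # the value shown in the progress bar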

Hope this helps! It’s worth looking at the source code; the fit method is a little less readable than it used to be, but it is all still pretty accessible.


The dropout explains it, thanks

Very nice answer. I have one question though: in his lessons, Jeremy seems to seek a model where the displayed training loss (calculated with dropout) equals the displayed validation loss (calculated without dropout). When that is the case, he calls the model “not underfitted” and “not overfitted”.
Shouldn’t we call the model a good fit only when the validation loss without dropout equals the training loss without dropout too? Because only in that case are we comparing comparable things.

In its strictest sense, this would be the proper test for over/underfitting. In practice, though, a ‘good fit’ is whatever model has the best validation score, and the best validation score for NNs is often produced by a model that is quite overfitted.

I’m not quite sure whether this actually produces optimal results or is just a guideline to aim for; I would postulate the latter. Hope this is of some help!
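
If you do want to run the strict comparison, something like the following sketch works in plain PyTorch (model, loss_fn, train_loader and valid_loader are assumed to already exist): compute the training-set loss with the model in eval mode, so that both numbers are dropout-free.

import torch

def mean_loss(model, loader, loss_fn):
    """Average per-example loss over a DataLoader, with dropout disabled."""
    model.eval()                                    # eval mode, like validate() above
    total, count = 0.0, 0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item() * len(y)   # assumes loss_fn averages per batch
            count += len(y)
    return total / count

# train_no_dropout = mean_loss(model, train_loader, loss_fn)
# valid_no_dropout = mean_loss(model, valid_loader, loss_fn)
# Overfitting in the strict sense: train_no_dropout is much lower than valid_no_dropout.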


Hi @sjdlloyd, @karandwivedi42 and @agielchinsky,

I am still observing this behavior in fastai version 2.4.

Is this still the case?

It is still the case. What are you observing?
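
One way to check it yourself in fastai 2.x (a sketch, assuming learn is an already-fitted Learner): Learner.validate runs the model in eval mode, so passing it the training DataLoader gives a training loss computed without dropout, which you can then compare with the usual validation loss.

# Training-set loss (and metrics) with the model in eval mode, i.e. no dropout.
# Note: the training DataLoader may still apply data augmentation, so this is
# close to, but not perfectly, a like-for-like comparison.
print(learn.validate(dl=learn.dls.train))

# The usual validation loss/metrics for reference.
print(learn.validate(dl=learn.dls.valid))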