Reporting model accuracy using fast.ai neural nets -- why aren't folks reporting test accuracy?

It seems that the ‘gold-standard’ in publishing one’s results for neural networks is to provide accuracy ratings based on a ‘test’ set of data.

Many, if not all, of the classification examples using fast.ai never report accuracy on a ‘test’ set. Rather, the fast.ai examples report the best validation accuracy. Based on the generic train-validate-test workflow for machine learning, it seems that these examples may be showing inflated model accuracy, since the ‘test’ step is never carried out.

However, I’m a bit confused when I read the source code for the validation accuracy, as it seems that it is indeed making ‘preds’ and comparing them against a ‘targ’ (target).

Thus, is fast.ai reporting a defensible ‘accuracy’ metric without taking the extra step of evaluating the model on a test sample?


I have reported test accuracy on a test set I make. When we train, we don’t want to look at it until the very end, so we avoid test-set bias. Generally I’ll keep that data separate, make a DataBunch with it, override the validation dataloader, and go from there.
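In rough code, that looks something like the sketch below (fastai v1 API assumed; `path_test` and `learn` are placeholder names for a labeled held-out folder and an already-trained Learner):

```python
from fastai.vision import *

# Sketch only: fastai v1 API assumed. `path_test` is a hypothetical folder of
# labeled, held-out images (one sub-folder per class); `learn` is a trained Learner.
test_data = (ImageList.from_folder(path_test)
             .split_none()                   # keep everything together; no further split
             .label_from_folder()
             .databunch(bs=64)
             .normalize(imagenet_stats))     # same normalization stats as training

# Evaluate on the held-out data by handing its dataloader to validate()
# (equivalently, you could override learn.data.valid_dl first, as described above).
# Note: train_dl shuffles and may drop a final partial batch; fine for a rough check.
loss, acc = learn.validate(test_data.train_dl)   # assumes accuracy is the only metric
print('held-out test loss:', loss, 'accuracy:', float(acc))
```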

You can read more here; let me know if any of that is confusing (I’m on mobile rn)


Thanks Zachary.

Your answer makes sense in the train-validate-test framework. It just seems that it should be part of the workflow for all of the examples, no? Otherwise all of the impressive accuracy reports may not be representative. Maybe I’m not thinking about this in the right way.

I tried looking at your notebook and GitHub throws an error stating “Sorry, something went wrong. Reload?” Hopefully this is just a temporary server hiccup.

This often happens on GitHub.
Use nbviewer:


An unbiased, independent validation set drawn from the training data will give an accurate assessment of the trained model, provided that you do not use the validation score to guide training and that you apply the same training procedure to the models being compared.

The danger here is that you will tune the model (e.g. hyperparameters) by tracking the validation score, knowingly or innocently. Then the model might perform well on that particular validation set, but not so well on a completely unseen test set.* Thus the call for a test set in addition to validation to rule out such biases.

To calculate validation accuracy, you must compare its predictions with the correct targets - that’s what accuracy means. But the targets are not used to generate the predictions. What exactly is your confusion with the source code?
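In essence, the accuracy metric boils down to something like this (a paraphrase of the fastai v1 `accuracy` function, not its exact source):

```python
import torch

# `preds` are the model's raw outputs for a validation batch; `targs` are the true labels.
def accuracy(preds: torch.Tensor, targs: torch.Tensor) -> torch.Tensor:
    n = targs.shape[0]
    preds = preds.argmax(dim=-1).view(n, -1)   # predicted class for each example
    targs = targs.view(n, -1)
    return (preds == targs).float().mean()     # targets enter only in this comparison
```

The predictions are made from the inputs alone; the targets come in only afterwards, to score them.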

*This issue shows up in Kaggle competitions, where you can probe their unseen test set with multiple submissions and tune your model from the scores. I was once “innocently guilty” of this practice, and paid the price when my model was finally assessed on the truly unseen, private test set.


Hi Malcolm,
Thanks for your insight. I agree with your stance on an unbiased validation set – that makes sense.

Say you did little if any hyperparameter tuning during the model-development process. Do you have any practical experience with how much of a performance hit is possible on the test set due to ‘innocent’ foresight from the validation scores? I understand the theoretical implication of bias, but am now wondering if this small amount of tuning really moves the needle.

With regard to the validation accuracy, my confusion with the source code arose from some ambiguity over which data were being used to generate predictions. I wanted to ensure no data ‘leakage’ occurred between the training and validation sets, thereby introducing bias. I.e., does fast.ai handle the validation and training sets in a way that justifies the ‘independent and unbiased’ confidence in the validation accuracy metric?

Do you have any practical experience with how much of a performance hit is possible on the test set due to ‘innocent’ foresight from the validation scores? I understand the theoretical implication of bias, but am now wondering if this small amount of tuning really moves the needle.

If knowing the validation loss does not affect your model or training, then it’s benign. I’d say this is a bias to be aware of if you use validation loss to compare models. As for the amount, I was able to jump up 40 places on the Kaggle leaderboard by adjusting my model to improve the test score. But that improvement did not hold for a different test set. In the end, you’ll have to gain your sense of training bias from your own experience.

I wanted to ensure no data ‘leakage’ occurred between the training and validation sets, thereby introducing bias. I.e., does fast.ai handle the validation and training sets in a way that justifies the ‘independent and unbiased’ confidence in the validation accuracy metric?

IMHO, there are two choices. 1) Trust that the authors of fastai have handled this issue correctly. 2) Delve into the library code and confirm for yourself that it is handled right.

That said, as a user of fastai, you have to beware of leakage between training and validation that you yourself create, not leakage from the library. Rachel covered some of the issues in her excellent blog post at

https://www.fast.ai/2017/11/13/validation-sets/
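One common example from that post of leakage you can create yourself is randomly splitting time-ordered data. A minimal sketch of the safer alternative (the file and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical time-ordered dataset; 'date' is an assumed column name.
df = pd.read_csv('data.csv').sort_values('date')

# Risky: a purely random split can put "future" rows into validation,
# letting the model train on information it would not have at prediction time.
# Safer for time-ordered data: hold out the most recent 20% of rows as validation.
cutoff = int(len(df) * 0.8)
train_df, valid_df = df.iloc[:cutoff], df.iloc[cutoff:]
```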
