Why is test F1 higher than validation F1?

I’m training an ULMFiT text classification model to detect tweet stance. I have 1000 labeled tweets, which I split into 700 train, 150 validation, and 150 test using scikit-learn’s train_test_split, making sure to stratify. This is a single-label multiclass task with 3 labels: support, oppose, and unclear, distributed roughly 42/55/3 percent. I have also implemented a custom F1 metric (higher is better) based on scikit-learn’s f1_score (I have compared this to FBeta(average='macro', beta=1) and the results are the same). While training, the validation F1 score tends to top out around 0.65, but when testing, the F1 score is around 0.72. It strikes me as odd that test F1 is so much higher than validation F1, since it is usually the exact opposite situation, and I see two potential reasons for this:

  1. Because validation and test are only 150 tweets each, there is a lot of random variance, and the test set just happens to be easier than the validation set
  2. Dropout is not turned off when evaluating validation metrics at the end of every epoch, which would make the reported validation F1 artificially low
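For context, here is roughly what my split and metric look like (the dummy tweets, seed, and exact class draws below are made up for illustration; the real data is my labeled tweet set):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Dummy stand-in for the 1000 labeled tweets
# (0 = support, 1 = oppose, 2 = unclear, roughly 42/55/3 percent).
rng = np.random.default_rng(0)
texts = np.array([f"tweet {i}" for i in range(1000)])
labels = rng.choice(3, size=1000, p=[0.42, 0.55, 0.03])

# Two-stage stratified split: 700 train / 150 validation / 150 test.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, train_size=700, stratify=labels, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=150, stratify=y_rest, random_state=42)

def macro_f1(preds, targets):
    # Same quantity as fastai's FBeta(average='macro', beta=1).
    return f1_score(targets, preds, average="macro")

print(len(X_train), len(X_valid), len(X_test))  # 700 150 150
```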

Obviously, I hope that theory 2 is correct, but I think 1 is more plausible. Any thoughts?
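On theory 2, a quick way to check whether dropout is actually active at evaluation time is to run the same input through the model twice in each mode: if the two outputs differ, dropout (or some other stochastic layer) is still on. A minimal PyTorch sketch with a toy model (not the actual ULMFiT network):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Toy classifier head; any nn.Dropout layer behaves the same way.
model = nn.Sequential(
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 3))
x = torch.randn(8, 64)

with torch.no_grad():
    model.train()   # dropout active: repeated passes give different outputs
    train_differs = not torch.allclose(model(x), model(x))

    model.eval()    # dropout disabled: repeated passes are deterministic
    eval_matches = torch.allclose(model(x), model(x))

print(train_differs, eval_matches)
```

If the validation loop leaves the model in train mode, the eval-mode check above would fail and the reported validation metric would be noisy and pessimistic.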

Well, I now know that something weird is happening; the actual answer may be a mix of 1 and 2. I tried loading the model back in and predicting the validation set, and that came out to 0.7 F1, which is much higher than the validation F1 scores I was getting during training. I will research more into how fast.ai calculates validation metrics during training.
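One hypothesis worth checking (I'm not yet sure this is what fast.ai does): if the training loop computes the metric per batch and then averages the batch values, the result can differ from F1 computed over the whole validation set in one go, because F1 does not decompose over batches. A deterministic toy demonstration:

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]

# Macro F1 over the whole set at once.
full_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Macro F1 per batch of 2, then averaged -- what a naive running metric does.
batch_scores = [
    f1_score(y_true[i:i + 2], y_pred[i:i + 2], average="macro", zero_division=0)
    for i in range(0, 6, 2)
]
batchwise_f1 = sum(batch_scores) / len(batch_scores)

print(round(full_f1, 3), round(batchwise_f1, 3))  # 0.667 0.556
```

So if fast.ai averages a per-batch F1 during training but my reload-and-predict run scores the whole validation set at once, a gap like 0.65 vs 0.7 would be expected even with identical predictions.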