I'm training a ULMFiT text classification model to detect tweet stance. I have 1,000 labeled tweets, which I split into 700 train, 150 validation, and 150 test using scikit-learn's train_test_split with stratification. This is a single-label multiclass task with 3 labels: support, oppose, and unclear, distributed roughly 42/55/3 percent across the dataset.

I have also implemented a custom F1 metric (higher is better) based on scikit-learn's f1_score; I have compared it to fastai's FBeta(average='macro', beta=1) and the results are the same. During training, the validation F1 peaks at around 0.65, but on the test set the F1 is around 0.72. It strikes me as odd that test F1 is so much higher than validation F1, since it is usually the other way around, and I see two potential reasons for this:
- Because the validation and test sets are only 150 tweets each, there is a lot of sampling variance, and the test set may just happen to be easier than the validation set
- Dropout is not turned off when validation metrics are evaluated at the end of every epoch, which would bias the validation F1 downward
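As a quick sanity check on theory 2, here is a standalone PyTorch snippet (not my actual training code) showing how a Dropout layer behaves in train vs. eval mode. If the library puts the model in eval mode before computing validation metrics, dropout would already be disabled:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()                 # training mode: dropout zeroes ~half the activations
train_out = drop(x)

drop.eval()                  # eval mode: dropout is an identity op
eval_out = drop(x)

print((train_out == 0).float().mean())  # roughly 0.5
print(torch.equal(eval_out, x))         # True: input passes through unchanged
```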
Obviously, I hope that theory 2 is correct, but I suspect theory 1 is more plausible. Any thoughts?