2 different training strategies reach the same accuracy on the validation set but different accuracy on the test set

Hi everyone,
I am working on a binary classification problem and ran into something during my experiments that I would like your insight on. I ran 2 experiments with different sets of data augmentations while keeping the learning rate and batch size the same. Both reach 98% accuracy on the validation set, but they reach different accuracies on the test set: 90% and 76%. My validation set is a random 20% split of the training data, and my test set is 50% from the training distribution and 50% from an unseen deployment distribution. Any insights on how to interpret this? It also appears that I am indirectly optimizing for the test set, which goes against the general wisdom of using the validation set to tune hyperparameters and the test set only to evaluate performance. Am I doing something wrong here?

Others with more experience may have more actionable insights but I’ll give my two cents here. My initial reactions to reading this:

  • Your validation set, which is chosen at random from the training data, does not represent your test set, half of which comes from a different distribution.
  • You may want to analyze your test set (look at most if not all of the items individually, if feasible, and calculate any group statistics you care about) to understand what it consists of.
  • Instead of randomly selecting your validation set, you may want to manually curate it so that it more accurately represents the features that are present in the unseen deployment situation.
  • The data augmentation that performs better on the test set is the one that you should use to train your model.
  • Since you are using your test set to modify your training/validation process, you should create a new unseen test set to evaluate any improvements you make.
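To make the test set analysis concrete, here is a rough sketch of how you could break accuracy down by where each test item came from. The function names and the "train"/"deploy" tags are made up for illustration, not taken from your setup. The second helper is a back-of-envelope calculation and rests on an assumption I can't verify: that the training-distribution half of your test set scores about 98%, like your validation set.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel sequences of labels, predictions, and group tags."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


def implied_shifted_accuracy(overall_acc, in_dist_acc, in_dist_fraction=0.5):
    """Solve overall = f * in_dist + (1 - f) * shifted for the shifted-part accuracy."""
    return (overall_acc - in_dist_fraction * in_dist_acc) / (1.0 - in_dist_fraction)


# Toy data only: hypothetical tags marking each test item's source.
toy = accuracy_by_group(
    y_true=[1, 0, 1, 0],
    y_pred=[1, 0, 0, 0],
    groups=["train", "train", "deploy", "deploy"],
)
print(toy)

# ASSUMPTION: the in-distribution half of the test set scores ~98%.
for overall in (0.90, 0.76):
    print(overall, round(implied_shifted_accuracy(overall, 0.98), 2))
```

Under that assumption, 90% overall would correspond to roughly 82% on the deployment half and 76% overall to roughly 54%, so the gap between the two augmentation strategies may sit almost entirely in the unseen distribution. Measuring the per-group accuracy directly is better than relying on this estimate.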

Thanks for sharing this interesting use case!