Discussion on "Benchmarks becoming stale"

I remember reading a while back (maybe on Twitter, a blog post, or a paper) that metrics, datasets, and benchmarks become stale once people have optimized against them for long enough.

For example:

  1. Overfitting on the metric: optimizing for accuracy can produce a model with high accuracy but a low F-score, mAP, etc.

  2. Dataset staleness: once the community has worked with a dataset long enough, models end up fitting its quirks rather than the underlying task.

  3. Overfitting on the validation set: even if we do the train–val split properly, if we repeat the tune-and-evaluate cycle enough times we will eventually overfit the validation set as well. (Which is why we need a held-out test set that we only look at once.)
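To make point 1 concrete, here is a minimal sketch (my own toy example, not from any particular paper): on a dataset with 95% negatives, a classifier that always predicts the majority class gets 95% accuracy but an F1 of zero on the positive class.

```python
import numpy as np

# Toy imbalanced problem: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)  # always predict the majority class

accuracy = (y_pred == y_true).mean()

# F1 on the positive class, computed by hand.
tp = ((y_pred == 1) & (y_true == 1)).sum()
fp = ((y_pred == 1) & (y_true == 0)).sum()
fn = ((y_pred == 0) & (y_true == 1)).sum()
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(accuracy, f1)  # → 0.95 0.0
```

So a model can look excellent on the metric you optimized and be useless by another.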
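Point 3 can be simulated directly (again a hypothetical sketch of mine, with made-up seeds and sizes): the labels below are pure coin flips, so no classifier can genuinely beat 50% accuracy. If we "tune" by picking, out of many random classifiers, the one with the best validation accuracy, the winner's validation score drifts well above chance, while its score on an untouched test set stays near 50%. The gap is exactly overfitting to the validation set through repeated selection.

```python
import numpy as np

rng = np.random.default_rng(2024)
n_val, n_test, n_trials = 200, 200, 500
y_val = rng.integers(0, 2, n_val)    # random validation labels
y_test = rng.integers(0, 2, n_test)  # random test labels

# "Hyperparameter search": each trial is a random classifier; keep the
# one that scores best on the validation set.
best_val_acc, best_seed = 0.0, None
for seed in range(n_trials):
    preds = np.random.default_rng(seed).integers(0, 2, n_val)
    acc = (preds == y_val).mean()
    if acc > best_val_acc:
        best_val_acc, best_seed = acc, seed

# Evaluate the selected "model" once on the held-out test set.
test_preds = np.random.default_rng(best_seed).integers(0, 2, n_test)
test_acc = (test_preds == y_test).mean()

print(f"selected validation accuracy: {best_val_acc:.2f}")  # well above 0.5
print(f"test accuracy of same model:  {test_acc:.2f}")      # near 0.5
```

The more trials we run against the same validation set, the larger the inflated validation score gets, which is why the test set has to stay untouched until the very end.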

However, I am not able to find a resource that ties these ideas together. Specifically, I am looking for the source of the quote that "if you are optimizing for a metric, you should not use the same metric to evaluate the model."

Can someone please point me to some literature?