Any tips for dealing with overfitting in language models?

Besides dropout 🙂

I’ve thought about duplicating data, or folding the validation set back into training once a good-enough model is found. Not sure if these are wise endeavors or if there are better ways (especially with smaller datasets).

Other ideas would be:

  • increasing/decreasing # of hidden layers
  • increasing/decreasing # of activations per layer
  • increasing/decreasing embedding matrix size
  • increasing/decreasing bptt length
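To see how those knobs trade off, here's a rough back-of-envelope parameter count for an LSTM-style language model (a pure-Python sketch; the gate arithmetic is a simplification — e.g. PyTorch's LSTM keeps two bias vectors per layer — and the function name is just for illustration):

```python
def lstm_lm_params(vocab, emb, hidden, layers):
    """Approximate trainable-parameter count for an LSTM language model."""
    params = vocab * emb  # embedding matrix
    for i in range(layers):
        inp = emb if i == 0 else hidden
        # 4 gates, each with input weights, recurrent weights, and a bias
        params += 4 * (hidden * inp + hidden * hidden + hidden)
    params += hidden * vocab + vocab  # decoder projection back to the vocab
    return params

base = lstm_lm_params(vocab=30_000, emb=400, hidden=1150, layers=3)
small = lstm_lm_params(vocab=30_000, emb=400, hidden=575, layers=3)
```

Shrinking any one of these cuts capacity (and so overfitting pressure). Note that bptt doesn't change the parameter count at all — it only controls how far back gradients propagate through time.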

My notes from Geoffrey Hinton’s Coursera class say:

Ways to reduce overfitting:

  • More data
  • More layers
  • Weight-decay
  • Weight-sharing
  • Early stopping
  • Model averaging
  • Bayesian fitting of neural nets
  • Dropout
  • Generative pre-training
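Two of those — weight decay and early stopping — are cheap to try together. A minimal numpy illustration of the mechanics on a synthetic regression task (not a language model; all names and hyperparameters here are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=200)
X_tr, X_val, y_tr, y_val = X[:150], X[150:], y[:150], y[150:]

w = np.zeros(10)
lr, wd, patience = 0.01, 1e-3, 5
best_val, best_w, bad = np.inf, w.copy(), 0
for epoch in range(500):
    # MSE gradient plus an L2 (weight-decay) penalty pulling weights to zero
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr) + wd * w
    w -= lr * grad
    val = np.mean((X_val @ w - y_val) ** 2)
    if val < best_val:
        best_val, best_w, bad = val, w.copy(), 0  # new best: reset patience
    else:
        bad += 1
        if bad >= patience:  # early stopping: no val improvement for a while
            break
```

The pattern is the same in any framework: penalize weight norms in the update, track validation loss each epoch, keep the best checkpoint, and stop once it hasn't improved for `patience` epochs.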

But that’s a class from 2013, so I’m sure there are newer techniques.