How long should I fine-tune a language model?

I am following the new tutorial on fine-tuning LMs for downstream tasks.


The original LM (pretrained on Wikipedia) is unfrozen and fine-tuned for only 1 epoch on the new dataset. The accuracy is about 27% (which I guess is the accuracy of the LM at predicting the next word?). If I increase the number of epochs, the accuracy of the LM increases. If I run it long enough, it can get above 80%. Does this mean that the LM is overfitting to my specific dataset? And will this overwrite the weights learned by the original LM?

My dataset is rather small, with about 1000 samples. Am I better off fine-tuning it for only a few epochs?
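
To make the question concrete, here is a rough sketch of the check I have in mind: hold out part of the data, record the validation loss after each epoch, and stop once it has not improved for a couple of epochs. This is plain Python and not part of the tutorial; the function name `best_epoch`, the `patience` parameter, and the loss values in the usage note are all hypothetical.

```python
def best_epoch(val_losses, patience=2):
    """Return the 0-based index of the epoch with the lowest validation loss,
    scanning until the loss has failed to improve for `patience` epochs in a row.
    Hypothetical helper for illustration only."""
    best_idx = 0
    best_loss = float("inf")
    waited = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_idx = epoch
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss stopped improving: likely overfitting
    return best_idx
```

With made-up losses such as `[4.1, 3.6, 3.4, 3.5, 3.7, 3.9]`, this would pick the third epoch (index 2) as the point to stop. Would something like this be a sensible way to decide the number of epochs on a dataset this small?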