Language Model accuracy is unusually high

I'm training a language model and the accuracy is unusually high … so much so that I can't help but feel that something is off, but I can't figure out what it is.

Any ideas what might be causing this?

I also notice that I can never get the model to start overfitting. The vocab size is 6,212 and the total token count is 393,876.

I would run a prediction on some text and compare the generated text against the original. If it gets about 9 out of every 10 words right, then your accuracy is believable. On the other hand, if the predictions are mostly wrong, then something is off somewhere. Something along the lines of the sketch below works for eyeballing this.
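
A rough sketch of that check (assuming a PyTorch-style LM whose forward pass returns per-position logits; the names model and spot_check and the tensor shapes are illustrative assumptions, not code from the notebook):

import torch

def spot_check(model, toks, n=200):
    # Feed the first n tokens through the model and compare the argmax
    # next-token prediction at each position with the real next token.
    model.eval()
    x = torch.tensor(toks[:n], dtype=torch.long).unsqueeze(0)  # shape (1, n)
    with torch.no_grad():
        logits = model(x)                                      # shape (1, n, vocab_size)
    preds = logits.argmax(dim=-1).squeeze(0)                   # predicted next-token ids
    targets = torch.tensor(toks[1:n + 1], dtype=torch.long)    # the actual next tokens
    acc = (preds == targets).float().mean().item()
    print(f"eyeball accuracy over {n} tokens: {acc:.1%}")
    return acc

# e.g. spot_check(model, val_toks) with the token array loaded from disk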

Thanks for the reply, @machinethink.

Checked the accuracy and it was ~93%. Checked the actual predictions and they looked more like 20-25%. Had a glass of scotch, took a break, came back to the notebook and noticed this gem of a typo:

trn_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')
val_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')  # <-- still loading the training tokens

… and then it all made sense 🙂
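
For anyone else who trips over this, the fix is just pointing the second load at the validation tokens:

trn_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')
val_toks = np.load(LM_PATH/'tmp'/f'val_toks{corpus_suf}.npy')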

If you are evaluating your model on the same training set it was trained on, you are going to get amazing accuracy, indeed, way too amazing.

I was about to suggest you look at that. If it looks too good to be true, it is almost always a train/test leakage (or overlap) issue.
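
A cheap guard against this whole class of bug, reusing the LM_PATH and corpus_suf variables from the notebook above, is to fail loudly if the two arrays ever end up identical:

import numpy as np

trn_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')
val_toks = np.load(LM_PATH/'tmp'/f'val_toks{corpus_suf}.npy')

# A copy-paste typo that loads the same file twice will trip this immediately
assert not np.array_equal(trn_toks, val_toks), "train and validation tokens are identical!"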
