Language Model accuracy is unusually high

I'm training a language model and the accuracy is unusually high … so much so that I can't help but feel that something is off, but I can't figure out what it is.

Any ideas what might be causing this?

I also notice that I can never get the model to start overfitting. The vocab size is 6,212 and the total token count is 393,876.

I would run a prediction on some text and compare the generated text against the original. If it gets about 9 out of every 10 words right, then your accuracy is believable. On the other hand, if the predictions are mostly wrong, then something is off somewhere. Something along the lines of the sketch below works for eyeballing this.
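
A rough sketch of that check (assuming a PyTorch-style LM whose forward pass returns per-position logits; the names model and spot_check and the tensor shapes are illustrative assumptions, not code from the notebook):

import torch

def spot_check(model, toks, n=200):
    # Feed the first n tokens through the model and compare the argmax
    # next-token prediction at each position with the real next token.
    model.eval()
    x = torch.tensor(toks[:n], dtype=torch.long).unsqueeze(0)  # shape (1, n)
    with torch.no_grad():
        logits = model(x)                                      # shape (1, n, vocab_size)
    preds = logits.argmax(dim=-1).squeeze(0)                   # predicted next-token ids
    targets = torch.tensor(toks[1:n + 1], dtype=torch.long)    # the actual next tokens
    acc = (preds == targets).float().mean().item()
    print(f"eyeball accuracy over {n} tokens: {acc:.1%}")
    return acc

# e.g. spot_check(model, val_toks) with the token array loaded from disk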

Thanks for the reply, @machinethink.

Checked the accuracy and it was ~93%. Checked the actual predictions and they looked more like 20-25%. Had a glass of scotch, took a break, came back to the notebook and noticed this gem of a typo:

trn_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')
val_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')  # <-- still loading the training tokens

… and then it all made sense 🙂
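
For anyone else who trips over this, the fix is just pointing the second load at the validation tokens:

trn_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')
val_toks = np.load(LM_PATH/'tmp'/f'val_toks{corpus_suf}.npy')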

If you are evaluating your model on the same training set it was trained on, you are going to get amazing accuracy, indeed, way too amazing.

I was about to suggest you look at that. If it looks too good to be true, it is almost always a train/test leakage (or overlap) issue.
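
A cheap guard against this whole class of bug, reusing the LM_PATH and corpus_suf variables from the notebook above, is to fail loudly if the two arrays ever end up identical:

import numpy as np

trn_toks = np.load(LM_PATH/'tmp'/f'trn_toks{corpus_suf}.npy')
val_toks = np.load(LM_PATH/'tmp'/f'val_toks{corpus_suf}.npy')

# A copy-paste typo that loads the same file twice will trip this immediately
assert not np.array_equal(trn_toks, val_toks), "train and validation tokens are identical!"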
