I have been using the imdb classifier as a basis for classifying some legal documents. My model reaches around 75-80% accuracy on the validation set and nearly 100% on the training set.
The training/validation set is 50/50 class 0 and class 1. However, when I tried to give the model new documents (using learn.predict) it predicted that 95% of them were in class 1! Perplexed, I ran my original training set (which is 50/50) through learn.predict, and again 90% were in class 1.
I don’t understand how the model could have achieved 100% training accuracy on a 50/50 training set while predicting that almost all of the docs are in one class.
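To put a number on the contradiction, here is a rough sanity check in plain Python. It is only a sketch: the function names and the toy prediction list are made up for illustration and are not part of fastai or my actual output. If half the documents are truly class 1 but 90% are predicted as class 1, at most 60% of the predictions can be correct, however the errors are distributed.

```python
from collections import Counter

def max_possible_accuracy(true_frac_class1, pred_frac_class1):
    """Upper bound on accuracy given only the true and predicted class-1 fractions."""
    # Predictions can agree with the labels on at most the smaller share of each class.
    agree_on_1 = min(true_frac_class1, pred_frac_class1)
    agree_on_0 = min(1 - true_frac_class1, 1 - pred_frac_class1)
    return agree_on_1 + agree_on_0

def class_fractions(labels):
    """Fraction of items assigned to each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

# Toy stand-in for per-document predictions (hypothetical data, 90% class 1).
toy_preds = [1] * 9 + [0]
fractions = class_fractions(toy_preds)

# True split is 50/50, so the bound is about 0.6.
bound = max_possible_accuracy(0.5, fractions[1])
print(fractions, round(bound, 2))
```

A bound of about 60% cannot coexist with nearly 100% training accuracy on the same predictions. My guess is therefore that the documents passed to learn.predict are not being processed the same way as during training, though I have not confirmed this.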
My documents are much longer than the imdb documents, with an average length of 5500 words.