Lesson 3-imdb imbalanced class prediction issue

I have been using the imdb classifier as a basis for classifying some legal documents. My model reaches around 75-80% accuracy on the validation set and nearly 100% on the training set.

The training/validation data is split 50/50 between class 0 and class 1. However, when I gave the model new documents (using learn.predict), it predicted that 95% of them were in class 1! Perplexed, I ran my original training set (which is 50/50) through learn.predict, and again 90% were predicted as class 1.

I don’t understand how the model could have achieved 100% training accuracy on a 50/50 training set while predicting that almost all of the same docs are in one class.

My documents are much longer than the imdb documents, with an average length of 5500 words.

This doesn’t sound right. Perhaps share the relevant code so someone can try to spot the problem.
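
To put a number on why it doesn't sound right, here is a minimal sketch of the arithmetic. It only assumes binary labels and uses the fractions quoted above; `max_possible_accuracy` is a hypothetical helper name, not part of fastai.

```python
def max_possible_accuracy(true_frac_pos: float, pred_frac_pos: float) -> float:
    """Upper bound on binary accuracy, given only the fraction of class 1
    in the labels and the fraction of class 1 in the predictions."""
    # At best, every predicted class 1 lands on a true class 1 (and likewise
    # for class 0), so agreement is capped by the smaller fraction per class.
    agree_pos = min(true_frac_pos, pred_frac_pos)
    agree_neg = min(1 - true_frac_pos, 1 - pred_frac_pos)
    return agree_pos + agree_neg


# Figures from the post: labels are 50/50, predictions are 90% class 1.
bound = max_possible_accuracy(0.5, 0.9)
print(f"{bound:.2f}")  # 0.60
```

So predictions that are 90% class 1 on a 50/50 set cannot be more than 60% accurate, which means the near-100% training accuracy and the learn.predict results are probably not coming from the same pipeline. One possibility (a guess, since no code has been posted) is that the text passed to learn.predict is not being processed the same way as the training data, or that the class indices are being read in a different order.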