I’ve been testing a fake-news dataset, following the approach in the Lesson 3 IMDB notebook. I’m using the fake_or_real_news.csv file that was prepared by George McIntire in 2017 - see https://opendatascience.com/how-to-build-a-fake-news-classification-model/. The csv file was on GitHub as recently as two weeks ago but is now gone - I don’t know why. I’ve reached out to George for feedback.
The dataset has about 6,000 items, each consisting of the headline and text of a news article plus a label, FAKE or REAL.
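For anyone who wants to try it, this is roughly how I’d load the CSV into fastai v1 DataBunches (a sketch, not my exact code; the path, the title/text/label column names, and the 80/20 split are assumptions based on the dataset description):

import pandas as pd
from fastai.text import *

path = Path('data/fake_news')                      # assumed location of the csv
df = pd.read_csv(path/'fake_or_real_news.csv')

# simple random 80/20 train/validation split
valid_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(valid_df.index)

# language-model data: just the text, labels ignored
data_lm = TextLMDataBunch.from_df(path, train_df, valid_df,
                                  text_cols=['title', 'text'])

# classifier data: same vocab as the language model, FAKE/REAL labels
data_clas = TextClasDataBunch.from_df(path, train_df, valid_df,
                                      text_cols=['title', 'text'],
                                      label_cols='label',
                                      vocab=data_lm.train_ds.vocab, bs=32)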
As with IMDB, I only got about 35% accuracy for the fine-tuned language model (next-word prediction). But the classifier got to 99% in one epoch:
learn_c.fit_one_cycle(1, 1e-2, moms=(0.8,0.7))
Total time: 11:14
epoch train_loss valid_loss accuracy
1 0.241235 0.061616 0.991318
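For context, the learners follow the Lesson 3 pattern, roughly like this (a sketch; the encoder name 'ft_enc', the drop_mult values, and the language-model training schedule are assumptions, not my exact settings):

# fine-tune the pretrained AWD-LSTM language model on the news text
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2, moms=(0.8, 0.7))
learn.unfreeze()
learn.fit_one_cycle(1, 1e-3, moms=(0.8, 0.7))
learn.save_encoder('ft_enc')               # save the fine-tuned encoder

# build the classifier on top of the fine-tuned encoder
learn_c = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_c.load_encoder('ft_enc')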
Just to see what would happen, I went ahead with gradually unfreezing the last 2 layer groups, then 3, and then all layers (see the sketch after the results below). The highest accuracy was still about 99%, although the losses were quite a bit lower:
learn_c.unfreeze()
learn_c.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))
Total time: 28:55
epoch train_loss valid_loss accuracy
1 0.120577 0.029395 0.990529
2 0.089659 0.045305 0.984215
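For completeness, the intermediate unfreezing steps I ran before the full unfreeze above looked roughly like this (a sketch; the per-step learning rates are assumptions, following the Lesson 3 notebook):

learn_c.freeze_to(-2)       # unfreeze the last two layer groups
learn_c.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2), moms=(0.8, 0.7))

learn_c.freeze_to(-3)       # unfreeze the last three layer groups
learn_c.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3), moms=(0.8, 0.7))

# then the full unfreeze shown above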
I would appreciate some opinions on whether this is too good to be true! I haven’t done enough analysis yet to see whether something is off here. The model seems to be consistently underfitting (training loss stays above validation loss), but I don’t know if that’s a problem.
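One sanity check I plan to run is looking at the confusion matrix and the most-confused pairs, along these lines (a sketch using fastai v1’s ClassificationInterpretation):

# inspect where the classifier is actually making mistakes
interp = ClassificationInterpretation.from_learner(learn_c)
interp.plot_confusion_matrix()
interp.most_confused()      # (actual, predicted, count) for the errors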
I tried a few predictions, starting with short samples (a sentence or two), but it turned out the model was consistently classifying short samples as FAKE. My “aha” moment came when I realized that makes sense: the model needs enough text to classify from language patterns alone, something that’s not at all obvious to an algorithm even though it’s obvious to a human (like “hillary clinton accepted the democratic nomination for president thursday night”). It seems to work well with about 1,000 characters of input, but I haven’t analyzed that issue either - I’m eager to move on to a multi-class dataset.
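This is the kind of spot check I mean (a sketch; the sample strings and file name are just illustrations):

# a short snippet tends to come back FAKE; a longer article body works better
short_text = "hillary clinton accepted the democratic nomination for president thursday night"
long_text = open('some_article.txt').read()   # ~1,000+ characters of article text (hypothetical file)

print(learn_c.predict(short_text))   # returns (category, class index, probabilities)
print(learn_c.predict(long_text))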
I’ll post the notebook on GitHub later.