NLP for fake news

I’m very interested in using fastai/NLP to identify fake news. I’ve been looking for a good dataset to experiment with, but haven’t yet seen anything that’s both comprehensive and current. I wonder, though, whether it really needs to be recent?

I found this on Quora:

and am looking at his top two choices:

Kaggle -
George McIntire -

But before I spend much time on this I wanted to check with the fastai community. Any suggestions? Does anyone know of a good dataset with both positive and negative examples?



Kaggle has a fake news competition live in the ‘in class’ section with a 400k size dataset. Of course, one must abide by the data usage rules, and it appears these are news headlines rather than full article content.

Thanks RobG - I wasn’t aware of that. Not exactly what I was looking for, but interesting, and I’ll look further into it.

@ricknta I was able to train a fake news classifier (with 0.949447 accuracy) using the same approach we had with IMDb reviews.

Here are the details:


@bachir Very interesting! See this thread for problems I ran into when I initially tried using

If you read down to the latest posts from today, you can see that it appears that some of the problems I had were caused by running on Gradient. What platform did you run on?

@ricknta I was running into a similar problem with the data block API. What I usually do to locate the problem is split the chain of commands into separate lines, like:

data_lm = TextList.from_csv(....)
data_lm = data_lm.split....
data_lm = data_lm.label....
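
The same idea can be sketched in plain Python, independent of fastai: run each stage on its own and report which one raised. Everything here is a stand-in for illustration. `run_stepwise`, the stage names, and the toy rows are hypothetical and are not part of fastai or of the notebook discussed above.

```python
from typing import Any, Callable, List, Tuple


def run_stepwise(start: Any, steps: List[Tuple[str, Callable[[Any], Any]]]) -> Any:
    """Apply each named step in turn and say which one fails (hypothetical helper)."""
    value = start
    for name, step in steps:
        try:
            value = step(value)
        except Exception as exc:
            # Re-raise with the stage name so the failing link in the chain is obvious.
            raise RuntimeError(f"step '{name}' failed: {exc!r}") from exc
        print(f"{name}: ok -> {type(value).__name__}")
    return value


# Toy (text, label) rows standing in for a fake news CSV; not real data.
rows = [("Some headline", "fake"), ("Another headline", "real"), ("Third", "fake")]

# Stand-in stages mirroring load / split / label; these are not fastai calls.
steps = [
    ("load", lambda data: list(data)),
    ("split", lambda data: (data[:2], data[2:])),
    ("label", lambda parts: [[label for _, label in part] for part in parts]),
]

result = run_stepwise(rows, steps)
```

With a real pipeline, each stage would be one of the chained calls above, and the error message would name the stage to inspect first.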

I’ve been using Colab since the beginning of the course. It’s not a serious solution, but it’s OK for most use cases. However, I realized that, surprisingly, NLP is greedier for GPU than image work, and as a result the Colab kernel is constantly on the edge of crashing!


It seems George McIntire’s dataset has been removed or moved somewhere else. Can you update the source link for that dataset?

Yes, I noticed it disappeared recently, but I included his dataset in my GitHub repo. Here’s the repo with an initial notebook and the dataset:

I did run a model using his dataset and posted some observations about it: