NLP for fake news

ricknta · November 26, 2018, 11:29pm

I’m very interested in using fastai/NLP to identify fake news. I’ve been looking for a good dataset to experiment with, but haven’t yet seen anything that’s both comprehensive and current. I wonder, though, whether it really needs to be recent?

I found this on Quora:

https://www.quora.com/What-are-some-datasets-about-fake-news

and am looking at the top two choices listed there:

Kaggle - https://www.kaggle.com/mrisdal/fake-news
George McIntire - https://github.com/GeorgeMcIntire/fake_real_news_dataset/blob/master/fake_or_real_news.csv.zip

But before I spend much time on this I wanted to check with the fastai community. Any suggestions? Does anyone know of a good dataset with both positive and negative examples?

thanks!

digitalspecialists · November 27, 2018, 9:37am

Kaggle has a fake news competition live in the ‘InClass’ section with a 400k-sized dataset. Of course, one must abide by the data usage rules, and it appears these are news headlines rather than article content. https://www.kaggle.com/c/fake-news-pair-classification-challenge

ricknta · November 30, 2018, 1:17am

Thanks RobG - I wasn’t aware of that. Not exactly what I was looking for, but interesting, and I’ll look further into it.

bachir · December 2, 2018, 5:23pm

@ricknta I was able to train a fake news classifier (with 0.949447 accuracy) using the same approach we had with IMDb reviews.

Here are the details

ricknta · December 5, 2018, 1:45am

@bachir Very interesting! See this thread for the problems I ran into when I first tried using TextList.from_csv:

If you read down to today’s latest posts, it appears that some of my problems were caused by running on Gradient. What platform did you run on?

bachir · December 5, 2018, 9:48am

@ricknta I ran into a similar problem with the data block API. What I usually do to locate the problem is split the chain of commands into separate lines, like:

data_lm = TextList.from_csv(....)
data_lm = data_lm.split....
data_lm = data_lm.label....
....
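
To illustrate the idea without tying it to a particular fastai version, here is a minimal sketch using a made-up stand-in pipeline. `ToyTextList`, its methods, `run_stepwise` and the sample CSV are all hypothetical names for this example and are not the fastai API. Each stage runs on its own line, so when something breaks you know which stage it was:

```python
import csv
import io


class ToyTextList:
    """Hypothetical stand-in for a chained data pipeline (not the fastai API)."""

    def __init__(self, rows):
        self.rows = rows
        self.train, self.valid = [], []
        self.labels = []

    @classmethod
    def from_csv(cls, text):
        # Parse CSV text into a list of dicts, one per row.
        return cls(list(csv.DictReader(io.StringIO(text))))

    def split_by_idx(self, valid_idx):
        # Rows whose index is in valid_idx go to the validation set.
        valid_idx = set(valid_idx)
        self.valid = [r for i, r in enumerate(self.rows) if i in valid_idx]
        self.train = [r for i, r in enumerate(self.rows) if i not in valid_idx]
        return self

    def label_from_col(self, col):
        # Raises KeyError if the column is missing, the kind of failure
        # that is hard to locate inside one long chained expression.
        self.labels = [r[col] for r in self.train + self.valid]
        return self


def run_stepwise(csv_text, label_col):
    """Run each stage separately and report which stage failed, if any."""
    stage = "from_csv"
    try:
        data = ToyTextList.from_csv(csv_text)
        stage = "split"
        data = data.split_by_idx([0])
        stage = "label"
        data = data.label_from_col(label_col)
    except Exception as exc:
        return stage, repr(exc)
    return "ok", data


# Tiny made-up sample; the column names are assumptions for this sketch.
SAMPLE = "text,label\nsome headline,FAKE\nanother headline,REAL\n"

print(run_stepwise(SAMPLE, "label")[0])   # all stages succeed
print(run_stepwise(SAMPLE, "target")[0])  # fails at the labelling stage
```

In a notebook you would normally just put each stage in its own cell and inspect the intermediate object; the `stage` variable here only makes the same idea explicit.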

I’ve been using Colab since the beginning of the course. It’s not a serious solution, but it’s OK for most use cases. However, I’ve found that, surprisingly, NLP is more GPU-hungry than image work, and as a result the Colab kernel is constantly on the edge of crashing!

akshay827 · February 16, 2019, 2:20pm

It seems GeorgeMcIntire’s dataset has been removed or has been moved somewhere else. Can you update the source link for that dataset?

ricknta · February 17, 2019, 2:20am

Yes, I noticed it disappeared recently, but I included his dataset in my GitHub repo. Here’s the repo with an initial notebook and the dataset: GitHub - ricknta/fake-news: Simple binary classifier for fake news articles using fastai

I did run a model using his dataset and posted some observations about it: