I’m very interested in using fastai/NLP to identify fake news. I’ve been looking for a good dataset to experiment with, but haven’t yet seen anything that’s very comprehensive and current. I wonder though if it really needs to be very current (recent)?
I found this on Quora:
and am looking at his top two choices:
George McIntire - https://github.com/GeorgeMcIntire/fake_real_news_dataset/blob/master/fake_or_real_news.csv.zip
But before I spend much time on this I wanted to check with the fastai community. Any suggestions? Does anyone know of a good dataset with both positive and negative examples?
Kaggle has a fake news competition live in the ‘InClass’ section with a 400k-row dataset. Of course, one must abide by the data usage rules, and it appears these are news headlines rather than full article text.
Thanks RobG - I wasn’t aware of that. It’s not exactly what I was looking for, but it’s interesting and I’ll look further into it.
@ricknta I was able to train a fake news classifier (0.949447 accuracy) using the approach we used with the IMDb reviews sample.
Here are the details:
@bachir Very interesting! See this thread for problems I ran into when I initially tried using
Also if I try .label_for_lm with the cleaned-up data:
data_lm = (TextList.from_csv(path, 'fake_or_real_news_clean_4000-2.csv', col='text')
           # We randomly split and keep 10% for validation
           .random_split_by_pct(0.1)
           # We want to do a language model so we label accordingly
           .label_for_lm())
Then I once again get the error at the previous line (.random_split_by_pct):
During handling of the above exception, a…
If you read down to the latest posts from today, you can see that it appears that some of the problems I had were caused by running on Gradient. What platform did you run on?
@ricknta I was running into a similar problem with the data block API. What I usually do to locate the problem is split the chain of calls into separate lines, like:
data_lm = TextList.from_csv(....)
data_lm = data_lm.split....
data_lm = data_lm.label....
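To make the idea concrete, here’s a minimal, self-contained illustration of the pattern in plain Python (a toy builder class standing in for the fastai API, not the real thing): with one statement per step, the traceback’s line number tells you exactly which call failed, instead of pointing at one long chained expression.

```python
class Builder:
    """Toy stand-in for a fluent data-block API (not fastai)."""
    def __init__(self, items):
        self.items = list(items)

    def split(self, valid_pct):
        # keep the first valid_pct of items for validation
        n = int(len(self.items) * valid_pct)
        self.valid, self.train = self.items[:n], self.items[n:]
        return self

    def label(self, fn):
        self.labels = [fn(x) for x in self.train]
        return self

# Chained form: a failure anywhere surfaces on one long expression.
# Split form: each step is its own line, so the traceback pinpoints it.
b = Builder(range(100))
b = b.split(0.1)           # step 1: OK -> 90 train / 10 valid items
err = None
try:
    b = b.label(None)      # step 2: fails (None is not callable)
except TypeError as e:
    err = e
print("failing step:", "label" if err else "none")
```

The same rewrite applies verbatim to the fastai chain above: each `data_lm = data_lm.…` line fails (or succeeds) on its own, which is exactly what you want when hunting the `.random_split_by_pct` error.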
I’ve been using Colab since the beginning of the course; it’s not a serious solution but it’s OK for most use cases. However, I was surprised to find that NLP is more GPU-hungry than image work, and as a result the Colab kernel is constantly on the edge of crashing!
It seems GeorgeMcIntire’s dataset has been removed or moved somewhere else. Can you update the source link for that dataset?
Yes I noticed it disappeared recently, but I included his dataset in my github repo. Here’s the repo with an initial nb and the dataset:
I did run a model using his dataset and posted some observations about it:
Thanks Jeremy, that makes sense, so I went back and read more text samples, since I recalled seeing some where the source was mentioned. There’s no field for the article source, and the model only sees the article text, so I looked for the news source (or something related) showing up in the text itself. It turns out there aren’t many of those: maybe 5%. Most are “clean”, with no source markers that I could see.
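For anyone wanting to automate that kind of leakage check, here’s a small sketch of the idea (the outlet list and sample texts below are made up for illustration): scan each article text for known outlet names and report the fraction of articles flagged.

```python
# Hedged sketch: flag articles whose text mentions a known outlet name,
# since such markers could let a classifier cheat on fake-vs-real labels.
outlets = ["Reuters", "Breitbart", "CNN", "Infowars"]  # illustrative list

texts = [
    "Officials said on Tuesday that talks had stalled.",
    "According to CNN, the report was released early.",
    "The senator denied the claims in a statement.",
    "(Reuters) - Markets fell sharply on Friday.",
]

def mentions_outlet(text, outlets):
    # case-insensitive substring match against each outlet name
    return any(name.lower() in text.lower() for name in outlets)

flagged = [t for t in texts if mentions_outlet(t, outlets)]
frac = len(flagged) / len(texts)
print(f"{len(flagged)}/{len(texts)} texts ({frac:.0%}) mention a known outlet")
```

Run over the full dataset, a number like the ~5% above would fall out directly, and the flagged rows could be cleaned or dropped before training.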
But I looked deeper into the dataset heritage and it was (in my opinion) a bit flawed, …