Thanks Jeremy, that makes sense. I went back and read more text samples, since I recalled seeing some where the source was mentioned. There’s no field for the article source, and the model only sees the article text, so I looked for the news source (or something related) showing up in the text itself. Turns out there aren’t many of those - maybe 5%. Most are “clean”, with no source markers that I could see.
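For anyone who wants to repeat that kind of check, here’s a rough sketch of how it could be automated. This isn’t what I actually ran (I read samples by hand); the marker list and the sample texts are made up for illustration, and a simple substring match like this will miss less obvious giveaways.

```python
import re

def fraction_mentioning_source(texts, source_markers):
    """Return the fraction of texts that contain any of the given markers."""
    if not texts or not source_markers:
        return 0.0
    # One case-insensitive pattern that matches any marker literally.
    pattern = re.compile(
        "|".join(re.escape(m) for m in source_markers), re.IGNORECASE
    )
    hits = sum(1 for t in texts if pattern.search(t))
    return hits / len(texts)

# Hypothetical markers and texts, for illustration only.
markers = ["Example News", "examplenews.com"]
sample_texts = [
    "Reported by Example News: markets fell today.",
    "The council met on Tuesday.",
    "Scientists published a new study.",
    "Rain is expected this weekend.",
]
print(fraction_mentioning_source(sample_texts, markers))
```

If the fraction comes out high, the model may be learning the source rather than the content, which is the leakage Jeremy was asking about.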
But I looked deeper into where the dataset came from, and it is (in my opinion) a bit flawed, partly because it was labelled based on the article source (website) and not on the actual content. McIntire used a Kaggle fake news dataset (https://www.kaggle.com/mrisdal/fake-news) from 2016 for his ‘fake’ records and scraped AllSides (https://www.allsides.com/unbiased-balanced-news) for his ‘real’ records. The Kaggle ds was in turn based on a Chrome add-in, BS Detector (https://bsdetector.tech/), which was itself based on the classification of online sources by OpenSources (http://www.opensources.co/), which actually has 12 classes. The Kaggle ds collected articles from the most ‘fake’ websites, with 8 classes, and McIntire simplified the labelling into just FAKE, which I think is a bit misleading (e.g., that includes the ones labeled ‘satire’).
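To make the point about the simplified labelling concrete, here’s a small sketch of what collapsing the classes throws away. The label values are placeholders I made up (only ‘satire’ is a class I mentioned above), and the counts are not taken from the real data.

```python
from collections import Counter

def collapse_to_fake(fine_labels):
    """Map every fine-grained label to the single label FAKE."""
    return ["FAKE"] * len(fine_labels)

def share_by_class(fine_labels):
    """Fraction of records in each fine-grained class (what FAKE hides)."""
    counts = Counter(fine_labels)
    total = len(fine_labels)
    return {label: n / total for label, n in counts.items()}

# Made-up labels for illustration only; not the real class distribution.
labels = ["satire", "class_a", "class_a", "class_b"]
print(collapse_to_fake(labels))
print(share_by_class(labels))
```

Running `share_by_class` on the original labels before collapsing them would show how much of the FAKE class is really something else, like satire.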
I don’t think an article should be judged fake purely because of its source, and reading some ds samples bears that out: there are quite a few records where the article content seemed fine to me (and I’m pretty skeptical in general) but that were classified as fake.
All that said, it still looks to me like the model works amazingly well at classifying McIntire’s dataset. Because I consider the labelling to be biased and too simplistic, I don’t think this should be used for a real-world application, but it was a good learning experience, and I’m planning to move on to trying the full Kaggle ds. I would appreciate any other feedback or suggestions!