Jeremy, thanks for the suggestion. I read that post on data leakage. Interesting; I hadn’t even heard the term before, and it took me a while to figure out what he means by “cross validation folds”! I do see where leakage would be a problem in the scenarios he focuses on (mostly k-fold cross-validation, if I’m reading it correctly), but I didn’t see much there that seems directly applicable to this case, except of course his solution of holding back a validation set, which I take as gospel and which I think is pretty well baked into fastai.
I’ve read a random sample of the article texts in the dataset (makes for some fascinating reading!) and don’t see anything that might act as a marker of whether an article is ‘real’ or ‘fake’, but maybe I don’t know what to look for? One obvious ‘marker’ might be the words ‘real’ or ‘fake’ in the text, and that definitely occurs. To eliminate any chance of that causing leakage, I ran with a ‘clean’ df from which I had removed any records where ‘real’ was in the text and the label was REAL, or ‘fake’ was in the text and the label was FAKE. That took the dataset down to about 4800 records (from 6000). It still produced about 98% accuracy.
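In case it helps anyone check my logic, the filter amounts to something like the sketch below. This is an illustration rather than the exact notebook code: the field names `text` and `label`, the helper names, and the case-insensitive substring match are all assumptions, and the sample records are made up.

```python
# Hypothetical sketch: field names, helper names and the matching rule
# are assumptions for illustration, not the original notebook code.

def label_word_in_text(text: str, label: str) -> bool:
    """True if the label's own word ('real' or 'fake') appears in the text.

    Case-insensitive substring match, so 'really' also counts as 'real'.
    """
    return label.lower() in text.lower()


def drop_label_word_records(records):
    """Keep only records whose text does not contain their own label word."""
    return [r for r in records if not label_word_in_text(r["text"], r["label"])]


# Tiny made-up sample, just to show the behaviour.
sample = [
    {"text": "This is the real story behind the vote.", "label": "REAL"},
    {"text": "Officials called the report fake.", "label": "FAKE"},
    {"text": "They said it was fake, but it checked out.", "label": "REAL"},
    {"text": "Markets closed higher on Tuesday.", "label": "FAKE"},
]

clean = drop_label_word_records(sample)
print(len(sample), "->", len(clean))
```

With a pandas DataFrame the same check could be applied row by row, e.g. `df[~df.apply(lambda r: label_word_in_text(r["text"], r["label"]), axis=1)]`, again assuming columns named `text` and `label`. Note that a plain substring match also removes records containing words like ‘really’, so it is a fairly aggressive filter.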
I had previously run a few more times with the full dataset and saw a little variability in losses and accuracy, but accuracy was generally around 98%, so it seems like removing the ‘real-REAL’ and ‘fake-FAKE’ records didn’t make much difference.
Interesting situation. Maybe fastai is just that good! Although I admit I’m still a little suspicious… Any other factors I should be looking at?