ImageNet-equivalent dataset for NLP

Is there an ImageNet-equivalent dataset for NLP?

  • Wikipedia?
  • Twitter?
Yes, different ones for different tasks. For language modeling:


But language changes by the epoch (no pun intended).
1800s English is not even close to 2000s English.
Is there any source that can identify that?

I’m not aware of a standard dataset, but Google Books has the source info you’d need. @anamariapopescug do you know of any good text corpus covering a longer period of time?


Hi, I haven’t personally used these, but I know there’s some work on the “digital humanities” side to provide additional corpora -
(e.g. , ).

Is there a rough estimate of the number of tokens required for building a good language model?

I’m not aware of one. Since AWD-LSTM is so new, I doubt anyone has run these experiments. Would make for an interesting post or paper I think.

For identifying the time period in which a piece was produced, I’d wager a simple bag-of-words approach would get you quite a long way. And you could use Google’s n-grams dataset, which is organized by the source books’ publication years, to create your labels.

This works really well for author identification.
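
To make the bag-of-words idea concrete, here is a minimal sketch of a period classifier: a hand-rolled multinomial Naive Bayes over word counts, using only the Python standard library. The four training sentences and the labels "1800s" and "2000s" are made up for illustration; in practice the labelled text would come from something like the n-grams data mentioned above, and a library classifier would likely be a better choice than this toy one.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters/apostrophes; word order is discarded.
    return re.findall(r"[a-z']+", text.lower())


def train(labelled_texts):
    # Count words per label (the "bag of words") and documents per label.
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_texts:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def predict(model, text):
    # Multinomial Naive Bayes with add-one smoothing, scored in log space.
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word not in vocab:
                continue  # words never seen in training carry no signal
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, invented for this example only.
TOY_DATA = [
    ("thou art most welcome hither, good sir", "1800s"),
    ("pray tell whither the carriage hath gone", "1800s"),
    ("please email me the link to the website", "2000s"),
    ("i will text you from my phone later", "2000s"),
]

model = train(TOY_DATA)
print(predict(model, "send the link to my phone"))
print(predict(model, "whither hath thou gone"))
```

The same structure should carry over to author identification by swapping the period labels for author names.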