ImageNet-equivalent dataset for NLP

Is there an ImageNet-equivalent dataset for NLP?

  • Wikipedia?
  • Twitter?

Yes, different ones for different tasks. For language modeling: https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/
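In case it helps anyone landing here later: a minimal sketch of pulling WikiText down for a quick look. I’m assuming the Hugging Face `datasets` package here (not mentioned above); the raw files can also be downloaded straight from the Salesforce page.

```python
# Minimal sketch, assuming `pip install datasets` (Hugging Face).
from datasets import load_dataset

# "wikitext-2-raw-v1" is the small variant, handy for quick experiments;
# "wikitext-103-raw-v1" is the full ~103M-token corpus.
wiki = load_dataset("wikitext", "wikitext-2-raw-v1")

train_lines = wiki["train"]["text"]  # list of raw text lines
n_tokens = sum(len(line.split()) for line in train_lines)
print(f"{n_tokens:,} whitespace tokens in the train split")
```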


But language changes by the epoch (no pun intended).
1800s English is not even close to 2000s English.
Is there any source that accounts for that?

I’m not aware of a standard dataset, but Google Books has the source info you’d need. @anamariapopescug, do you know of any good text corpus covering a longer period of time?


Hi, I haven’t personally used these, but I know there’s some work on the “digital humanities” side to provide additional corpora (e.g. https://dig.hum.uu.nl/corpusscraper-works-with-the-corpus-of-historical-american-english-coha/ and https://www2.fgw.vu.nl/werkbanken/dighum/eresources/linguistics/text_corpora.php).


Is there a rough estimate of the number of tokens required for building a good language model?

I’m not aware of one. Since AWD-LSTM is so new, I doubt anyone has run these experiments. It would make for an interesting post or paper, I think.
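For a rough point of scale, the WikiText splits themselves are useful anchors: WikiText-2 is about 2M tokens and WikiText-103 about 103M. A quick sketch for sizing your own corpus against those (the helper name and directory layout are my own assumptions):

```python
from pathlib import Path

def rough_token_count(corpus_dir: str, pattern: str = "*.txt") -> int:
    """Whitespace token count over all matching files in a directory.
    A crude proxy: a real tokenizer (spaCy, BPE, ...) will count differently."""
    total = 0
    for path in Path(corpus_dir).glob(pattern):
        total += len(path.read_text(encoding="utf-8").split())
    return total

# Compare against WikiText-2 (~2M tokens) or WikiText-103 (~103M), e.g.:
# print(rough_token_count("my_corpus/"))
```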

For identifying the time period in which a piece was produced, I’d wager a simple bag-of-words approach would get you quite a long way; there’s a sketch below. And you could use Google’s n-grams dataset, organized by the source books’ publication years, to create your labels.

This works really well for author identification.
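If it’s useful, here’s roughly what that bag-of-words baseline could look like with scikit-learn. The (passage, decade) pairs are placeholders you’d replace with labels built from the n-grams data, so treat it as a shape sketch rather than a working classifier:

```python
# Illustrative only: the two placeholder rows show the expected data shape.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Whilst the carriage awaited, she pressed the letter to her breast.",
         "The startup pivoted after its app went viral overnight."]
decades = ["1850s", "2000s"]  # labels derived from publication years

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram bag of words
    LogisticRegression(max_iter=1000),
)
model.fit(texts, decades)
print(model.predict(["He doffed his hat and bade her good morrow."]))
```

A tf-idf or n-gram-frequency variant would be a natural next step, but even raw counts tend to pick up on period-specific vocabulary.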
