ImageNet-equivalent dataset for NLP

Is there an ImageNet-equivalent dataset for NLP?

  • Wikipedia?
  • Twitter?

Yes, different ones for different tasks. For language modeling: https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/
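In case it helps anyone landing here later: a minimal sketch of pulling WikiText down for a quick look. I’m assuming the Hugging Face `datasets` package here (not mentioned above); the raw files can also be downloaded straight from the Salesforce page.

```python
# Minimal sketch, assuming `pip install datasets` (Hugging Face).
from datasets import load_dataset

# "wikitext-2-raw-v1" is the small variant, handy for quick experiments;
# "wikitext-103-raw-v1" is the full ~103M-token corpus.
wiki = load_dataset("wikitext", "wikitext-2-raw-v1")

train_lines = wiki["train"]["text"]  # list of raw text lines
n_tokens = sum(len(line.split()) for line in train_lines)
print(f"{n_tokens:,} whitespace tokens in the train split")
```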


But language changes by the epoch (no pun intended).
1800s English is not even close to 2000s English.
Is there any source that accounts for that?

I’m not aware of a standard dataset, but Google Books has the source info you’d need. @anamariapopescug, do you know of any good text corpus covering a longer period of time?


Hi, I haven’t personally used these, but I know there’s some work on the “digital humanities” side to provide additional corpora (e.g. https://dig.hum.uu.nl/corpusscraper-works-with-the-corpus-of-historical-american-english-coha/ and https://www2.fgw.vu.nl/werkbanken/dighum/eresources/linguistics/text_corpora.php).


Is there a rough estimate of the number of tokens required for building a good language model?

I’m not aware of one. Since AWD-LSTM is so new, I doubt anyone has run these experiments. It would make for an interesting post or paper, I think.
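For a rough point of scale, the WikiText splits themselves are useful anchors: WikiText-2 is about 2M tokens and WikiText-103 about 103M. A quick sketch for sizing your own corpus against those (the helper name and directory layout are my own assumptions):

```python
from pathlib import Path

def rough_token_count(corpus_dir: str, pattern: str = "*.txt") -> int:
    """Whitespace token count over all matching files in a directory.
    A crude proxy: a real tokenizer (spaCy, BPE, ...) will count differently."""
    total = 0
    for path in Path(corpus_dir).glob(pattern):
        total += len(path.read_text(encoding="utf-8").split())
    return total

# Compare against WikiText-2 (~2M tokens) or WikiText-103 (~103M), e.g.:
# print(rough_token_count("my_corpus/"))
```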

For identifying the time period in which a piece was produced, I’d wager a simple bag-of-words approach would get you quite a long way; there’s a sketch below. And you could use Google’s n-grams dataset, organized by the source books’ publication years, to create your labels.

This works really well for author identification.
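If it’s useful, here’s roughly what that bag-of-words baseline could look like with scikit-learn. The (passage, decade) pairs are placeholders you’d replace with labels built from the n-grams data, so treat it as a shape sketch rather than a working classifier:

```python
# Illustrative only: the two placeholder rows show the expected data shape.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Whilst the carriage awaited, she pressed the letter to her breast.",
         "The startup pivoted after its app went viral overnight."]
decades = ["1850s", "2000s"]  # labels derived from publication years

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),   # unigram + bigram bag of words
    LogisticRegression(max_iter=1000),
)
model.fit(texts, decades)
print(model.predict(["He doffed his hat and bade her good morrow."]))
```

A tf-idf or n-gram-frequency variant would be a natural next step, but even raw counts tend to pick up on period-specific vocabulary.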
