Is there an ImageNet equivalent dataset for NLP?
- Wikipedia?
- Twitter?
Yes, different ones for different tasks. For language modeling: https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/
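(In case it helps, here's a minimal sketch of pulling WikiText-103 down to play with. It assumes the Hugging Face `datasets` package and that the hosted mirror keeps the `wikitext` / `wikitext-103-raw-v1` names; that loader is not part of the original Salesforce release, just one convenient way to get at the data.)
```python
# Minimal sketch: load WikiText-103 for language-modeling experiments.
# Assumes `pip install datasets`; the dataset/config names below are assumptions
# about the Hugging Face mirror, not the original Salesforce download.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

print(wikitext)                               # train / validation / test splits
print(wikitext["train"][1]["text"][:200])     # peek at one raw line of text
```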
But language changes by the epoch (no pun intended)
1800s English is not even close to 2000s English
Is there any dataset that could be used to identify which period a piece of text comes from?
I’m not aware of a standard dataset, but Google Books has the source info you’d need. @anamariapopescug, do you know of any good text corpora covering a longer period of time?
Hi, I haven’t personally used these, but I know there’s some work on the “digital humanities” side to provide additional corpora -
(e.g. https://dig.hum.uu.nl/corpusscraper-works-with-the-corpus-of-historical-american-english-coha/ and https://www2.fgw.vu.nl/werkbanken/dighum/eresources/linguistics/text_corpora.php).
Is there a rough estimate of the number of tokens required for building a good language model?
I’m not aware of one. Since AWD-LSTM is so new, I doubt anyone has run these experiments. Would make for an interesting post or paper I think.
For identifying the time period in which a piece was written, I’d wager a simple bag-of-words approach would get you quite a long way. And you could use Google’s n-grams dataset, organized by the source books’ publication years, to create your labels (rough sketch below).
This works really well for author identification.
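To make the bag-of-words idea concrete, here's a rough scikit-learn sketch. The `texts` and `decades` lists are hypothetical placeholders; in practice you'd build them from real passages labeled by publication period (e.g. from COHA or from Google Books n-grams keyed by year).
```python
# Rough sketch: date a passage by period with bag-of-words features.
# The tiny `texts` / `decades` lists are placeholders, not real data;
# replace them with passages labeled by publication decade.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Whilst the gentleman tarried at the inn, the coachman saw to the horses.",  # placeholder 1800s-style passage
    "The startup pivoted after its mobile app went viral last quarter.",          # placeholder 2000s-style passage
]
decades = ["1800s", "2000s"]  # labels derived from publication years

# Unigram + bigram counts feeding a linear classifier: simple, but a strong baseline.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, decades)

print(model.predict(["The telegraph operator awaited further instructions."]))
```
The same pipeline works for author identification: just swap the period labels for author names.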