I was wondering if anyone here has experimented with using FitLaM for sequence prediction problems like Part of Speech (POS) tagging or Named Entity Recognition (NER).
@jeremy did you guys by any chance test it on the above-mentioned/related problems?
Best is to pick a reasonably recent paper covering applications you’re interested in, and use the same datasets as them - so you can see how it compares. For instance, the ELMo paper from AI2 might have good examples, or the COVE paper from McCann et al.
MT-Small: WMT 2016 multi-modal translation shared task [Specia et al., 2016]. The training set consists of 30,000 sentence pairs that briefly describe Flickr captions and is often referred to as Multi30k. Due to the nature of image captions, this dataset contains sentences that are, on average, shorter and simpler than those from larger counterparts.
MT-Medium: 2016 version of the machine translation task prepared for the International Workshop on Spoken Language Translation [Cettolo et al., 2015]. The training set consists of 209,772 sentence pairs from transcribed TED presentations that cover a wide variety of topics with more conversational language than in the other two machine translation datasets.
MT-Large: News translation shared task from WMT 2017. The training set consists of roughly 7 million sentence pairs that comes from web crawl data, a news and commentary corpus, European Parliament proceedings, and European Union press releases.
Thanks a lot, man! This is really helpful. Besides seq2seq, I’m also interested in tasks where an output is generated after each time step, similar to the task of language modelling. I essentially want to be able to port this technique to the Biomedical NLP domain where labelled data is often hard to come by, and can possibly benefit a lot from pre-trained NLP models. One of the important problems there is accurate recognition of named entities like diseases, genes etc., problems which require a prediction for each token.
Btw, here’s a recent paper with 2 “classic” NER datasets (source: I am an NLP person :)). https://arxiv.org/pdf/1707.05928.pdf . But as Rudraksh can confirm, biomed NLP is slightly different - this is just in case you were interested in typical NER data.
Intuitionally, their approach makes a lot of sense. Vocabulary sizes in Biomedical datasets is often large but at the same time, there’s a lot of syntactic similarity between say words used for genes, chemicals etc. A character level CNN should be able to pick up on this.
The kind of applications I’m talking about requires a many-to-many formulation. The network outputs a prediction for each timestep. In addition to the papers mentioned above by @anamariapopescug and me, you can also look at this blog.