Yesterday I had the occasion of reading quite an interesting medium post about current state-of-the-art pretrained NLP architectures: https://towardsdatascience.com/transfer-learning-in-nlp-for-tweet-stance-classification-8ab014da8dde
The author compares ULMFiT vs. GPT (both PyTorch-based) vs. MITRE (an older ad-hoc model) on tweet stance classification (stance classification is quite a different task from sentiment classification, as duly explained in the article).
Looking at the results, I was somewhat puzzled. Indeed, ULMFiT was outperformed (in some cases heavily) not only by GPT but even by MITRE (2016). The author offers his considerations:
Looking at these results, it is once more clear that OpenAI GPT clearly outperforms ULMFiT on most topics […]
It is well known that ULMFiT produces state-of-the-art accuracy on various text classification benchmark datasets such as IMDb and AG News. The common theme across all these benchmark datasets is that they have really long sequence lengths (some of the reviews/news articles are hundreds of words long), so clearly, the language model used by ULMFiT fine-tunes really well for long sequences. Tweets, on the other hand, have a hard-limit of 140 characters, which are rather small sequences compared to full sentences from a movie review or a news article.
As described in the ULMFiT paper, the classifier uses “concatenated pooling” to help identify context in long sequences when a document contains hundreds or thousands of words. To avoid loss of information in case of really long sequences, the hidden state at the last time step is concatenated with both the max-pooled and mean-pooled representation of the hidden states, as shown below.
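To make the quoted passage concrete, here is a minimal sketch of concatenated pooling in plain PyTorch. This is my own illustration, not the actual fastai implementation; the function name and shapes are assumptions.

```python
import torch

def concat_pooling(hidden_states):
    """Concat pooling as described in the ULMFiT paper (illustrative sketch).

    hidden_states: (batch, seq_len, hidden_dim) outputs of the encoder RNN.
    Returns (batch, 3 * hidden_dim): the last time step's hidden state
    concatenated with the max-pooled and mean-pooled hidden states.
    """
    last = hidden_states[:, -1, :]               # hidden state at the last time step
    max_pool = hidden_states.max(dim=1).values   # max over the time dimension
    mean_pool = hidden_states.mean(dim=1)        # mean over the time dimension
    return torch.cat([last, max_pool, mean_pool], dim=1)

# Example: batch of 2 sequences, 10 time steps, hidden size 400
h = torch.randn(2, 10, 400)
pooled = concat_pooling(h)
assert pooled.shape == (2, 1200)
```

Note that for a very short tweet, `last`, `max_pool`, and `mean_pool` are computed over only a handful of time steps, so the three components carry largely redundant information; for a long document they summarize very different things.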
So I’d like to know Jeremy’s opinion about that.
Leaving aside MITRE, which is ad hoc and was trained on a massive amount of tweets, it has to be said that GPT was pretrained on a large corpus of books (BooksCorpus). As far as I can understand (but I understand very little), the culprit could be:
Concatenated pooling, which could be inefficient at extracting context from shorter text chunks.
Pretraining on WikiText-103. I don't really know the exact composition of the books corpus GPT was pretrained on, but I could hypothesize that books provide much richer language examples than Wikipedia. Wikipedia articles are written more or less in the same fashion, while with books you can have anything from super-formal registers to ordinary conversation (down to kids' adventure books).
I think that if (I said if) ULMFiT has issues handling short text, they should be fixed, since classifying short text is a task of paramount importance in NLP. [That said, I obtained an accuracy of 97.6% classifying Facebook posts in Italian, fine-tuning on a corpus of 25k posts (Facebook posts are longer, on average, than tweets, but admittedly not by much).]