I was wondering what would happen if I used spaCy’s POS tagger to tag each token in the IMDb corpus and then trained a language model and classifier on the result. Adding the POS tags got the language model up to 55.4% accuracy and 1.919 val loss after 15 epochs (it was still improving without overfitting, but I stopped it for the sake of time), yet I saw no improvement on the classifier (94.7% accuracy). Using POS tags alone (no other tokens whatsoever, reducing the vocab size from 60k to 76), I could get the language model up to 48.8% accuracy before overfitting (oddly, with a lower val loss of 1.543), but the classifier suffered, only reaching 78.04% accuracy. What I’m wondering is why the POS tagging appeared to improve the language model but did nothing to improve the classifier. My best guess is that the signal the POS tags provide for predicting the next word might not be so helpful for the classifier.
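For anyone curious what a setup like this might look like, here is a minimal sketch of one plausible tagging scheme: interleaving each token with a marker for its POS tag. The `xx` prefix and the toy hand-tagged pairs are my own assumptions (the original poster's exact scheme may differ); in practice the pairs would come from spaCy via `[(t.text, t.pos_) for t in nlp(text)]`.

```python
# Toy hand-tagged pairs standing in for spaCy output, so this runs without
# a downloaded spaCy model.
pairs = [("the", "DET"), ("movie", "NOUN"), ("was", "AUX"), ("great", "ADJ")]

def interleave(pairs):
    """Emit tag-marker then token, so the LM sees both streams in sequence."""
    out = []
    for tok, tag in pairs:
        out.append(f"xx{tag}")  # prefixed marker keeps tags distinct from real words
        out.append(tok)
    return out

print(" ".join(interleave(pairs)))
# → xxDET the xxNOUN movie xxAUX was xxADJ great
```

This roughly doubles the sequence length, which is one thing to keep in mind when comparing training times against a plain-token baseline.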
That’s an interesting approach I’ve thought about as well. Thanks for sharing your results.
While using the POS tagger might not have improved the classifier, it definitely looks like it improved your LM. And it would seem reasonable to me that such document vectors might give you improved results on other tasks like identifying semantically similar documents with a KNN.
You might also be interested in the results folks are getting using frameworks like SentencePiece to tokenize (I think there are some posts on the forum). If I remember correctly, the results were pretty good.
It seems to me that POS tagging will only help a classifier where there are words that have multiple potential parts of speech, and where the classification result changes depending on the POS.
I’m struggling to think of an example where this is the case. I’m sure there are some around, but I suspect they are pretty rare. If so, it isn’t entirely surprising that classification performance doesn’t improve.
I’ll definitely take a look at SentencePiece, thanks!
I figured POS tags would add exactly that context for something like a review written by an average person and posted on a website — reviews aren’t exactly edited and can be ambiguous — and might boost the classifier’s performance because of that. The weird thing (to me at least) is that the POS tags alone were actually enough information for the classifier to get to almost 80% accuracy with a vocab of 76. I haven’t done any comparison to see how accurate the actual POS tagging might be. Chances are there’s just way more information in a 60k token vocab vs. a 76 token vocab.
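To make the vocab collapse concrete, here’s a small sketch of the POS-only variant: the words are dropped entirely and only the tag sequence is kept. The hand-tagged pairs below are illustrative stand-ins for spaCy output (`token.pos_`), not the actual corpus.

```python
# Toy hand-tagged review fragment; in practice these come from spaCy.
pairs = [("I", "PRON"), ("loved", "VERB"), ("this", "DET"),
         ("film", "NOUN"), (",", "PUNCT"), ("loved", "VERB"), ("it", "PRON")]

# Keep only the tags: the sequence length is unchanged, but the
# vocabulary shrinks from tens of thousands of words to a few dozen tags.
tags_only = [tag for _, tag in pairs]
print(tags_only)
print(len(set(tags_only)))  # distinct symbols in this fragment
```

With so few symbols the language model’s next-token prediction task gets much easier, which would be consistent with the lower val loss despite the classifier losing most of its lexical signal.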
Does anyone have an update on doing POS tagging as a downstream task with ULMFiT?

Recently I trained ULMFiT on an underrepresented language.

However, for fine-tuning on a downstream task I didn’t have any sort of text classification dataset, but I do have a POS tagging dataset. I wanted to fine-tune ULMFiT and use it as a classifier for POS tagging, but unfortunately fastai doesn’t support POS tagging.

Any help would be highly appreciated.