With a little more work, I get to 84.7% accuracy pretty solidly (and an F1 score of 0.8 — again, how does that compare to a simple Naive Bayes baseline? I have no idea.)
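For anyone who wants the missing baseline number: here is a minimal hand-rolled multinomial Naive Bayes sketch (Laplace smoothing, whitespace tokens). The tweets and labels below are made up for illustration, not from the real dataset — swap in your own train/validation split to get an actual baseline figure.

```python
import math
from collections import Counter, defaultdict

def train_nb(texts, labels, alpha=1.0):
    """Train multinomial Naive Bayes with Laplace (add-alpha) smoothing."""
    word_counts = defaultdict(Counter)  # label -> token -> count
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab, alpha

def predict_nb(model, text):
    word_counts, label_counts, vocab, alpha = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + sum of smoothed log likelihoods
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for token in text.lower().split():
            score += math.log((word_counts[label][token] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy example data (invented, not the real tweet corpus)
texts = ["me encanta este dia", "odio el trafico",
         "que gran partido", "terrible servicio al cliente"]
labels = ["pos", "neg", "pos", "neg"]

model = train_nb(texts, labels)
print(predict_nb(model, "me encanta el partido"))  # → pos
```

Nothing fancy, but it pins down the "is my fancy model actually better?" question in a few lines.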
The confusion matrix looks OK, with fewer xxunk tokens from spaCy after updating to the "tweet vocab." I don't speak Spanish, but here are some of the top-error tweets as well (second image below). Moving to SentencePiece will increase coverage and reduce the xxunk rate to nearly zero. That should not be difficult to build out, so I will try it next (because it is easy!). After that, moving to attention-based models is probably sensible, either with BERT for embeddings or Transformer-XL (although the texts are not very long…)
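To show why subword models drive xxunk to near zero: any string can be segmented down to single characters, so nothing is ever out-of-vocabulary. Here is a toy greedy longest-match segmenter with character fallback — this is not SentencePiece's actual algorithm (which learns its vocab from the corpus), just an illustration of the coverage property, with an invented subword vocab.

```python
def segment(word, vocab):
    """Greedy longest-match subword segmentation with character fallback."""
    pieces = []
    i = 0
    while i < len(word):
        # try the longest vocab entry starting at i; a single
        # character always matches, so segmentation never fails
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                pieces.append(piece)
                i = j
                break
    return pieces

# Tiny made-up subword vocab; SentencePiece would learn this from data
vocab = {"tw", "ee", "tweet", "lan", "guage", "mo", "del"}

print(segment("tweet", vocab))          # ['tweet']
print(segment("languagemodel", vocab))  # ['lan', 'guage', 'mo', 'del']
print(segment("xyzzy", vocab))          # characters only, but never unknown
```

The same property holds for hashtags, typos, and Spanish words the spaCy vocab has never seen — they just come out as smaller pieces instead of xxunk.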
I am running my tweet collector for a while longer today. I currently have 100k tweets, but each one is only a small amount of text, so that is not really much data. I think with a large enough tweet corpus, we could just train on that from scratch. Any thoughts there? I am open to ideas about what to try next.
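As a back-of-the-envelope check on "is 100k tweets enough to train from scratch": the tokens-per-tweet figure below is an assumption on my part, and WikiText-103 (~103M tokens) is just the usual pretraining reference point for comparison.

```python
# Rough corpus sizing; avg_tokens_per_tweet is a guess, not measured.
n_tweets = 100_000
avg_tokens_per_tweet = 15          # assumed; tweets are short

corpus_tokens = n_tweets * avg_tokens_per_tweet
wikitext103_tokens = 103_000_000   # approx. size of WikiText-103

print(f"tweet corpus: ~{corpus_tokens:,} tokens")
print(f"fraction of WikiText-103: {corpus_tokens / wikitext103_tokens:.2%}")
```

Even with generous assumptions that lands around 1–2% of a typical LM pretraining corpus, which is why I suspect we need a much larger tweet dump before training from scratch is worth trying.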
I will clean up the code, put it in a repo, and then add the link here. That will take a small bit of time, but I will do it so others can replicate the results and move forward.