One thing I’m currently testing is building the vocab from only the train and test sets of the HAHA data. My thinking is that generating the vocab this way will give me the vocab best suited to this competition. Curious to hear thoughts on this.
One concern with doing this is that I’d lose words that aren’t in that limited 30k-tweet corpus, and I definitely want those words available. So even if I start with the HAHA vocab and then append the top 30k tokens from the larger corpus that aren’t already in it to create the full vocab, I think that would probably give me a better model while reducing the xxunk tokens I’m seeing.
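Concretely, something like the sketch below is what I have in mind. This is a minimal plain-Python version, where `haha_tokens` and `general_tokens` are hypothetical flat token lists (from the HAHA tweets and the larger LM corpus respectively), not actual variables from my notebook:

```python
from collections import Counter

def build_merged_vocab(haha_tokens, general_tokens, top_n=30000):
    # Start with every token in the HAHA corpus, ordered by frequency
    # so the most common competition-specific tokens come first.
    haha_counts = Counter(haha_tokens)
    vocab = [tok for tok, _ in haha_counts.most_common()]
    seen = set(vocab)

    # Append up to top_n of the most frequent general-corpus tokens
    # that the HAHA corpus doesn't already cover, so words missing
    # from the 30k tweets don't all collapse to xxunk.
    general_counts = Counter(general_tokens)
    added = 0
    for tok, _ in general_counts.most_common():
        if added == top_n:
            break
        if tok not in seen:
            vocab.append(tok)
            seen.add(tok)
            added += 1
    return vocab
```

In a fastai pipeline you’d then pass this merged list in as the vocab when tokenizing/numericalizing, instead of letting the library build one from a single corpus.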
This has been a great project for getting more comfortable with NLP. I’m really hoping to churn out a few more results that don’t suck before the end of the competition, but I’m also pretty happy with where I am at the moment: I don’t speak any meaningful Spanish, and I can tell you with 81% accuracy whether a Spanish tweet is funny. That’s pretty awesome, although I think my next NLP project will hopefully be back in English so I can validate without Google Translate!