NLP challenge project

One thing I’m currently testing is using only the train and test splits of the Haha data as the vocab. My thought is that generating the vocab that way will give me the vocabulary best matched to this competition. Curious to hear thoughts on this.

One concern with doing it that way is that I may lose words that aren’t in that limited 30k-tweet corpus, and I definitely think you want those words in your vocab. So even if you start with the Haha vocab and then add on the top 30k tokens that aren’t already in it to create your full vocab, I think that would probably give you a better model while reducing the xxunk tokens I’m seeing.
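The merge described above can be sketched in plain Python. This is just an illustration of the idea (whitespace tokenization and the function name are mine; in practice you’d use fastai’s tokenizer and frequency cutoffs):

```python
from collections import Counter

def build_vocab(haha_texts, general_texts, max_general=30_000):
    """Start from every token in the Haha train+test tweets, then top up
    with the most frequent general-corpus tokens not already included,
    so competition-specific words are never dropped."""
    haha_vocab = {tok for text in haha_texts for tok in text.split()}
    general_counts = Counter(tok for text in general_texts for tok in text.split())
    extras = [tok for tok, _ in general_counts.most_common() if tok not in haha_vocab]
    return sorted(haha_vocab) + extras[:max_general]
```

Every Haha token survives, and the general corpus only fills in the remaining slots, which should cut down on xxunk at inference time.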

This has been a great project for getting more comfortable with NLP. I’m really hoping to churn out a few more decent results before the end of the competition, but I’m also pretty happy with where I am at the moment: I don’t speak any meaningful Spanish, yet I can tell you with 81% accuracy whether a Spanish tweet is funny. That’s pretty awesome, although I hope my next NLP project will be back in English so I can validate without Google Translate!


@kevinb I admit I have not tried to read a single tweet yet for this! I also don’t speak Spanish! Like you, I have learned a LOT by building an application and really gotten into the details.

@adilism See you on there! You jumped to first! Very cool!! :medal_sports:

It’s kind of nice because it makes you focus on the process rather than the actual words and whether you find something funny. One downside is that I won’t be able to do as much analysis on the tokens after the fact, which is kind of a bummer.


@bfarzin thank you!

The score is an ensemble of five BERT models from five-fold cross-validation. The models are based on pytorch-pretrained-BERT with a bit of optimization and regularization on top - a pipeline similar to the one I am using for the Toxic comment classification challenge currently running on Kaggle.
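The fold-averaging part of that ensemble is simple to sketch in numpy (names are illustrative; each entry of `fold_probs` would be one fold’s predicted class probabilities on the test set):

```python
import numpy as np

def ensemble_folds(fold_probs):
    """Average the class probabilities predicted on the test set by the
    model from each CV fold; the argmax of the mean gives the final label."""
    mean_probs = np.mean(np.stack(fold_probs), axis=0)
    return mean_probs, mean_probs.argmax(axis=1)
```

Averaging probabilities (rather than majority-voting hard labels) keeps the ensemble’s confidence information, which also makes later blending with other models easier.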

The pretrained model is BERT-Base, Multilingual Cased. I haven’t done any text preprocessing or language-model finetuning, so there should still be some room for improvement!


I expect if you ensemble BERT and ULMFiT models together you might see an improvement.


Yes, I would expect so too.

In the meantime, unsupervised finetuning of the language model on the train+test data, similar to ULMFiT, pushed the score from 0.807 to 0.815.


I had totally forgotten to do the finetuning step. I did that and got to 0.808 (enough to beat your old score, but not the newer one!). I am going back to refine my LM training and size to see if I can pass the high score. We’ve all come a long way from the first submission!


@kevinb See you there also now! So cool!!


Thanks @bfarzin! @hiromi and I have been working together on it. We finally added some BERT into our solution, and that definitely brought our ensemble up some, but there were a few things that all contributed to our jump. I definitely wish we had more time and more submissions, but I’m also glad that this is wrapping up.


I wanted to say thank you to this forum, and in particular to @kevinb, @hiromi and @adilism, for moving up the board over the past few days. It pushed me to do more, learn more and review everything I did, and I have a deeper understanding of it all because of that! This is such a great community, and I feel fortunate to have gotten to participate in this competition!


Thank YOU for being our inspiration and motivating us throughout the competition :slightly_smiling_face: It definitely was a great learning experience, and I’m happy that it ended on a high note. As Jeremy says, it gave me the little push I needed to endure the next few months of the hair-pulling, frustrating ML process!

Hope to see you in another competition!!


That was a lot of fun. Thank you for pushing us as well. It is great to see three fastai teams taking the top three positions in both classification and regression. I definitely have a much deeper understanding of ULMFiT after the competition. I didn’t think the language model would have nearly as much of an impact as it did.

I was also really surprised by how well the regressor was able to pick out the 0s vs. the non-zeros. I was convinced that we should tie the not-funny predictions to 0 on the regressor, but then Hiromi trained a model on all of the data, including the 0s, and it blew that approach out of the water. Trying blah turns out to be pretty solid advice on something like this. I am planning to write a blog post with my lessons learned from this competition.


To briefly describe my solution:

  1. Fitting multilingual BERT, training for four epochs and averaging predictions over five folds gave 0.807.
  2. Finetuning the LM on the train and test data increased the score to 0.815.
  3. A weighted average of (2) and @jeremy’s Naive Bayes model scored 0.821, with the weights coming from cross-validation.
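Step 3’s weight search can be sketched as a simple grid search over the blend on held-out validation data. This is my own illustration, not the actual pipeline, and it scores accuracy where the competition used F1, just to keep it short:

```python
import numpy as np

def best_blend_weight(p_bert, p_nb, y_val, grid=np.linspace(0, 1, 101)):
    """Grid-search the weight w for the blend w*p_bert + (1-w)*p_nb
    that maximises validation accuracy."""
    def accuracy(w):
        preds = ((w * p_bert + (1 - w) * p_nb) > 0.5).astype(int)
        return np.mean(preds == y_val)
    return max(grid, key=accuracy)
```

In practice you would average the chosen weight over the CV folds rather than trusting a single validation split.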

For the regression task I ensembled the BERT predictions with a LightGBM model using the same features as in the NB model - this also gave a good push to the score. I hope to share the code in the next few weeks.

Thank you guys, and especially @bfarzin, for all the sharing and discussions!


I was also surprised/impressed by how much the LM pre-training mattered. I did limited LM training to start, and it was OK, but more LM training plus more fine-tuning got me from around 0.78 to 0.81. I will write at least one blog post on it and also link to my paper in this forum for future reference, for anyone who wants to know more.

To summarize my solution:

  1. Build a Spanish Twitter corpus; train a SentencePiece tokenizer with byte-pair encoding; build the LM from scratch [15 epochs, about 4 hours]
  2. Fine-tune the LM on the train and test language data [15 epochs, <10 minutes]
  3. Fit the output models [5 mins per fold and seed * 5 folds * 20 seeds = 8 hours]
    3a. Fit the classifier model with 5-fold cross-validation across 20 random seeds. Select the best seed on validation for submission.
    3b. Fit the regression model (fill N/A with zero) and repeat as in (3a)
  4. Ensemble across the 5 folds for the single best seed (as measured by 5-fold validation error over the 20 possible seeds). This got me to 0.81.
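The seed selection in steps 3a/4 might look like this in numpy (the function name and dict layout are hypothetical, but the logic matches the description: pick the seed with the best mean 5-fold validation score, then average that seed’s fold predictions):

```python
import numpy as np

def pick_best_seed(val_scores, test_preds):
    """val_scores: {seed: [validation score per fold]} (higher is better);
    test_preds:  {seed: [per-fold test predictions]}.
    Returns the best seed and its fold-averaged test predictions."""
    best = max(val_scores, key=lambda s: np.mean(val_scores[s]))
    return best, np.mean(np.stack(test_preds[best]), axis=0)
```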

Hey @adilism, do you have a Twitter handle?

Hey Kevin - it’s @adilzhaismailov but I don’t use it that much


I am curious - have you tried label smoothing and UDA at the end and did they work for you?

I didn’t find you when I searched on Twitter. Can you double check?

I used LabelSmoothing for both the LM and the classifier. For the classifier, LabelSmoothing allowed me to train without gradual unfreezing.
I did not get UDA working in time. I did try MixUp after the embedding layer and got pretty much the same outputs. I am curious whether MixUp would work well for the LM, and I want to experiment with that.
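For anyone unfamiliar, label smoothing replaces the one-hot target with a softened distribution before taking the cross-entropy. A minimal numpy sketch of the idea (my own illustration, not the fastai implementation; `eps` is the smoothing amount):

```python
import numpy as np

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross-entropy against smoothed targets: spread eps uniformly over
    all classes and give the true class the remaining 1 - eps mass."""
    n_classes = logits.shape[-1]
    # log-softmax via the log-sum-exp of the logits
    log_probs = logits - np.log(np.sum(np.exp(logits), axis=-1, keepdims=True))
    smooth = np.full_like(log_probs, eps / n_classes)
    smooth[np.arange(len(target)), target] += 1 - eps
    return -np.mean(np.sum(smooth * log_probs, axis=-1))
```

With `eps=0` this reduces to ordinary cross-entropy; with `eps>0` the model is penalized for being over-confident, which is plausibly why it let the classifier train without gradual unfreezing.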