Kaggle NLP Competition - Toxic Comment Classification Challenge

Yes. Thanks for finding it useful.

1 Like

Huge congrats to @sermakarevich for a gold medal finish!!!

Hopefully, you are planning to do a blog post with full write-up :slight_smile:

4 Likes

Thanks @jamesrequa. I am a very bad writer, but I promise I will find time to describe the main ideas. In short, those are:

  • model diversity: logistic regression with Bayesian features on words, chars, and words+chars, plus CNN, DPCNN, HAN, GRU, LSTM, and Capsule networks
  • text cleaning: expanding contractions, substituting words that were not found in the crawl embeddings, lowercasing, lemmatisation, removing special symbols
  • local 10-fold cross-validation with OOF train set predictions: scores matched up to the 4th digit, so we could blend efficiently and select the best submission (+12 positions on the private LB); a rough sketch of the OOF setup is below
  • and a long, long, long list of what did not work
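For anyone curious what the OOF setup looks like in practice, here is a rough sketch (my own illustration rather than our actual pipeline; names like X_words, y_toxic, and test_gru are made up):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def oof_predictions(model, X, y, X_test, n_splits=10, seed=42):
    """Out-of-fold predictions for the train set plus fold-averaged test predictions."""
    oof = np.zeros(len(y))
    test_pred = np.zeros(X_test.shape[0])
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_pred

# One binary label at a time (the competition has six):
# oof_lr, test_lr = oof_predictions(LogisticRegression(), X_words, y_toxic, X_test_words)
# print("local CV AUC:", roc_auc_score(y_toxic, oof_lr))
# blend = 0.5 * test_lr + 0.5 * test_gru   # blend weights chosen on the OOF predictions
```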

Lessons I've learned: think more, read a lot, model little, tune even less.

22 Likes

Congrats @sermakarevich. Really exceptional work and standing given how competitive this competition was.

I only ended about 2,000 or so places behind you so I'm looking forward to learning more from ya. In the meantime, I'll be toasting you with a glass of some Macallan 12 year tonight :grinning:

Hi @wgpubs
now that the competition is over, would you mind sharing your code (or maybe just the main nn.Module)? I tried the same approach (among others, like replicating lesson 4); basically I tried to replicate some Keras Kaggle kernels in pytorch/fastai, but could not get past 0.72 on the leaderboard, so there must be something wrong with my usage of pytorch. I suspect that I didn't correctly connect the LSTM to the linear classifier.

Did you do a lot of preprocessing, or have to do a lot of hyperparameter tuning, to get into the 0.95+ range on the leaderboard?
Thanks a lot!
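For reference while waiting for the gist, this is roughly the wiring I mean, a minimal sketch of an LSTM feeding a linear classifier in plain pytorch (illustration only, not @wgpubs's model or working competition code):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, n_labels=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=1)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_labels)

    def forward(self, x):                 # x: (batch, seq_len) of token ids
        emb = self.emb(x)                 # (batch, seq_len, emb_dim)
        output, _ = self.lstm(emb)        # (batch, seq_len, 2 * hidden_dim)
        pooled = output.max(dim=1)[0]     # max-pool over the time dimension
        return self.out(pooled)           # raw logits, paired with BCEWithLogitsLoss
```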

Absolutely!

Give me a day or two to clean things up and I'll post a gist here. If you don't see anything from me by Sunday, ping me.

-wg

3 Likes

This is pure gold, I think :slight_smile: Thx for sharing your approach @sermakarevich!

1 Like

Congrats everyone.
This was a great learning experience. After being on Kaggle for more than a year, I participated in my first Kaggle competition and finally got my first medal.
Congrats @sermakarevich for the gold. Keep up the good work.

Congratulations, @sermakarevich!!

This competition was definitely a challenging and frustrating one for me, and there were moments I felt like "maybe this is not for me" (having a bad cold for the last 2 weeks of the competition did not help at all). But I trust Jeremy's comment about tenaciousness, perseverance, and stubbornness, and I am back at studying the winners' solutions :muscle:

Amazing work, everybody!!

3 Likes

Here ya go guys and gals …

I'm sure I'm doing plenty wrong and I'm sure there is much improvement that can be made … so, any feedback is greatly appreciated.

5 Likes

Presentation from today's talk at my company about this competition: NLP.pdf (3.0 MB) Prohibited for children.

11 Likes

I would love to see your presentation video. :joy::joy:

Thanks a lot for sharing!
It turns out what blocked me at 0.7 was a very silly mistake. For some reason, I had convinced myself that Kaggle was rounding the predictions to 1 or 0 before scoring, so I did the rounding myself to be able to tune the threshold. I pretty much tried changing everything in my code, including the whole network, but I never tried not rounding the predictions :confused:
Submitting the predictions as-is scored in the high 0.9s; still a bit low on the leaderboard, but your notebook gives me a lot of leads to improve.
So, finally, a bit disappointed to be in the 'top 95%' :joy: but at least I got the deep learning part mostly right, and I learnt a lot and got comfortable with pytorch and fastai in the process, which was the goal after all.
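In case anyone else makes the same mistake: the competition metric is ROC AUC, which only looks at how the predictions are ranked, so thresholding them to 0/1 throws the ranking away. A tiny made-up example:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]
probs  = [0.10, 0.30, 0.40, 0.45, 0.90]   # every positive is ranked above every negative

print(roc_auc_score(y_true, probs))                       # 1.0
print(roc_auc_score(y_true, [round(p) for p in probs]))   # ~0.67 after thresholding at 0.5
```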

Out of curiosity, is there a specific reason you didn't use the RNN_Learner? (to be able to use lr_find and fit, etc.)

I had two primary goals:

  1. Learn how to build and debug my own pytorch model that also took advantage of the fastai framework's LR annealing, etc.

  2. See what the best score I could get with a single model was

Glad you figured things out!

Found this notebook by @dromosys and I'm wondering what went wrong here?

https://www.kaggle.com/dromosys/fast-ai-toxic-comment?scriptVersionId=10226955

Why is the accuracy only around 0.5? Did something go wrong during training or inference?
I'd appreciate any pointers!

I found the error:
learner.get_preds() requires ordered=True; otherwise the predictions are ordered by the sampler (from the longest text to the shortest), which doesn't match the order of the test.csv file.
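For reference, the call looks roughly like this (a sketch assuming the fastai v1 text API used in that kernel, with learner coming from the notebook):

```python
from fastai.text import *   # fastai v1; brings DatasetType into scope

# ordered=True returns predictions in the order of test.csv instead of the
# length-sorted order the sampler uses internally
preds, _ = learner.get_preds(ds_type=DatasetType.Test, ordered=True)
```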