Kaggle NLP Competition - Toxic Comment Classification Challenge

Yes. Thanks for finding it useful.

1 Like

Huge congrats to @sermakarevich for a gold medal finish!!!

Hopefully, you are planning to do a blog post with full write-up :slight_smile:

4 Likes

Thanks @jamesrequa. I am a very bad writer, but I promise I will find time to describe the main ideas. In short, those are:

  • model diversity: logistic regression with Bayesian features on words, chars, and words+chars, plus CNN, DPCNN, HAN, GRU, LSTM, and Capsule networks
  • text cleaning: expanding contractions, substituting words that were not found in the crawl embeddings, lowercasing, lemmatisation, removing special symbols
  • local 10-fold cross-validation with OOF train set predictions: scores matched up to the 4th digit, so we could blend efficiently and select the best submission (+12 positions on the private LB); a rough sketch of the OOF setup is below
  • and a long, long, long list of what did not work
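For anyone curious what the OOF setup looks like in practice, here is a rough sketch (my own illustration rather than our actual pipeline; names like X_words, y_toxic, and test_gru are made up):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def oof_predictions(model, X, y, X_test, n_splits=10, seed=42):
    """Out-of-fold predictions for the train set plus fold-averaged test predictions."""
    oof = np.zeros(len(y))
    test_pred = np.zeros(X_test.shape[0])
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    return oof, test_pred

# One binary label at a time (the competition has six):
# oof_lr, test_lr = oof_predictions(LogisticRegression(), X_words, y_toxic, X_test_words)
# print("local CV AUC:", roc_auc_score(y_toxic, oof_lr))
# blend = 0.5 * test_lr + 0.5 * test_gru   # blend weights chosen on the OOF predictions
```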

Lessons I've learned: think more, read a lot, model little, tune even less.

22 Likes

Congrats @sermakarevich. Really exceptional work and standing given how competitive this competition was.

I only ended about 2,000 or so places behind you so I'm looking forward to learning more from ya. In the meantime, I'll be toasting you with a glass of some Macallan 12 year tonight :grinning:

Hi @wgpubs
now that the competition is over, would you mind sharing your code (or maybe just the main nn.Module)? I tried the same approach (among others, like replicating lesson 4); basically I tried to replicate some Keras Kaggle kernels in pytorch/fastai, but could not get past 0.72 on the leaderboard, so there must be something wrong with my usage of pytorch. I suspect that I didn't correctly connect the LSTM to the linear classifier.

Did you do a lot of preprocessing, or have to do a lot of hyperparameter tuning, to get into the 0.95+ range on the leaderboard?
Thanks a lot!
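For reference while waiting for the gist, this is roughly the wiring I mean, a minimal sketch of an LSTM feeding a linear classifier in plain pytorch (illustration only, not @wgpubs's model or working competition code):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=128, n_labels=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=1)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_labels)

    def forward(self, x):                 # x: (batch, seq_len) of token ids
        emb = self.emb(x)                 # (batch, seq_len, emb_dim)
        output, _ = self.lstm(emb)        # (batch, seq_len, 2 * hidden_dim)
        pooled = output.max(dim=1)[0]     # max-pool over the time dimension
        return self.out(pooled)           # raw logits, paired with BCEWithLogitsLoss
```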

Absolutely!

Give me a day or two to clean things up and I'll post a gist here. If you don't see anything from me by Sunday, ping me.

-wg

3 Likes

This is pure gold, I think :slight_smile: Thx for sharing your approach @sermakarevich!

1 Like

Congrats everyone.
This was a great learning experience. After being on Kaggle for more than a year, I participated in my first Kaggle competition and finally got my first medal.
Congrats @sermakarevich for the gold. Keep up the good work.

Congratulations, @sermakarevich!!

This competition was definitely a challenging and frustrating one for me, and there were moments I felt like "maybe this is not for me" (having a bad cold for the last 2 weeks of the competition did not help at all). But I trust Jeremy's comment about tenaciousness, perseverance, and stubbornness, and I am back at studying the winners' solutions :muscle:

Amazing work, everybody!!

3 Likes

Here ya go guys and gals …

I'm sure I'm doing plenty wrong and I'm sure there is much improvement that can be made … so, any feedback is greatly appreciated.

5 Likes

Presentation from today's talk at my company about this competition: NLP.pdf (3.0 MB) Prohibited for children.

11 Likes

I would love to see your presentation video. :joy::joy:

Thanks a lot for sharing!
It turns out what blocked me at 0.7 was a very silly mistake. For some reason, I had convinced myself that Kaggle was rounding the predictions to 1 or 0 before scoring, so I did the rounding myself to be able to tune the threshold. I pretty much tried changing everything in my code, including the whole network, but I never tried not rounding the predictions :confused:
Submitting the predictions as-is scored in the high 0.9s; still a bit low on the leaderboard, but your notebook gives me a lot of leads to improve.
So, finally, a bit disappointed to be in the 'top 95%' :joy: but at least I got the deep learning part mostly right, and I learnt a lot and got comfortable with pytorch and fastai in the process, which was the goal after all.
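In case anyone else makes the same mistake: the competition metric is ROC AUC, which only looks at how the predictions are ranked, so thresholding them to 0/1 throws the ranking away. A tiny made-up example:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]
probs  = [0.10, 0.30, 0.40, 0.45, 0.90]   # every positive is ranked above every negative

print(roc_auc_score(y_true, probs))                       # 1.0
print(roc_auc_score(y_true, [round(p) for p in probs]))   # ~0.67 after thresholding at 0.5
```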

Out of curiosity, is there a specific reason you didn't use the RNN_Learner? (to be able to use lr_find and fit, etc.)

I had two primary goals:

  1. Learn how to build and debug my own pytorch model that also took advantage of the fastai framework's LR annealing, etc.

  2. See what the best score I could get with a single model was

Glad you figured things out!

Found this notebook by @dromosys and I'm wondering what went wrong here?

https://www.kaggle.com/dromosys/fast-ai-toxic-comment?scriptVersionId=10226955

Why is the accuracy only around 0.5? Did something go wrong during training or inference?
I'd appreciate any pointers!

I found the error:
learner.get_preds() requires ordered=True; otherwise the predictions are ordered by the sampler (from the longest text to the shortest), which doesn't match the order of the test.csv file.
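For reference, the call looks roughly like this (a sketch assuming the fastai v1 text API used in that kernel, with learner coming from the notebook):

```python
from fastai.text import *   # fastai v1; brings DatasetType into scope

# ordered=True returns predictions in the order of test.csv instead of the
# length-sorted order the sampler uses internally
preds, _ = learner.get_preds(ds_type=DatasetType.Test, ordered=True)
```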