Kaggle NLP Competition - Toxic Comment Classification Challenge

For those using fastai for this, you’ll find this much easier using fastai.text instead of fastai.nlp. Bad news is it is undocumented so far - sorry! I’ll try to write something up soon-ish…

10 Likes

The problem was exactly what @radek suggested.

I am using plain old torchtext.data.Iterator for my test dataset, but after looking at the source code for it I noticed that I had to either pass train=False or shuffle=False in the constructor.
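For anyone hitting the same thing, a minimal sketch (the dataset variable and batch size are placeholders):

```python
from torchtext import data

# `test_ds` and the batch size are placeholders for illustration.
# train=False flips the Iterator defaults to shuffle=False and repeat=False,
# so predictions come out in the same order as the submission ids.
test_iter = data.Iterator(
    test_ds,
    batch_size=64,
    train=False,   # or pass shuffle=False explicitly
    sort=False,    # and don't reorder examples by length either
)
```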

@jeremy, I actually was able to use the new fastai.text code for language modeling. Interestingly enough, it seemed to train faster and get better results than the fastai.nlp code on the same imdb dataset from lesson 4. However, I couldn’t figure out how to do the sentiment analysis part with the new fastai.text code.

6 Likes

I joined this competition really late - sometime last week. Can’t decide if it is a blessing or a curse that I joined so late :smiley: Having boatloads of fun, but the limited time frame makes for some interesting trade-offs :slight_smile: And I still would like to review a couple of things before part 2 starts.

Anyhow - wanted to say I will try training the language model from scratch if time permits, fastai way. It would also be interesting to test drive the RNN from Merity et al. that we get in the fastai lib :slight_smile: It seems 100 chars is the magic sequence length after which training gets really weird with RNNs; I wonder how this one behaves.

All the best to everyone else participating in the comp - I can hardly believe how much fun this competition is!

3 Likes

So I barely know what I’m doing… wow I say this a lot, but it’s true, especially when it comes to machine learning… this was my first Kaggle project and I’m going to try to improve my results as I begin to understand more about multilabel classification, NLP, etc. In case there are people interested in seeing some middle-of-the-road/ right-lane/ veering-off-into-the-shoulder type of work, here’s what I’m doing, and I’ll continue to update this repo as I figure out how to improve results:

I would love it if anyone has any tips, questions, or wants to work with me on this.

I’m in 2953rd place!

4 Likes

Hi treyko, could you provide a little more detail on how you modified PoolingLinearClassifier to output the sigmoid for six output units? This is the first time I am trying to modify a NN and I am a little lost on how to create the output layer. I looked at the implementation of PoolingLinearClassifier but the tensor indexing/operations seem like magic. I looked at the link Question on labeling text for sentiment analysis but I am unsure of how to set up a custom classifier.
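From what I can piece together, the core idea looks roughly like this (my own names and guesses, not the actual fastai implementation):

```python
import torch
import torch.nn as nn

class SixLabelHead(nn.Module):
    """My guess at a PoolingLinearClassifier-style head for six labels:
    concat-pool the RNN outputs, then a single linear layer + sigmoid."""
    def __init__(self, hidden_size, n_labels=6):
        super().__init__()
        # last hidden state + max pool + mean pool -> 3 * hidden_size
        self.fc = nn.Linear(3 * hidden_size, n_labels)

    def forward(self, rnn_out):
        # rnn_out: (seq_len, batch, hidden_size)
        last = rnn_out[-1]                  # final time step
        mx = rnn_out.max(dim=0)[0]          # max over time
        avg = rnn_out.mean(dim=0)           # mean over time
        x = torch.cat([last, mx, avg], dim=1)
        return torch.sigmoid(self.fc(x))    # six independent probabilities
```

If that’s roughly right, I’d pair it with nn.BCELoss (or drop the sigmoid and use nn.BCEWithLogitsLoss), since the six labels are independent rather than mutually exclusive.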

Guys… have you seen the current public LB and the latest kernels? The competition is becoming a blending competition now…

Very helpful to get started with NLP in PyTorch (PS: only for beginners :slight_smile:)

Link

2 Likes

It’s getting worse day by day. I think people are overfitting to the LB. The same submission is being used by many.

I am sure that, at least up to the 100th position, there is an exact correlation between local CV and the public leaderboard.

Hierarchical Attention Network in keras with Tensorflow backend https://www.kaggle.com/sermakarevich/hierarchical-attention-network
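The heart of it is attention pooling over the recurrent outputs. A flat, single-level sketch of just that mechanism (sizes are placeholders; the kernel has the real two-level hierarchy):

```python
from tensorflow.keras import Input, Model, layers
import tensorflow.keras.backend as K

# Vocab size, sequence length, and layer sizes are placeholders.
inp = Input(shape=(100,))
emb = layers.Embedding(30000, 300)(inp)
h = layers.Bidirectional(layers.GRU(64, return_sequences=True))(emb)

# one score per time step, softmax-normalised over the sequence
scores = layers.Dense(1, activation="tanh")(h)
weights = layers.Softmax(axis=1)(scores)

# weighted sum of the hidden states = attention-pooled sentence vector
context = layers.Lambda(lambda t: K.sum(t[0] * t[1], axis=1))([h, weights])
out = layers.Dense(6, activation="sigmoid")(context)

model = Model(inp, out)
model.compile(loss="binary_crossentropy", optimizer="adam")
```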

3 Likes

My latest vanilla pytorch code scored 0.9322 using a BiLSTM network.

Like others have mentioned, you really need to pay attention to how the test dataset is being processed.

In any case, I’m rather happy with the results, since training loss was around 0.066, so not far off from the final outcome :slight_smile:

Using this as a reference: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

Prior to this I was referring a lot to wgpubs’ code with little success. I think I managed to fix the randomized predictions, but the Kaggle score was just around 0.57. Might need more troubleshooting.

5 Likes

My bidirectional GRU model with fastText embeddings scored 0.9847 on the public leaderboard -

https://www.kaggle.com/atikur/simple-gru-with-fasttext-lb-0-9847

I trained a few similar models with different pretrained word embeddings and got similar results.

I also tried more complex architectures, but didn’t get any improvements. Any suggestions would be greatly appreciated.
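For context, the gist of the architecture, stripped down (the kernel has the full preprocessing; `embedding_matrix` here stands in for the pretrained fastText vectors):

```python
from tensorflow.keras import Input, Model, layers

# Sizes match the kernel's general shape but are illustrative, not exact.
inp = Input(shape=(100,))
x = layers.Embedding(30000, 300, weights=[embedding_matrix],
                     trainable=False)(inp)          # frozen fastText vectors
x = layers.SpatialDropout1D(0.2)(x)
x = layers.Bidirectional(layers.GRU(80, return_sequences=True))(x)
x = layers.concatenate([layers.GlobalAveragePooling1D()(x),
                        layers.GlobalMaxPooling1D()(x)])
out = layers.Dense(6, activation="sigmoid")(x)      # six independent labels

model = Model(inp, out)
model.compile(loss="binary_crossentropy", optimizer="adam")
```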

4 Likes

Focus more on preprocessing, try 10-fold validation, and use pretrained embeddings. With these you should be able to cross a single-model score of 0.986+ (rough fold sketch below).
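Roughly like this for the folds (everything here - `build_model`, `X`, `y`, `X_test` - is a placeholder):

```python
import numpy as np
from sklearn.model_selection import KFold

# 10-fold out-of-fold (OOF) training: each fold's validation predictions
# are kept for blending later; test predictions are averaged across folds.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
oof = np.zeros((len(X), 6))
test_pred = np.zeros((len(X_test), 6))

for train_idx, val_idx in kf.split(X):
    model = build_model()                       # placeholder model factory
    model.fit(X[train_idx], y[train_idx])
    oof[val_idx] = model.predict(X[val_idx])    # out-of-fold predictions
    test_pred += model.predict(X_test) / kf.n_splits
```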

2 Likes

Thanks for the hints. Were you able to cross 0.986+ with a single model?

Yes. A simple bidirectional GRU with good preprocessing and fastText embeddings can easily give you 0.986+. I am struggling with what to do next. I’m trying averaging and ensembling, but without much success.

3 Likes

So preprocessing is the key, as I tried everything else and the best score I got is 0.9852. Everything beyond that (up to 0.9871) is blending based on train-set OOF predictions.
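What I mean by blending on OOF, roughly (the two models’ predictions and `y_train` are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Weight search on OOF predictions from two models (shape: n_train x 6).
# `oof_a`, `oof_b`, `y_train` come from a K-fold loop like the one above.
best_w, best_auc = 0.0, 0.0
for w in np.linspace(0, 1, 21):
    blend = w * oof_a + (1 - w) * oof_b
    auc = np.mean([roc_auc_score(y_train[:, i], blend[:, i])
                   for i in range(6)])          # mean column-wise AUC
    if auc > best_auc:
        best_w, best_auc = w, auc

# apply the same weight to the test-set predictions
final = best_w * test_a + (1 - best_w) * test_b
```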

For me, blending/ensembling gives a 3-5 point boost in the 4th decimal place. I am also using OOF predictions.

Do you also use the public blend? When added to that, it gives me a slight boost.

1 Like

I think the key here is to blend different models: RNN-type networks + LR on TF-IDF/Bayesian features + tree-based predictions.
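For the LR-on-TF-IDF leg, the classic recipe looks something like this (values are typical defaults, not a tuned setup; `train_text`, `test_text`, `y` are placeholders):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# One binary logistic regression per label on word TF-IDF features.
vec = TfidfVectorizer(max_features=50000, ngram_range=(1, 2),
                      sublinear_tf=True)
X = vec.fit_transform(train_text)
X_test = vec.transform(test_text)

preds = []
for i in range(6):                      # six independent labels
    clf = LogisticRegression(C=4.0, solver="liblinear")
    clf.fit(X, y[:, i])
    preds.append(clf.predict_proba(X_test)[:, 1])
```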

1 Like

What kind of pre-processing do you recommend? Any kernels we should be looking at?

I have a GRU (PyTorch, fastai, fastText) that scores 0.976. I’m doing very minimal pre-processing/cleanup, and I’m unsure how to figure out what to clean up and what preprocessing should be done.

Btw, my hyperparams (a sketch of how they plug together is below):

max vocab size = 30,000
embedding size = 300
GRU hidden activations = 80
max seq. len = 100
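
```python
import torch.nn as nn

class ToxicGRU(nn.Module):
    """Rough sketch of how those numbers plug together - not my actual code."""
    def __init__(self, vocab_size=30000, emb_size=300, hidden=80, n_labels=6):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden, bidirectional=True,
                          batch_first=True)
        self.fc = nn.Linear(2 * hidden, n_labels)  # 2x for the two directions

    def forward(self, x):
        # x: (batch, seq_len) token ids, seq_len capped at 100
        out, _ = self.gru(self.emb(x))
        return self.fc(out.max(dim=1)[0])  # max-pool over time, raw logits
```

Trained with nn.BCEWithLogitsLoss on top of the raw logits.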

Thanks - wg