Kaggle NLP Competition - Toxic Comment Classification Challenge

When you mention CV - is this a cross validation score that you get when training the model on multiple folds? Or are those just results on the validation set made from the test set? (say 20% put aside?)

I implemented Jeremy’s Improved LSTM baseline using the fastai library. I would love to share the code but I think I would need to share it on kaggle and I don’t want to do this. I think sharing the code would do the library a disservice. I had to write quite a bit of code to make it happen and I think there might be a simpler way to implement this but I don’t know how (yet). If I figure it out I will try to share my notebook.

Anyhow, if someone would be interested in embarking on a really neat learning journey where in the end you get to train pytorch models using the familiar interface with lr finder and cosine annealing here is all you need to know:

Constructing the data loaders
Additional information and a neat way of using GloVe embeddings

Key fastai functionality:
1. Create a model data object from dataloaders
2. Create a learner from the data object and a pytorch model (nn.Module)
3. Set appropriate crit on the learner.

I realize it’s vague, and apologies if this is not easy to follow - I am hoping I will once again overcome my fear of sharing unpolished code and make it public after the competition ends :slight_smile: Or if I have the time and learn how to use the NLP functionality of the fastai library, I will try to put something nicer together and share it even before then :slight_smile:

As I type these words I have the training running using SGDR. Missed the opportunity to train with saving the models at cycle ends for this run, but this is so cool that we can take any model, any dataset and run it with the familiar set of amazing tools :slight_smile:
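For anyone unfamiliar with SGDR, here is a minimal sketch of the learning rate schedule it refers to: cosine annealing within a cycle, followed by a restart to the maximum rate. The function name `sgdr_lr` and its parameters are made up for illustration and are not the fastai API. The second return value marks the last step of a cycle, which is the natural point to save a snapshot of the model.

```python
import math

def sgdr_lr(step, cycle_len, lr_max, lr_min=0.0, cycle_mult=1):
    """Return (learning rate, is_last_step_of_cycle) for a given step.

    Hypothetical helper, not part of fastai or PyTorch.
    """
    # Walk forward through the cycles until we find the one containing `step`.
    start, length = 0, cycle_len
    while step >= start + length:
        start += length
        length *= cycle_mult  # optionally make each cycle longer than the last
    pos = (step - start) / length  # fraction of the current cycle completed
    # Cosine annealing from lr_max down towards lr_min within the cycle.
    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * pos))
    at_cycle_end = (step - start) == length - 1
    return lr, at_cycle_end

# Two cycles of four steps each: the rate decays, then jumps back to lr_max.
schedule = [sgdr_lr(s, cycle_len=4, lr_max=0.01) for s in range(8)]
for step, (lr, save_now) in enumerate(schedule):
    print(step, round(lr, 5), "save snapshot" if save_now else "")
```

With `cycle_mult=2` each cycle is twice as long as the previous one, so the restarts become less frequent as training goes on.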

BTW I can’t comment on keras but the functionality torchtext exposes in those bucket iterators is amazingly flexible, definitely worth checking out.

EDIT: I think PyTorch’s LSTM doesn’t support dropout on the hidden state transition between steps in LSTM. I suspect this is why we have the WeightDrop in rnn_reg.py. But even without it the re-implementation in PyTorch should do comparatively well - few guesses here but I think that Cosine Annealing is quite helpful and I have not even tried snapshotting yet.


A model that I trained achieves a loss of 0.049 and accuracy of 0.9815 on a 10% validation set. On the LB this gives me 0.9687. I wonder if those results are roughly in the correct ballpark? Should a model with such stats locally achieve a score such as 0.9687?

I am thinking I might have some problems with the pipeline. Do those numbers seem off or is this how a model with acc of 0.9815 should roughly generalize to test set?

EDIT: Just ran the notebook from Jeremy in keras which gets 0.9770 on LB, guessing 0.9687 is okay-ish. Maybe the difference under the ROC curve is caused by the 0.18 difference in accuracy? Or possibly the spacy tokenizer that I use does something weird with the data that doesn’t generalize to the test set as well? Would be good to try with the keras tokenizer but it doesn’t seem to have a method you can call on a string of text as far as I know to break it down into tokens - it seems to go straight to int representation.

EDIT2: I read about the AUC ROC - quite a neat concept. In light of it accuracy at a threshold of 0.5 and the cost losses are not that meaningful. Could be my pipeline is off but likely the model is just not that good.
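To make the point about accuracy vs AUC concrete, here is a small sketch in plain Python. The data is a made-up toy example (2 positives, 18 negatives), not anything from the competition, and the helper names are my own. It computes AUC as the fraction of (positive, negative) pairs where the positive gets the higher score, with ties counting as half.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples classified correctly at a fixed threshold."""
    return sum((s >= threshold) == bool(y) for y, s in zip(labels, scores)) / len(labels)

# Toy, heavily imbalanced data: 2 toxic comments, 18 clean ones.
labels = [1, 1] + [0] * 18
# Model A ranks the positives near the top, but never crosses 0.5.
scores_a = [0.4, 0.3] + [0.35] + [0.1] * 17
# Model B ranks the positives at the very bottom.
scores_b = [0.05, 0.05] + [0.35] + [0.1] * 17

acc_a, auc_a = accuracy(labels, scores_a), roc_auc(labels, scores_a)
acc_b, auc_b = accuracy(labels, scores_b), roc_auc(labels, scores_b)
print(acc_a, auc_a)
print(acc_b, auc_b)
```

Both toy models predict "not toxic" for everything at a threshold of 0.5 and so have the same accuracy of 0.9, yet model A has an AUC of 35/36 while model B has an AUC of 0. With rare positive labels, accuracy says very little about how well the scores are ranked.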

EDIT3: There was some dropout my model was missing vs the baseline… getting very similar results now though can’t verify on LB as I am out of submissions for the day.

EDIT4: Lo and behold - dropout turns out to be the missing ingredient. The result is lower by only 0.005 on the LB vs what I get running the keras lstm baseline. BTW the training in PyTorch is 5 - 7 times faster than in keras using the same batch sizes, and with increased batch size I can get an epoch down to ~ 15 seconds :slight_smile: Haven’t played with this much yet


Trying to implement this (https://www.kaggle.com/yekenot/pooled-gru-fasttext/code) in plain old pytorch, but consistently scoring around .51 in the competition. It’s a simple bidirectional LSTM that, as far as I can tell, mimics the keras example here.

I have no idea what I’m doing that is giving me such miserable results.

I’d love to share my code but I don’t think it’s the right thing to do as the competition is ongoing. Would love to get this thing scoring around .97-.98, and I’m pretty sure I’m missing something trivial, but what???

Any of you guys come across the same issue and figure out what to modify to get your models working?

It is a cross validation score - the average score over 10 validation sets (10-fold CV: the train set is split into 10 folds and each fold serves once as the validation set). I shared somewhere on the forum how I did it for image recognition tasks.
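A minimal sketch of that procedure in plain Python, in case it helps. The names `kfold_indices`, `cross_val_score` and the `fit_and_score` callback are hypothetical and stand in for whatever training and scoring code you actually use.

```python
import random

def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, val_indices) for each of the n_folds splits."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into n_folds roughly equal folds.
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val

def cross_val_score(n_samples, fit_and_score, n_folds=10):
    """Average the validation score over all folds.

    `fit_and_score(train_idx, val_idx)` is a user-supplied (hypothetical)
    callback that trains on train_idx and returns the score on val_idx.
    """
    scores = [fit_and_score(tr, va) for tr, va in kfold_indices(n_samples, n_folds)]
    return sum(scores) / len(scores)

# Sanity check: every sample lands in exactly one validation fold.
splits = list(kfold_indices(100))
all_val = sorted(i for _, val in splits for i in val)
print(len(splits), all_val == list(range(100)))
```

Every sample is used for validation exactly once and for training in the other nine folds, so the averaged score uses the whole train set.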

BTW new GRU is 0.9842 on the public LB and 0.9876 on CV


~0.5 sounds like you might be shuffling your test set or may have some other problem with creating the submission. Exactly 0.5 would be random

I did notice the use of torchtext.data.BucketIterator in your code. It’s fine for the validation and training sets, but it tries to create batches of texts of similar length (hence reducing the amount of padding you need). So if you use it for the test set, you’ll end up with shuffled predictions.
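If you do want length-bucketed batches at test time, one general workaround is to remember each example’s original position and write the predictions back to it. This is a sketch in plain Python and does not use torchtext. `predict_in_original_order` and `predict_batch` are hypothetical names, with `predict_batch` standing in for your model’s forward pass on one batch.

```python
def predict_in_original_order(texts, predict_batch, batch_size=2):
    """Batch texts by length, but return predictions in the input order."""
    # Indices of the texts sorted by length, so each batch needs little padding.
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    preds = [None] * len(texts)
    for start in range(0, len(order), batch_size):
        batch_idx = order[start:start + batch_size]
        batch_preds = predict_batch([texts[i] for i in batch_idx])
        # Write each prediction back to the position of its source text.
        for i, p in zip(batch_idx, batch_preds):
            preds[i] = p
    return preds

# Stand-in "model": predicts the character length of each text.
def fake_predict_batch(batch):
    return [len(t) for t in batch]

texts = ["a bb ccc", "a", "a bb", "a bb ccc dddd"]
preds = predict_in_original_order(texts, fake_predict_batch)
print(preds)
```

Concatenating the batch outputs directly would give the predictions in length-sorted order, so rows in the submission file would no longer line up with their ids.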


For those using fastai for this, you’ll find this much easier using fastai.text instead of fastai.nlp. Bad news is it is undocumented so far - sorry! I’ll try to write something up soon-ish…


The problem was exactly what @radek suggested.

I am using plain old torchtext.data.Iterator for my test dataset, but after looking at the source code for it I noticed that I had to either pass train=False or shuffle=False in the constructor.

@jeremy, I actually was able to use the new fastai.text code for language modeling. Interestingly enough, it seemed to train faster and get better results than the fastai.nlp code on the same imdb dataset from lesson 4. However, I couldn’t figure out how to do the sentiment analysis part with the new fastai.text code.


I joined this competition really late - sometime last week. Can’t decide if it is a blessing or a curse that I joined so late :smiley: Having boatloads of fun but the limited time frame makes for some interesting trade offs :slight_smile: And I still would like to review a couple of things before part 2 starts.

Anyhow - wanted to say I will try training the language model from scratch if time permits, the fastai way. It would also be interesting to test drive the RNN by Merity et al. that we get in the fastai lib :slight_smile: It seems 100 chars is the magic sequence length after which training gets really weird with RNNs, I wonder how this one behaves.

All the best to everyone also participating in the comp and I have a hard time believing how much fun this competition is!


So I barely know what I’m doing… wow I say this a lot, but it’s true, especially when it comes to machine learning… this was my first kaggle project and I’m going to try to improve my results as I begin to understand more about multilabel classification, NLP, etc. In case there are people interested in seeing some middle-of-the-road/ right-lane/ veering-off-into-the-shoulder type of work, here’s what I’m doing, and I’ll continue to update this repo as I figure out how to improve results:

I would love it if anyone has any tips, questions, or wants to work with me on this.

I’m in 2953rd place!


Hi treyko, could you provide a little more detail on how you modified PoolingLinearClassifier to output the sigmoid for six output units? This is the first time I am trying to modify a NN and I am a little lost on how to create the output layer. I looked at the implementation of PoolingLinearClassifier but the tensor indexing/operations seem like magic. I looked at the link "Question on labeling text for sentiment analysis" but I am unsure of how to set up a custom classifier.

Guys… Have you seen current public LB and latest kernels ?.. The competition is becoming a blending competition now…

Very helpful to get started with NLP in PyTorch (PS: only for beginners :slight_smile:)



It’s getting worse day by day. I think people are overfitting on LB. The same submission is used by many.

I am sure that at least up to the 100th position there is an exact correlation between local CV and the public leaderboard.

Hierarchical Attention Network in keras with Tensorflow backend https://www.kaggle.com/sermakarevich/hierarchical-attention-network


My latest vanilla pytorch code scored 0.9322 using a BiLSTM network.

Like others have mentioned, you really need to pay attention to how the test dataset is being processed.

In any case, rather happy with the results since training loss was around 0.066, so not much off from the final outcome :slight_smile:

Using this as a reference: http://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/

Prior to this I was referring a lot to wgpubs’ code with little success. I think I managed to fix the randomized predictions but the Kaggle score was just around 0.57. Might need more troubleshooting.


My Bidirectional GRU model with fastText embeddings scored 0.9847 on public leaderboard -


I trained a few similar models with different pretrained word embeddings and got similar results.

I also tried more complex architectures, but didn’t get any improvements. Any suggestions would be greatly appreciated.