Kaggle NLP Competition - Toxic Comment Classification Challenge

(Aditya) #43

Thanks a lot for those tutorials

Just started with NLP…


I just made my first submission to the competition :slight_smile:

Could I please ask what the best way of post-processing the submission is? The sigmoid layer of my model outputs values between 0 and 1, some closer to the extremes than others. What do people find to be the best way of processing these values? Do we just submit them raw, or is it better to push them towards 0 or 1 (say 0.05 and 0.995) depending on whether the output is below or above 0.5?

(sergii makarevych) #45

The metric is AUC, so the scale of the probabilities does not matter, only their rank.
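Since AUC depends only on the ordering of the scores, any monotonic rescaling leaves it unchanged. A quick sanity check with sklearn (the toy labels and scores are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Pushing scores towards 0/1 is a monotonic transform: it preserves ranks.
squashed = np.where(scores > 0.5, 0.5 + scores / 2, scores / 2)

print(roc_auc_score(y_true, scores))
print(roc_auc_score(y_true, squashed))  # identical to the line above
```

So submitting the raw sigmoid outputs is fine — rounding them towards the extremes gains nothing.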

(Trey Kollmer) #46

Hiromi thank you so much! You’ve been so helpful!

(Hiromi Suenaga) #47

Sure thing. I am also struggling with this competition, but we’ll get there :slight_smile:

(sergii makarevych) #48

Simple example of using sklearn pipelines for this task:

  • create a simple pipeline of default sklearn estimators/transformers
  • create our own estimator/transformer
  • create a pipeline which will process features in a different way and then join them horizontally
  • finetune parameters
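A minimal sketch of those four steps (the class name, toy texts, and parameter grid are mine, not from the actual notebook): a custom transformer for comment length is joined horizontally with TF-IDF features via `FeatureUnion`, and the parameters are finetuned with `GridSearchCV`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

class TextLength(BaseEstimator, TransformerMixin):
    """Our own transformer: character length of each comment as one feature."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(text)] for text in X], dtype=float)

pipe = Pipeline([
    ("features", FeatureUnion([        # processes features two ways, joins horizontally
        ("tfidf", TfidfVectorizer()),  # default sklearn transformer
        ("length", TextLength()),      # our own transformer
    ])),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Toy data standing in for the comment texts and a single toxicity label.
texts = ["you are great", "you are awful", "nice work", "terrible idea"] * 5
labels = [0, 1, 0, 1] * 5

# Finetune parameters of any step via the step__param naming convention.
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0]}, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)
```

For the actual competition you would fit one such pipeline per label (or use a multi-label wrapper), since each comment can carry several toxicity tags.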


When you mention CV - is this a cross-validation score that you get when training the model on multiple folds? Or are those just results on a validation set carved out of the training data (say 20% put aside)?


I implemented Jeremy’s Improved LSTM baseline using the fastai library. I would love to share the code, but I think I would need to share it on Kaggle and I don’t want to do that. I also think sharing it as-is would do the library a disservice - I had to write quite a bit of code to make it happen, and I suspect there is a simpler way to implement this that I don’t know (yet). If I figure it out I will try to share my notebook.

Anyhow, if someone would be interested in embarking on a really neat learning journey where in the end you get to train pytorch models using the familiar interface with the lr finder and cosine annealing, here is all you need to know:

Constructing the data loaders
Additional information and a neat way of using GloVe embeddings

Key fastai functionality:
1. Create a model data object from the dataloaders.
2. Create a learner from the data object and a PyTorch model (nn.Module).
3. Set an appropriate crit on the learner.
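For step 3, the appropriate crit for six independent sigmoid outputs is binary cross-entropy applied per label (in fastai/PyTorch you would typically set `learn.crit = F.binary_cross_entropy_with_logits`). A numpy sketch of what that loss computes — the sample logits and targets below are made up:

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, applied to raw logits.
    Mirrors what F.binary_cross_entropy_with_logits computes."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid per label
    losses = -(targets * np.log(probs) + (1 - targets) * np.log(1 - probs))
    return losses.mean()

# One comment, six toxicity labels (toxic, severe_toxic, obscene, ...).
logits = np.array([[2.0, -3.0, 1.5, -2.0, -1.0, -4.0]])
targets = np.array([[1.0, 0.0, 1.0, 0.0, 0.0, 0.0]])
print(bce_with_logits(logits, targets))
```

Each label contributes its own binary loss, which is why this works for multi-label data where several tags can be on at once (unlike softmax cross-entropy, which assumes exactly one).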

I realize it’s vague, and apologies if this is not easy to follow - I am hoping I will overcome my fear of sharing unpolished code and will make it public after the competition ends :slight_smile: Or, if I have the time to learn the NLP functionality of the fastai library, I will try to put something nicer together and share it even before the competition ends :slight_smile:

As I type these words I have the training running using SGDR. I missed the opportunity to save models at cycle ends for this run, but it is so cool that we can take any model and any dataset and run them with the familiar set of amazing tools :slight_smile:

BTW I can’t comment on keras but the functionality torchtext exposes in those bucket iterators is amazingly flexible, definitely worth checking out.

EDIT: I think PyTorch’s LSTM doesn’t support dropout on the hidden-state transition between timesteps - the built-in dropout argument only applies between stacked layers. I suspect this is why we have the WeightDrop in rnn_reg.py. But even without it the PyTorch re-implementation should do comparatively well - a few guesses here, but I think cosine annealing is quite helpful, and I have not even tried snapshotting yet.


A model that I trained achieves a loss of 0.049 and an accuracy of 0.9815 on a 10% validation set. On the LB this gives me 0.9687. I wonder if those results are roughly in the correct ballpark - should a model with such stats locally achieve a score like 0.9687?

I am thinking I might have some problems with the pipeline. Do those numbers seem off or is this how a model with acc of 0.9815 should roughly generalize to test set?

EDIT: Just ran the notebook from Jeremy in keras, which gets 0.9770 on the LB, so I’m guessing 0.9687 is okay-ish. Maybe the difference under the ROC curve is caused by the 0.18 difference in accuracy? Or possibly the spacy tokenizer that I use does something weird with the data that doesn’t generalize to the test set as well? It would be good to try with the keras tokenizer, but as far as I know it doesn’t have a method you can call on a string of text to break it down into tokens - it seems to go straight to the int representation.

EDIT2: I read about AUC ROC - quite a neat concept. In light of it, accuracy at a 0.5 threshold and the raw loss values are not that meaningful. It could be that my pipeline is off, but more likely the model is just not that good.

EDIT3: There was some dropout my model was missing vs the baseline… I’m getting very similar results now, though I can’t verify on the LB as I am out of submissions for the day.

EDIT4: Lo and behold - dropout turns out to be the missing ingredient. The result is lower by only 0.005 on the LB vs what I get running the keras LSTM baseline. BTW, training in PyTorch is 5-7 times faster than in keras using the same batch size, and with an increased batch size I can get an epoch down to ~15 seconds :slight_smile: Haven’t played with this much yet.

(WG) #52

I’m trying to implement this (https://www.kaggle.com/yekenot/pooled-gru-fasttext/code) in plain old pytorch, but I’m consistently scoring around .51 in the competition. It’s a simple bidirectional LSTM that, as far as I can tell, mimics the keras example here.

I have no idea what I’m doing that is giving me such miserable results.

I’d love to share my code, but I don’t think it’s the right thing to do while the competition is ongoing. I would love to get this thing scoring around .97-.98, and I’m pretty sure I’m missing something trivial - but what?

Any of you guys come across the same issue and figure out what to modify to get your models working?

(sergii makarevych) #53

It is a cross-validation score - the average score over 10 validation sets (10-fold CV: the train set is split into 10 folds, each used once as the validation set). I shared somewhere on the forum how I did it for image recognition tasks.
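That averaging, sketched with sklearn (toy data here; in the competition you would train the real model on each fold the same way):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)

scores = []
for trn_idx, val_idx in StratifiedKFold(
    n_splits=10, shuffle=True, random_state=0
).split(X, y):
    # Train on 9 folds, score on the held-out 10th.
    model = LogisticRegression(solver="liblinear").fit(X[trn_idx], y[trn_idx])
    scores.append(roc_auc_score(y[val_idx], model.predict_proba(X[val_idx])[:, 1]))

print(np.mean(scores))  # the CV score: average AUC over the 10 validation folds
```

The CV score is typically more stable than a single 20% holdout, since every example gets used for validation exactly once.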

BTW new GRU is 0.9842 on the public LB and 0.9876 on CV


~0.5 sounds like you might be shuffling your test set, or may have some other problem with creating the submission. An AUC of exactly 0.5 would be random guessing.

Load ULMFit model
(Hiromi Suenaga) #55

I did notice the use of torchtext.data.BucketIterator in your code. It’s fine for the training and validation sets, but what it does is try to create batches from texts of similar length (hence reducing the amount of padding you need). So if you use it for the test set, you’ll end up with shuffled predictions.
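Library specifics aside, the general fix is the same everywhere: if an iterator reorders examples, record each example's original index alongside its prediction and invert that permutation before writing the submission. A torchtext-free numpy sketch (the indices and predictions are invented for illustration):

```python
import numpy as np

# Suppose the iterator yielded examples in length-sorted order, and we
# kept each example's original test-set index alongside its prediction.
order = np.array([3, 0, 4, 1, 2])            # original index of each yielded example
preds = np.array([0.9, 0.1, 0.8, 0.2, 0.3])  # predictions in yielded order

# Invert the permutation so preds line up with the original row order.
restored = np.empty_like(preds)
restored[order] = preds
print(restored)  # -> [0.1 0.2 0.3 0.9 0.8]
```

(The simpler alternative, of course, is to disable shuffling and sorting on the test iterator entirely, as discussed below in this thread.)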

(Jeremy Howard) #56

For those using fastai for this, you’ll find this much easier using fastai.text instead of fastai.nlp. Bad news is it is undocumented so far - sorry! I’ll try to write something up soon-ish…

(WG) #57

The problem was exactly what @radek suggested.

I am using plain old torchtext.data.Iterator for my test dataset, but after looking at the source code for it I noticed that I had to either pass train=False or shuffle=False in the constructor.

@jeremy, I actually was able to use the new fastai.text code for language modeling. Interestingly enough, it seemed to train faster and get better results than the fastai.nlp code on the same IMDB dataset from lesson 4. However, I couldn’t figure out how to do the sentiment analysis part with the new fastai.text code.


I joined this competition really late - sometime last week. Can’t decide if it is a blessing or a curse that I joined so late :smiley: I’m having boatloads of fun, but the limited time frame makes for some interesting trade-offs :slight_smile: And I still would like to review a couple of things before part 2 starts.

Anyhow - I wanted to say I will try training the language model from scratch if time permits, the fastai way. It would also be interesting to test-drive the RNN by Merity et al. that we get in the fastai lib :slight_smile: It seems 100 chars is the magic sequence length after which training gets really weird with RNNs - I wonder how this one behaves.

All the best to everyone also participating in the comp and I have a hard time believing how much fun this competition is!

(Allie Crevier) #59

So I barely know what I’m doing… wow, I say this a lot, but it’s true, especially when it comes to machine learning. This was my first Kaggle project, and I’m going to try to improve my results as I begin to understand more about multi-label classification, NLP, etc. In case there are people interested in seeing some middle-of-the-road/ right-lane/ veering-off-into-the-shoulder type of work, here’s what I’m doing, and I’ll continue to update this repo as I figure out how to improve results:

I would love it if anyone has any tips, questions, or wants to work with me on this.

(Allie Crevier) #60

I’m in 2953rd place!

(Benedikt Brandt) #61

Hi treyko, could you provide a little more detail on how you modified PoolingLinearClassifier to output a sigmoid over six output units? This is the first time I am trying to modify a NN and I am a little lost on how to create the output layer. I looked at the implementation of PoolingLinearClassifier, but the tensor indexing/operations seem like magic. I also looked at the link Question on labeling text for sentiment analysis, but I am unsure of how to set up a custom classifier.

(Vibhutha Kumarage) #62

Guys… have you seen the current public LB and the latest kernels? The competition is becoming a blending competition now…