Kaggle NLP Competition - Toxic Comment Classification Challenge

@jamesrequa thanks for this. I'll give it a shot.

I can't use the whole dataset as I keep running out of memory, so I used code to break the training and testing datasets into 5 chunks, trained on each chunk (approx. 35,000 rows), and then saved the output predictions. I hence have 5 prediction files and was using the method above to join them all into one file.

Were you able to use the whole dataset without running out of memory or were you also using smaller chunks?

Lol, anything you touch turns into 0.9865+ ))

Thanks for the hint once again.

@amritv I was able to train with all of the data. What GPU are you using?

Do you mean that you trained on each class separately? Or you just split it up randomly?

If you split the training data randomly into 5 chunks and still predicted on all of the test data (all classes), then it won't affect the shape of your test predictions and you could still use the reference code I provided before. Personally, I don't think you should predict on only part of the test data each time (if you aren't splitting by class) unless you ran into memory issues with the test data as well, but running predictions alone shouldn't take too much memory.

If you split training and generated predictions separately by class then you could still create a new submission file from scratch without any concatenations. If you are having memory issues then I highly recommend using bcolz to save and reload your prediction arrays.

Below is an example of how that process might look.

  1. Use the below bcolz functions for saving and loading prediction arrays.

    import bcolz

    def save_array(fname, arr): c=bcolz.carray(arr, rootdir=fname, mode='w'); c.flush()
    def load_array(fname): return bcolz.open(fname)[:]

  2. You'll want to save the prediction arrays each time after you run predictions on each chunk:
    save_array('toxic_preds.bc', toxic_preds)

  3. Once you have finished saving all 5 prediction arrays, reload them all back into one submission file. You could run this in a new notebook; make sure the training notebook is shut down to save memory.

    import pandas as pd

    test_ids = pd.read_csv('./input/sample_submission.csv').id.values
    columns = ['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    submission = pd.DataFrame(index=range(0, len(test_ids)), columns=columns)
    # Fill in the test ids -- without this the id column would stay empty
    submission["id"] = test_ids
    submission["toxic"] = load_array('toxic_preds.bc')
    submission["severe_toxic"] = load_array('severe_toxic_preds.bc')
    submission["obscene"] = load_array('obscene_preds.bc')
    submission["threat"] = load_array('threat_preds.bc')
    submission["insult"] = load_array('insult_preds.bc')
    submission["identity_hate"] = load_array('identity_hate_preds.bc')
    submission.to_csv('submission.csv', index=False)

Or, if you just wanted to average your test predictions to hopefully get better overall results, you could still use bcolz for that: save the arrays separately first, then reload them, take the average, and save the final averaged predictions into one submission file.
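For example, averaging two sets of saved toxic-class predictions could look something like this (the file names below are placeholders, not ones used earlier in the thread):

    import numpy as np

    # Reload the per-model prediction arrays saved earlier with save_array
    preds_1 = load_array('toxic_preds_model1.bc')
    preds_2 = load_array('toxic_preds_model2.bc')

    # Unweighted average of the two prediction arrays
    toxic_preds_avg = np.mean([preds_1, preds_2], axis=0)

    # Save the averaged predictions for the submission step above
    save_array('toxic_preds_avg.bc', toxic_preds_avg)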


May I ask what your local CV score is in this case, and how many folds you used?

I am using 10 folds. The average ROC AUC across the 10 folds is 0.9910133397.


Wooaa! You are killing it.

Are you using the CapsNet from Kaggle? May I ask what config you are using?

I tried the same config as in the kernel. I tried playing with different parameters, but nothing improved my single-model score.

@jamesrequa - extremely helpful! I really appreciate your in depth solution. I was finally able to get submissions in. :+1:


Finally crossed the 0.98 threshold with a single model (a PyTorch LSTM with fastText embeddings, using fast.ai goodness to take advantage of SGDR, etc…).

I've done minimal pre-processing and am curious what folks here are doing with their comment text. In particular, are you …

  1. Removing stop words?
  2. Using lemmatization in your tokenization?
  3. What are you removing in your cleanup and/or tokenization process? For example, I'm doing my best to remove URLs and usernames and to replace emojis with their textual equivalent (a simplified sketch follows this list).
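Here is a simplified sketch of the kind of cleanup I mean (the patterns are illustrative, not my exact ones):

    import re

    def basic_clean(text):
        # Strip URLs (http(s) and bare www links)
        text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
        # Strip @usernames
        text = re.sub(r'@\w+', ' ', text)
        return text

For the emoji part, something like emoji.demojize from the emoji package can map emojis to their textual names.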

Also, I've found thus far that a bi-directional LSTM outperforms a bi-directional GRU (all else being the same). Thought that was interesting.
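In PyTorch terms the swap is literally a one-module change, something like this (the sizes are placeholders, not my actual config):

    import torch.nn as nn

    # Identical settings, only the recurrent cell differs
    lstm_encoder = nn.LSTM(input_size=300, hidden_size=128,
                           batch_first=True, bidirectional=True)
    gru_encoder = nn.GRU(input_size=300, hidden_size=128,
                         batch_first=True, bidirectional=True)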

Good luck to all in the competition!


I tried:

  • lemmatize
  • morphy
  • different tokenisations
  • removing special symbols
  • contractions processing
  • removing digits
  • lowercase
  • mapping words that were not found in embedding dictionary to top 100 positive / negative words based on LR weights

Nothing worked ). I reduced the percentage of unknown words from 1.23% to 0.73%, but accuracy did not change.
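For context, the unknown-word percentage can be measured along these lines (a sketch; plain whitespace tokenisation here is an assumption, the real pipeline was more involved):

    from collections import Counter

    def unknown_word_share(comments, emb_vocab):
        # Percentage of token occurrences not covered by the embedding vocab
        counts = Counter(w for c in comments for w in c.split())
        total = sum(counts.values())
        unknown = sum(n for w, n in counts.items() if w not in emb_vocab)
        return 100.0 * unknown / total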


Any suggestions on the final day of the competition? )

Best SINGLE model (a simple bidirectional LSTM using fast.ai for LR annealing and SGDR) with a 10-fold CV setup gave me: 0.9814

Interesting notes:

  1. Score improved by 0.002 by NOT using a pre-trained embedding (I was using fastText 300d).
  2. Tried training using AdamW (via the WeightDecaySchedule callback), but it was worse than just using the Adam optimizer with wds = 1e-5 (a sketch of the difference follows this list).
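The underlying difference (decoupled weight decay in AdamW vs L2-style decay folded into Adam's gradient) can be illustrated in plain PyTorch like this; the actual runs went through fastai's callback, so this is just a sketch:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 6)  # stand-in for the real model

    # Adam folds weight decay into the gradient (L2 regularisation)
    adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

    # AdamW applies weight decay directly to the weights,
    # decoupled from the gradient-based update
    adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)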

Finally I managed to improve the score of a single model with significant cleaning (sketch below):

  • if word in emb vocab: cleaned_comment.append(word)
  • elif word.lower() in emb vocab: cleaned_comment.append(word.lower())
  • else: for w in emb_vocab (sorted from longest to shortest): if w in word: word = word.replace(w, ''); cleaned_comment.append(w)
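As a Python sketch (the names are mine; vocab_by_len is the embedding vocab sorted longest word first, and scanning it per unknown token is slow for large vocabs):

    def clean_comment(comment, emb_vocab, vocab_by_len):
        cleaned_comment = []
        for word in comment.split():
            if word in emb_vocab:
                cleaned_comment.append(word)
            elif word.lower() in emb_vocab:
                cleaned_comment.append(word.lower())
            else:
                # Greedily carve known vocab words out of the unknown token
                for w in vocab_by_len:
                    if w in word:
                        word = word.replace(w, '')
                        cleaned_comment.append(w)
        return ' '.join(cleaned_comment)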

Not sure I understand what you are doing here (any elaboration you can provide would be helpful), but agreed that cleanup improves score.

Would love to see how folks determine WHAT to clean up based on the particular corpus they are looking at.


Good stuff @sermakarevich.

Love how you structured your experiments and a very interesting approach to cleaning up the corpus.

Thanks for sharing your notebook! It’s nice to be able to run the code while learning how you approached the problem.

I created a PR (https://github.com/sermakarevich/jigsaw-toxic-comment-classification-challenge/pull/1) in case others want an easy way to install all the notebook’s dependencies using pip install -r.

Hi everyone, I hope whoever participated in the competition did well on the private leaderboard!

I joined this competition a bit too late, so my first priority is getting better at PyTorch. Did anyone write their scripts in PyTorch and would be willing to share them publicly? It seems most of the public kernels on Kaggle were written in Keras.

@VishnuSubramanian Are you the author of Deep Learning with PyTorch? Your book has been helpful for me in getting started with PyTorch!

I am reading that book; it has a fast.ai-like structure.