Kaggle NLP Competition - Toxic Comment Classification Challenge

@jamesrequa thanks for this. I'll give it a shot.

I can't use the whole dataset as I keep running out of memory, so I used code to break the training and testing datasets into 5 chunks, trained on each chunk (approx. 35,000 rows), and then saved the output predictions. I hence have 5 prediction files and was using the method above to join them all into one file.

Were you able to use the whole dataset without running out of memory or were you also using smaller chunks?

Lol, anything you touch turns into 0.9865+ ))

Thanks for the hint once again.

@amritv I was able to train with all of the data. What GPU are you using?

Do you mean that you trained on each class separately? Or you just split it up randomly?

If you split the training data randomly into 5 chunks and still predicted on all of the test data (all classes), then it won't affect the shape of your test predictions and you could still use the reference code I provided before. Personally, I don't think you should predict on only part of the test data each time (if you aren't splitting by class) unless you ran into memory issues with the test data as well, but running predictions alone shouldn't take too much memory.

If you split training and generated predictions separately by class then you could still create a new submission file from scratch without any concatenations. If you are having memory issues then I highly recommend using bcolz to save and reload your prediction arrays.

Below is an example of how that process might look.

  1. Use the below bcolz functions for saving and loading prediction arrays.

    import bcolz

    def save_array(fname, arr): c=bcolz.carray(arr, rootdir=fname, mode='w'); c.flush()
    def load_array(fname): return bcolz.open(fname)[:]

  2. You'll want to save the prediction arrays each time after you run predictions on each chunk:
    save_array('toxic_preds.bc', toxic_preds)

  3. Once you have finished saving all 5 prediction arrays, reload them all back into one submission file. You could run this in a new notebook; make sure the training notebook is shut down to save memory.

    import pandas as pd

    test_ids = pd.read_csv('./input/sample_submission.csv').id.values
    columns = ['id', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
    submission = pd.DataFrame(index=range(0, len(test_ids)), columns=columns)
    # Fill in the test ids -- without this the id column would stay empty
    submission["id"] = test_ids
    submission["toxic"] = load_array('toxic_preds.bc')
    submission["severe_toxic"] = load_array('severe_toxic_preds.bc')
    submission["obscene"] = load_array('obscene_preds.bc')
    submission["threat"] = load_array('threat_preds.bc')
    submission["insult"] = load_array('insult_preds.bc')
    submission["identity_hate"] = load_array('identity_hate_preds.bc')
    submission.to_csv('submission.csv', index=False)

Or, if you just wanted to average your test predictions to hopefully get better overall results, you could still use bcolz for that: save the arrays separately first, then reload them, take the average, and save the final averaged predictions into one submission file.
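For example, averaging two sets of saved toxic-class predictions could look something like this (the file names below are placeholders, not ones used earlier in the thread):

    import numpy as np

    # Reload the per-model prediction arrays saved earlier with save_array
    preds_1 = load_array('toxic_preds_model1.bc')
    preds_2 = load_array('toxic_preds_model2.bc')

    # Unweighted average of the two prediction arrays
    toxic_preds_avg = np.mean([preds_1, preds_2], axis=0)

    # Save the averaged predictions for the submission step above
    save_array('toxic_preds_avg.bc', toxic_preds_avg)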


May I ask what your local CV score is in this case, and how many folds you used?

I am using 10 folds. The average ROC AUC across the 10 folds is 0.9910133397.


Wooaa! You are killing it.

Are you using the CapsNet from Kaggle? May I ask what config you are using?

I tried the same config as in the kernel. I tried playing with different parameters, but nothing improved my single-model score.

@jamesrequa - extremely helpful! I really appreciate your in depth solution. I was finally able to get submissions in. :+1:


Finally crossed the 0.98 threshold with a single model (a PyTorch LSTM with fastText embeddings, using fast.ai goodness to take advantage of SGDR, etc…).

I've done minimal pre-processing and am curious what folks here are doing with their comment text. In particular, are you …

  1. Removing stop words?
  2. Using lemmatization in your tokenization?
  3. What are you removing in your cleanup and/or tokenization process? For example, I'm doing my best to remove URLs and usernames and to replace emojis with their textual equivalent (a simplified sketch follows this list).
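Here is a simplified sketch of the kind of cleanup I mean (the patterns are illustrative, not my exact ones):

    import re

    def basic_clean(text):
        # Strip URLs (http(s) and bare www links)
        text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
        # Strip @usernames
        text = re.sub(r'@\w+', ' ', text)
        return text

For the emoji part, something like emoji.demojize from the emoji package can map emojis to their textual names.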

Also, I've found thus far that a bi-directional LSTM outperforms a bi-directional GRU (all else being the same). Thought that was interesting.
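In PyTorch terms the swap is literally a one-module change, something like this (the sizes are placeholders, not my actual config):

    import torch.nn as nn

    # Identical settings, only the recurrent cell differs
    lstm_encoder = nn.LSTM(input_size=300, hidden_size=128,
                           batch_first=True, bidirectional=True)
    gru_encoder = nn.GRU(input_size=300, hidden_size=128,
                         batch_first=True, bidirectional=True)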

Good luck to all in the competition!


I tried:

  • lemmatize
  • morphy
  • different tokenisations
  • removing special symbols
  • contractions processing
  • removing digits
  • lowercase
  • mapping words that were not found in embedding dictionary to top 100 positive / negative words based on LR weights

Nothing worked ). I reduced the percentage of unknown words from 1.23% to 0.73%, but accuracy did not change.
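For context, the unknown-word percentage can be measured along these lines (a sketch; plain whitespace tokenisation here is an assumption, the real pipeline was more involved):

    from collections import Counter

    def unknown_word_share(comments, emb_vocab):
        # Percentage of token occurrences not covered by the embedding vocab
        counts = Counter(w for c in comments for w in c.split())
        total = sum(counts.values())
        unknown = sum(n for w, n in counts.items() if w not in emb_vocab)
        return 100.0 * unknown / total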


Any suggestions on the final day of the competition? )

Best SINGLE model (a simple bidirectional LSTM using fast.ai for LR annealing and SGDR) with a 10-fold CV setup gave me: 0.9814

Interesting notes:

  1. Score improved by 0.002 by NOT using a pre-trained embedding (I was using fastText 300d).
  2. Tried training using AdamW (via the WeightDecaySchedule callback), but it was worse than just using the Adam optimizer with wds = 1e-5 (a sketch of the difference follows this list).
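The underlying difference (decoupled weight decay in AdamW vs L2-style decay folded into Adam's gradient) can be illustrated in plain PyTorch like this; the actual runs went through fastai's callback, so this is just a sketch:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 6)  # stand-in for the real model

    # Adam folds weight decay into the gradient (L2 regularisation)
    adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

    # AdamW applies weight decay directly to the weights,
    # decoupled from the gradient-based update
    adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-5)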

Finally I managed to improve the score of a single model with significant cleaning (sketch below):

  • if word in emb vocab: cleaned_comment.append(word)
  • elif word.lower() in emb vocab: cleaned_comment.append(word.lower())
  • else: for w in emb_vocab (sorted from longest to shortest): if w in word: word = word.replace(w, ''); cleaned_comment.append(w)
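As a Python sketch (the names are mine; vocab_by_len is the embedding vocab sorted longest word first, and scanning it per unknown token is slow for large vocabs):

    def clean_comment(comment, emb_vocab, vocab_by_len):
        cleaned_comment = []
        for word in comment.split():
            if word in emb_vocab:
                cleaned_comment.append(word)
            elif word.lower() in emb_vocab:
                cleaned_comment.append(word.lower())
            else:
                # Greedily carve known vocab words out of the unknown token
                for w in vocab_by_len:
                    if w in word:
                        word = word.replace(w, '')
                        cleaned_comment.append(w)
        return ' '.join(cleaned_comment)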

Not sure I understand what you are doing here (any elaboration you can provide would be helpful), but agreed that cleanup improves score.

Would love to see how folks determine WHAT to clean up based on the particular corpus they are looking at.


Good stuff @sermakarevich.

Love how you structured your experiments and a very interesting approach to cleaning up the corpus.

Thanks for sharing your notebook! It’s nice to be able to run the code while learning how you approached the problem.

I created a PR (https://github.com/sermakarevich/jigsaw-toxic-comment-classification-challenge/pull/1) in case others want an easy way to install all the notebook’s dependencies using pip install -r.

Hi everyone, I hope whoever participated in the competition did well on the private leaderboard!

I joined this competition a bit too late, so my first priority is getting better at PyTorch. Did anyone write their scripts in PyTorch and would be willing to share them publicly? It seems most of the public kernels on Kaggle were written in Keras.

@VishnuSubramanian Are you the author of Deep Learning with PyTorch? Your book has been helpful for me in getting started with PyTorch!

I am reading that book; it has a fast.ai-like structure.