Student Project: Toxic Comment Classification (Kaggle)

This thread was created after brainstorming projects with the SF virtual study group. The deadline for this competition (https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification) is June 22nd and I’m hoping to gather some ideas and get started as soon as possible.

Goal of the project: Win this competition using fast.ai!

To do well in this competition, you will likely need a good interface between fastai and HuggingFace Transformers (currently the best deep learning library for NLP). AFAIK, none exists right now.

It actually does, via FastHugs by @morgan :wink:

https://github.com/morganmcg1/fasthugs

Considering it’s multilingual, though, perhaps giving MultiFiT a whirl could be advantageous.

This will certainly help you get started, but I’m not sure it covers everything you typically have in BERT (or other transformer) training. For example, you need the tokenizer to create token_type_ids and the model to consume them. There may be some other details missing from FastHugs, but it’s worth looking into :slight_smile:

If you check out the example notebook, he’s using HuggingFace directly and doing exactly what you describe :slight_smile: The tokenizing isn’t done by fastai; instead the HuggingFace tokenizer is wrapped (SentencePiece is handled the same way), and you declare which model to use, etc.
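
The gist of the wrapping idea, as a rough untested sketch (class and argument names here are mine, not FastHugs’ actual API):

    from fastai.text.all import *          # fastai v2
    from transformers import AutoTokenizer

    class HFTokenizeTransform(Transform):
        "Hypothetical wrapper: turn raw text into HuggingFace token ids for fastai's data pipeline"
        def __init__(self, pretrained_name='bert-base-multilingual-cased', max_len=128):
            self.tokenizer = AutoTokenizer.from_pretrained(pretrained_name)
            self.max_len = max_len

        def encodes(self, text):
            # naive truncation for the sketch; a real version should keep the final [SEP] token
            ids = self.tokenizer.encode(str(text))[:self.max_len]
            return TensorText(tensor(ids))

fastai then just sees tensors of ids, and the HuggingFace model itself is wrapped in a small nn.Module and handed to the Learner.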

However, there’s another reason fastai probably cannot be used here. The data is huge, and participants in this competition are mainly relying on TPU resources. Unfortunately, fastai does not yet have TPU support (I worked on it in the past and plan to continue soon, but it probably won’t be ready in time for this competition).

IDK, but maybe MultiFiT is a small enough model to train on this dataset? Completely unsure though…

I don’t see token_type_ids being passed into the model, and there’s no mention of token_type_ids anywhere in the code. Am I missing something?

        logits = self.transformer(input_ids, attention_mask = attention_mask)[0] 

It’s not hard to fix; I’m just pointing out that whoever works on this will need to make sure the inputs, model definition, and outputs match regular HuggingFace transformer training.
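
For example, something roughly like this (untested; it assumes the dataloader also yields token_type_ids, and note that not every architecture accepts them, e.g. DistilBERT doesn’t):

        # hypothetical fix: forward the segment ids produced by the tokenizer
        logits = self.transformer(input_ids,
                                  attention_mask=attention_mask,
                                  token_type_ids=token_type_ids)[0]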

You may not be missing anything; it may be me who is! However, I @’d Morgan so he’ll be aware of this and can comment more once he sees it :slight_smile: (I only looked at the code very briefly and thought I saw what you described, but clearly I did not!)

On TPU: yeah, if it’s a TPU-focused competition fastai won’t be much help, at least for v2. @ilovescience, IIRC did you get fastai v1 working on TPUs? I remember you trying.

Yes, I did get a training loop working for ResNet, but it was not fast enough; it was IO/CPU-limited and I need to look into it further. I plan to port my code over to fastai v2, but the changes to the low-level API will make it slightly more difficult than it was in fastai v1.
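
For anyone curious, a bare single-core torch_xla step looks roughly like this (sketch only; model, train_dl, opt and loss_func are placeholders). The hard part is wrapping this kind of step into fastai’s callback machinery and keeping the TPU fed with data:

    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    device = xm.xla_device()                    # grab a TPU core
    model = model.to(device)
    loader = pl.ParallelLoader(train_dl, [device]).per_device_loader(device)

    for xb, yb in loader:                       # batches arrive already on the TPU
        opt.zero_grad()
        loss = loss_func(model(xb), yb)
        loss.backward()
        xm.optimizer_step(opt)                  # reduces gradients (multi-core) and steps the optimizer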

Life has been busy with research work, but I will let you know if I do get TPU support working in fastai v2 during this competition. Then it would be great if you all explored how to use @morgan’s code to train multilingual BERT on TPU with fastai v2.

In the meantime, maybe consider looking into the Tweet Sentiment Extraction competition; I think it’s a smaller dataset. Note that it’s actually a question-answering problem (check out Abhishek’s video), so you will have to modify FastHugs for that purpose (FastHugs seems to handle only classification).
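
To illustrate the difference, a QA-style head predicts start/end token positions of an answer span rather than a single class. A rough HuggingFace sketch (the model choice and the label-as-question framing are just illustrative, and the QA head here is untrained, so the span is meaningless until you fine-tune):

    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    tok = AutoTokenizer.from_pretrained('bert-base-uncased')
    model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')

    # treat the sentiment label as the "question" and the tweet as the "context"
    enc = tok.encode_plus('positive', 'my dog is the best', return_tensors='pt')
    outputs = model(**enc)
    start_logits, end_logits = outputs[0], outputs[1]

    # predicted answer span = tokens from argmax(start_logits) to argmax(end_logits)
    start, end = start_logits.argmax(-1).item(), end_logits.argmax(-1).item()
    print(tok.decode(enc['input_ids'][0][start:end + 1].tolist()))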

Ah brilliant, I missed the launch of this competition; I’ll definitely give it a go in a week or two once I decompress after the deepfakes comp :sweat_smile: Would be keen to see if we can get fastai v2 working on TPU too; it could be a great showcase.

  • @ilovescience by “IO/CPU-limited” do you mean that something related to the dataloaders was slowing things down?

Getting MultiFiT up and running in fastai v2 would be great too. I tried playing around with the QRNNs a while ago for the Google QUEST competition, but hit a bug in the SentenceEncoder (which was subsequently fixed, after the comp ended). Worth a look again though! Smerity’s SHA-RNN also looks super promising, and it seems to have been largely neglected by the NLP community so far.

Re FastHugs:
@ilovescience thanks for pointing it out, I learned something new! I had only focused on sequence classification with FastHugs, but it looks like you’re right that token_type_ids are needed for sequence-pair classification like this comp. (Nice token_type_ids explainer here for folks who aren’t familiar.) Shouldn’t be too much work to incorporate; I’ll try to get to it in a week or so. FastHugs PRs are welcome too!
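
In case it’s useful to see concretely, the tokenizer already emits them for a text pair; the model side just has to pass them through (quick sketch):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained('bert-base-multilingual-cased')
    enc = tok.encode_plus('first sentence', 'second sentence')

    print(enc['input_ids'])       # [CLS] first sentence [SEP] second sentence [SEP], as ids
    print(enc['token_type_ids'])  # 0s for the first segment, 1s for the second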
