Kaggle Competition - Google QUEST

abhikjha · November 24, 2019, 8:24am

Hi Everyone

An NLP-based competition was recently launched on Kaggle.

Is anyone from the fastai community participating in it?

I have two questions which I encountered while using Fastai AWD_LSTM models to participate in this competition:

After training the language model, when we print this:

print(learn.model[0].encoder)

I get this:

Embedding(3008, 400, padding_idx=1)

Here, I know that the embedding size (400) can be changed if required by editing

awd_lstm_lm_config = dict( emb_sz=400, n_hid=1150, n_layers=3, pad_token=1, qrnn=False, bidir=False, output_p=0.1,
                          hidden_p=0.15, input_p=0.25, embed_p=0.02, weight_p=0.2, tie_weights=True, out_bias=True)

But I fail to understand where this figure, 3008, comes from. How can we change it if we want?

Secondly, this is a notebook-only competition.

Very surprisingly and annoyingly, when I submit my submission file it gives me a funny error: Submission Scoring Error. I checked, and all the competition rules are followed. Can anyone who is participating help me with this as well?

Cheers!
Abhik

mrajaram · December 17, 2019, 1:23am

hi Abhik,

Did you ever figure out your questions? I’m not sure about the first question, but I ran into some similar troubles for the Kannada MNIST challenge, which is also a kernels-only challenge.

I received submission errors when I had some numbers hardcoded in my submission. For example, if there are 100 items in the test file, you would want to do something like len(testfile), as opposed to saying testfile_len = 100.
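
Here is a minimal sketch of what I mean, in plain Python. The file contents, the qa_id column and the build_submission helper are all made up for illustration. My understanding is that a kernels-only submission gets re-run against a hidden test set, so its row count can differ from the public file.

```python
import csv
import io


def build_submission(test_rows, predict_fn, id_col="qa_id"):
    """Return one submission row per test row, however many rows there are."""
    return [{id_col: row[id_col], "prediction": predict_fn(row)} for row in test_rows]


# Hypothetical stand-in for the public test file (the real one is read from disk)
public_test = "qa_id,text\n1,foo\n2,bar\n3,baz\n"
test_rows = list(csv.DictReader(io.StringIO(public_test)))

# Derive the size from the data instead of typing it in (e.g. testfile_len = 3)
n_test = len(test_rows)

# Dummy model: always predicts 0.5
submission = build_submission(test_rows, lambda row: 0.5)
assert len(submission) == n_test
```

Since nothing in build_submission depends on a fixed count, the same code still produces one row per item if the test file is swapped for a larger one.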

For my first submission, I decided to use some pretrained transformers: https://www.kaggle.com/melissarajaram/roberta-fastai-huggingface-transformers

The overall score isn’t so great, but I’m just happy that I got it to run!

~Melissa

abhikjha · December 17, 2019, 2:49am

Thanks Melissa, this is really awesome. With the internet not accessible from the kernel, it will be really fantastic if we can use RoBERTa, BERT or other archs. I will go through your kernel asap and share mine too if I can manage to improve the score.

Thanks a lot
Abhik

mrajaram · December 17, 2019, 4:32am

I’m sure you can improve the score if the model is trained a bit more. I plan to go through and see how many more layers I can unfreeze in training before the memory runs out.

There’s a little bit of a trick to getting the huggingface models to work on the internet-disabled kernel. On your cloud/home computer, you’ll need to save the tokenizer, config and model with .save_pretrained(). Then, you can upload those files as a dataset to use with the .from_pretrained() command.
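
Roughly, the two halves look like the sketch below. The function names and directory arguments are placeholders I made up, and I'm using the Auto* classes here as an assumption; the model-specific classes should work the same way.

```python
from pathlib import Path


def export_for_offline_kernel(model_name: str, out_dir: str) -> Path:
    """Run on a machine WITH internet: download the pieces and save them locally."""
    # Imported here so the helpers can be defined even where transformers is absent
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    AutoTokenizer.from_pretrained(model_name).save_pretrained(str(out))
    AutoConfig.from_pretrained(model_name).save_pretrained(str(out))
    AutoModel.from_pretrained(model_name).save_pretrained(str(out))
    return out


def load_in_offline_kernel(dataset_dir: str):
    """Run inside the kernel: from_pretrained reads the uploaded files, no download."""
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(dataset_dir)
    config = AutoConfig.from_pretrained(dataset_dir)
    model = AutoModel.from_pretrained(dataset_dir, config=config)
    return tokenizer, config, model
```

For example, call export_for_offline_kernel("roberta-base", "roberta-base-offline") at home, upload that folder as a Kaggle dataset, then pass the dataset's folder under the kernel's input directory to load_in_offline_kernel. The exact folder path depends on what you name the dataset.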

It took me quite an angsty weekend to figure that silly part out.

abhikjha · December 17, 2019, 7:26am

Ahh! thanks for sharing this trick. I actually forked your kernel and was wondering where I can see these files…

morgan · February 12, 2020, 1:49pm

Hey all,

I just published a summary of the Top 5 winning solutions here:

Some really interesting solutions: combining multiple transformers into a single model, differential learning rates, pseudo-labelling and post-processing (either binning or thresholding) were all key. Interestingly, not much text preprocessing was mentioned in the top solutions.
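
To give a feel for the post-processing step, here is a generic sketch of binning and thresholding predictions that lie in [0, 1]. This is not taken from any of the winning write-ups; the function names, the number of bins and the cut-offs are made up for illustration.

```python
def bin_predictions(preds, n_bins=10):
    """Snap each prediction to the nearest of n_bins + 1 evenly spaced values in [0, 1]."""
    # Clip to [0, 1] first so out-of-range predictions land on the end bins
    return [round(min(max(p, 0.0), 1.0) * n_bins) / n_bins for p in preds]


def threshold_predictions(preds, low=0.05, high=0.95):
    """Push predictions close to the extremes all the way to 0 or 1."""
    return [0.0 if p < low else 1.0 if p > high else p for p in preds]


binned = bin_predictions([0.12, 0.18, 0.97])
clipped = threshold_predictions([0.01, 0.5, 0.99])
```

Both functions reduce the number of distinct values in a column of predictions; how coarse to go would have to be tuned on validation data.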

morgan · February 19, 2020, 7:13pm

Just published 4 notebooks that go from pre-training a language model all the way to test set prediction and creation of submission file. All using AWD-LSTM and fastai v2

NB 1. Q&A Data for Pretraining

Processes and combines 3 different text datasets into a single source ready for language model pre-training. This notebook outputs an 850 MB text data file with 84M words/tokens with the following distribution:

65% from wiki103
18% from Tensorflow 2.0 Q&A
17% from the StackSample dataset

NB 2. Pretraining an AWD LSTM model with fastai v2

This notebook will pretrain an AWD LSTM model using a custom text dataset designed especially for this Q&A competition.

The SentencePiece tokenizer with Byte-Pair Encoding (BPE) was used for tokenization instead of the standard fastai spaCy tokenizer. The model was trained for 7 epochs, at 2h14m per epoch.

NB 3. Language Model Finetuning on competition Q&A

Finetune the pretrained AWD LSTM Language Model on the competition Q&A data. Because we are finetuning the LM, we can use all of the competition data, both the train and test set.
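
The reason the test set can be used is that language-model fine-tuning only predicts the next token, so it needs text and no labels. A minimal sketch of building such a corpus in plain Python follows; the lm_corpus helper is hypothetical and the column names are assumptions about the competition files, not code from the notebook.

```python
# Assumed names of the text columns in the competition data
TEXT_COLS = ["question_title", "question_body", "answer"]


def lm_corpus(train_rows, test_rows, text_cols=TEXT_COLS):
    """Join the text fields of every train and test row; label columns are ignored."""
    corpus = []
    for row in list(train_rows) + list(test_rows):
        corpus.append(" ".join(row[c] for c in text_cols))
    return corpus


# Toy rows: the train row carries a label, the test row does not
train_rows = [{"question_title": "a", "question_body": "b", "answer": "c", "label": 0.3}]
test_rows = [{"question_title": "d", "question_body": "e", "answer": "f"}]
corpus = lm_corpus(train_rows, test_rows)
```

The classifier in the next notebook still has to be trained on the labelled train rows only.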

NB 4. AWD LSTM Q&A classification and prediction

Test set classification and prediction.

Custom Transform

One thing I had to do to get the classification working in fastai v2 was to create a custom transform to input and display the 30 float targets for this competition. It goes in y_tfms, like so:

y_tfms = [GetMultiColFloatLabels(label_cols)]

Full Transform code:

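# fastai v2 shipped as the `fastai2` package when this was posted;
# in later releases the same import is `from fastai.text.all import *`
from fastai2.text.all import *

# Marker tensor type so show_batch can dispatch on these targets
class TensorMultiColLabels(TensorBase): pass

class GetMultiColFloatLabels(Transform):
    'Transform to grab multiple float labels from multiple columns of a df'
    order=1
    def __init__(self, label_cols:list=None, c:int=None):
        # Fail early if label_cols is not a list
        if not isinstance(label_cols, list): raise TypeError('label_cols must be a list')
        self.label_cols = label_cols
        # c is the number of targets; default to one per label column
        self.c = len(label_cols) if c is None else c
    def encodes(self, o): return TensorMultiColLabels(tensor(list(o[self.label_cols])).float())
    # Return dict which gets parsed in the custom show_batch function
    def decodes(self, o): return {self.label_cols[i]:o[i] for i in range(o.size()[0])}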

@typedispatch
def show_batch(x: TensorText, y:TensorMultiColLabels, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
    samples = L((s[0].truncate(trunc_at),*s[1:]) for s in samples)
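    # Each sample is (text, {label: value}); flatten it into one table row
    fin_ls = [[s[0]] + list(s[1].values()) for s in samples]
    # Column names come from the decoded label dict of the first sample
    cols = ['doc'] + list(samples[0][1].keys())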
    display_df(pd.DataFrame(fin_ls, columns = cols))
    return fin_ls