But I fail to understand where this figure comes from: 3008? How can we change it if we want?
Secondly, this is a Notebook Only Competition.
Surprisingly and annoyingly, when I submit my submission file, it gives me a strange error: Submission Scoring Error. I checked and found that all the rules of the competition are followed. Can anyone who is participating in this help me with this as well?
Did you ever figure out your questions? I’m not sure about the first question, but I ran into some similar troubles for the Kannada MNIST challenge, which is also a kernels-only challenge.
I received submission errors when I had some numbers hardcoded in my submission. For example, if there are 100 items in the test file, you would want to do something like len(testfile), as opposed to saying testfile_len = 100.
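A minimal sketch of that idea in plain Python follows. The function names `make_predictions` and `write_submission`, the `id`/`prediction` column names, and the constant 0.5 prediction are all hypothetical stand-ins, not taken from either competition. The only point is that every size is derived from the test data itself.

```python
import csv
import io


def make_predictions(test_rows):
    """Return one placeholder prediction per test row (hypothetical model stand-in)."""
    # The size comes from the data, never from a hardcoded constant such as 100.
    return [0.5] * len(test_rows)


def write_submission(test_rows, predictions, out_file):
    """Write an id/prediction CSV and return the number of data rows written."""
    if len(predictions) != len(test_rows):
        raise ValueError("need exactly one prediction per test row")
    writer = csv.writer(out_file)
    writer.writerow(["id", "prediction"])  # hypothetical column names
    for row, pred in zip(test_rows, predictions):
        writer.writerow([row["id"], pred])
    return len(test_rows)


if __name__ == "__main__":
    # The same code works whether the test file has 3 rows or 3,000.
    test_rows = [{"id": i} for i in range(3)]
    buffer = io.StringIO()
    n_written = write_submission(test_rows, make_predictions(test_rows), buffer)
    print(n_written)
```

If the test set used for scoring has a different number of rows from the one visible in the notebook, a hardcoded 100 would presumably produce a file of the wrong length, which is one plausible cause of a scoring error.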
Thanks Melissa, this is really awesome. With the internet not accessible from the kernel, it would be really fantastic if we can use RoBERTa, BERT or other archs. I will go through your kernel ASAP and share mine too if I manage to improve the score.
I’m sure you can improve the score if the model is trained a bit more. I plan to go through and see how many more layers I can unfreeze in training before the memory runs out.
There’s a bit of a trick to getting the Hugging Face models to work in an internet-disabled kernel. On your cloud/home computer, you’ll need to save the tokenizer, config and model with .save_pretrained(). Then you can upload those files as a dataset and load them with the .from_pretrained() command.
It took me quite an angsty weekend to figure that silly part out.
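The save-then-upload workflow above can be sketched roughly as follows. This assumes the `transformers` library's `Auto*` classes. The function names `export_for_offline` and `load_offline`, the model name, and the dataset path in the comments are placeholders, not taken from the post.

```python
from pathlib import Path


def export_for_offline(model_name, out_dir):
    """Run on a machine WITH internet: download once, then save everything locally."""
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Each object writes its own files (vocab, config.json, weights) into out_dir.
    AutoTokenizer.from_pretrained(model_name).save_pretrained(str(out))
    AutoConfig.from_pretrained(model_name).save_pretrained(str(out))
    AutoModel.from_pretrained(model_name).save_pretrained(str(out))
    return out


def load_offline(saved_dir):
    """Run in the internet-disabled kernel: load from the uploaded directory only."""
    from transformers import AutoConfig, AutoModel, AutoTokenizer

    saved_dir = str(saved_dir)
    tokenizer = AutoTokenizer.from_pretrained(saved_dir)
    config = AutoConfig.from_pretrained(saved_dir)
    model = AutoModel.from_pretrained(saved_dir, config=config)
    return tokenizer, config, model


# On the home/cloud machine (placeholder model name):
#   export_for_offline("bert-base-uncased", "bert_offline")
# After uploading that folder as a dataset (placeholder dataset name):
#   tokenizer, config, model = load_offline("/kaggle/input/my-bert-files")
```

Passing a directory path instead of a model name is what keeps `from_pretrained` from trying to reach the network.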
I just published a summary of the Top 5 winning solutions here:
Some really interesting solutions: combining multiple transformers into a single model, differential learning rates, pseudo-labelling and post-processing (either binning or thresholding) were all key. Interestingly, not much text preprocessing was mentioned in the top solutions.
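As an illustration of the binning idea only, here is a minimal sketch. It is not code from any of the winning solutions; the helper name `bin_predictions` and the default of 10 bins are assumptions made here. It clips each prediction to [0, 1] and snaps it to the nearest of a small set of evenly spaced values, so that nearby predictions collapse onto shared values.

```python
def bin_predictions(preds, n_bins=10):
    """Clip each prediction to [0, 1], then snap it to the nearest multiple of 1 / n_bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    binned = []
    for p in preds:
        clipped = min(1.0, max(0.0, p))  # keep the value inside the valid range
        binned.append(round(clipped * n_bins) / n_bins)
    return binned


if __name__ == "__main__":
    print(bin_predictions([0.04, 0.26, 0.99]))
```

The actual bin edges or thresholds used by the top teams may well have been tuned per target column, so treat the evenly spaced grid as a placeholder.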
I just published 4 notebooks that go from pretraining a language model all the way to test set prediction and creation of the submission file, all using AWD-LSTM and fastai v2.
NB 1. Q&A Data for Pretraining
Processes and combines 3 different text datasets into a single source ready for language model pretraining. This notebook outputs an 850 MB text file with 84M words/tokens, distributed as follows:
65% from wiki103
18% from Tensorflow 2.0 Q&A
17% from the StackSample dataset
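For a rough sense of scale, the percentages above can be turned into approximate word counts per source. This is a back-of-the-envelope sketch: the totals are only as precise as the rounded figures quoted in the list, and the helper name `words_per_source` is made up for illustration.

```python
TOTAL_WORDS = 84_000_000  # "84M words/tokens" from the notebook description

# Percentages as quoted in the list above
SHARES_PCT = {"wiki103": 65, "Tensorflow 2.0 Q&A": 18, "StackSample": 17}


def words_per_source(total_words, shares_pct):
    """Split a total word count across sources using integer percentages."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total_words * pct // 100 for name, pct in shares_pct.items()}


if __name__ == "__main__":
    for name, n_words in words_per_source(TOTAL_WORDS, SHARES_PCT).items():
        print(f"{name}: ~{n_words / 1e6:.2f}M words")
```

That works out to roughly 54.6M words from wiki103, 15.1M from Tensorflow 2.0 Q&A and 14.3M from StackSample.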
NB 2. Pretraining an AWD LSTM model with fastai v2
This notebook will pretrain an AWD LSTM model using a custom text dataset designed especially for this Q&A competition.
The SentencePiece tokenizer with Byte-Pair Encoding (BPE) was used for tokenization instead of the standard fastai spaCy tokenizer. The model was trained for 7 epochs, at 2h14m per epoch.
NB 3. Language Model Finetuning on competition Q&A
Finetune the pretrained AWD LSTM language model on the competition Q&A data. Because finetuning the LM needs only the text and no labels, we can use all of the competition data, both the train and test sets.
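A rough sketch of why the test set can be included: language model finetuning only reads the text columns, so train and test rows can simply be pooled. The helper name `build_lm_corpus` and the column names in the example are assumptions for illustration and are not taken from the notebooks.

```python
def build_lm_corpus(train_rows, test_rows, text_cols):
    """Join the text columns of every train and test row into one list of documents."""
    corpus = []
    for row in [*train_rows, *test_rows]:
        # Only text columns are read; target columns, if present, are ignored.
        corpus.append(" ".join(str(row[col]) for col in text_cols))
    return corpus


if __name__ == "__main__":
    # Hypothetical column names and toy rows
    train = [{"question_title": "How do I X?", "answer": "Try Y.", "some_target": 0.7}]
    test = [{"question_title": "Why does Z fail?", "answer": "Because of W."}]
    for doc in build_lm_corpus(train, test, ["question_title", "answer"]):
        print(doc)
```

The classifier in the next notebook is a different matter: it needs the targets, so only the train set can be used there.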
NB 4. AWD LSTM Q&A classification and prediction
Test set classification and prediction.
Custom Transform
One thing I had to do to get classification working in fastai v2 was to create a custom transform to input and display the 30 float targets for this competition. It is passed in y_tfms, like so:
y_tfms = [GetMultiColFloatLabels(label_cols)]
Full Transform code:
from fastai.text.all import *  # assumes a released fastai v2 install; pre-release builds used the `fastai2` package

class TensorMultiColLabels(TensorBase): pass

class GetMultiColFloatLabels(Transform):
    'Transform to grab multiple float labels from multiple columns of a df'
    order=1
    def __init__(self, label_cols:list=None, c:int=None):
        # Fail early if the label columns are not given as a list
        if not isinstance(label_cols, list): raise TypeError('label_cols must be a list')
        self.label_cols = label_cols
        # c is the number of targets; default to one per label column
        self.c = len(label_cols) if c is None else c

    # Pull the label columns out of a dataframe row and convert them to a float tensor
    def encodes(self, o): return TensorMultiColLabels(tensor(list(o[self.label_cols])).float())

    # Return a dict, which gets parsed in the custom show_batch function
    def decodes(self, o): return {self.label_cols[i]:o[i] for i in range(o.size()[0])}

@typedispatch
def show_batch(x: TensorText, y:TensorMultiColLabels, samples, ctxs=None, max_n=10, trunc_at=150, **kwargs):
    samples = L((s[0].truncate(trunc_at),*s[1:]) for s in samples)
    # One row per sample: the truncated text followed by its label values
    fin_ls = [[s[0]] + list(s[1].values()) for s in samples]
    # Column names come from the decoded label dict of the first sample
    cols = ['doc'] + list(samples[0][1].keys())
    display_df(pd.DataFrame(fin_ls, columns=cols))
    return fin_ls