Lovely! Thanks man.
Did you try using both arccos and cosine sim and find any improvement with arccos?
Yeah I did, there wasn’t any improvement.
Ok…I have some news. After ditching all other methods, I found some time to work on the FitLam-based model.
The Kaggle Quora duplicates winner got log loss 0.11 after ensembling a gazillion models and heavy feature engineering.
Our FitLam single model with just a bit of training gives 0.19 straight out of the box! Holy cow!!
There’s still bidir, concat pooling and other stuff to try! So when @jeremy and Sebastian say that FitLam is akin to AlexNet for NLP…it’s not to be taken lightly!
Note: the default methods in the fast.ai stepper class don’t allow for input X pairs/lists of unequal lengths. So I had to make some minor edits. Let me know if you need more info.
I’d be interested to hear what you did here.
I actually took care of the bulk of it in my Pairdataset; if you look at the notebook, it’s under the section: Create dataloader for Quora classifier.
So in the stepper class, I modified the step and evaluate methods where self.m is called.
If len(xs) > 1, I pass [xs] to the model; otherwise I pass xs as-is.
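To make it concrete, here’s the gist of that tweak pulled out as a standalone function (Stepper, step and evaluate are the fast.ai names; the function below just illustrates the branching, it’s not the actual patch):

```python
def forward_inputs(model, xs):
    """Sketch of the edit to Stepper.step/evaluate: when the batch
    holds a pair of inputs (e.g. question1, question2), hand the whole
    list to the model as one argument instead of unpacking it."""
    if len(xs) > 1:
        return model(xs)   # equivalent to self.m(*[xs])
    return model(*xs)      # the usual single-input path
```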
BTW, the validation set accuracy is 98.11% which I didn’t include in the notebook.
And a question I’ve had in the back of my mind for a while now:
Why only 1 backbone?
What’s stopping us from having multiple FitLam backbones and letting a custom head use an attention mechanism to “learn” how to deal with them effectively?
Then you could mix and match pretrained knowledge. It’s sort of like how they load the jiu-jitsu program into Neo’s head in The Matrix.
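Concretely, the kind of head I’m imagining would look something like this (plain NumPy sketch; the backbone embeddings and the scoring vector are stand-ins, nothing here is trained FitLam weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_over_backbones(embeddings, query):
    """embeddings: (n_backbones, d), one sentence embedding per backbone.
    query: (d,), a learned vector the head uses to score each backbone.
    Returns one (d,) embedding: an attention-weighted mix of the backbones."""
    scores = embeddings @ query        # (n_backbones,)
    weights = softmax(scores)          # nonnegative, sums to 1
    return weights @ embeddings        # (d,)

# Toy usage: three hypothetical backbones emitting 4-dim embeddings.
embs = np.array([[1., 0., 0., 0.],
                 [0., 1., 0., 0.],
                 [0., 0., 1., 0.]])
mixed = attend_over_backbones(embs, np.array([5., 0., 0., 0.]))
```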
Are there any papers/projects where this is shown to work/not-work?
And finally, I found that slowly reducing BPTT from 70 (the source wikitext103 model) down to 20 (mean + std. dev. of Quora question lengths) helped training greatly. It was perhaps just a fluke, but it seemed like an intuitive thing to try: the LM has to work harder to predict the next word from a shorter input sequence, plus I was mainly trying to fit bigger batches onto the GPU. So I am calling this BPTT annealing.
Not sure if this is also worth pursuing, and if it has a proper name in the literature.
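For clarity, here’s a minimal version of the schedule I mean (I actually reduced BPTT by hand between runs; the linear step-down below is just a sketch with my numbers, 70 from the wikitext103 LM down to 20 for Quora):

```python
def bptt_schedule(epoch, n_epochs, bptt_start=70, bptt_end=20):
    """Linearly anneal BPTT from the source-LM value down to roughly
    (mean + std. dev.) of the target corpus sequence lengths."""
    frac = epoch / max(n_epochs - 1, 1)
    return round(bptt_start + frac * (bptt_end - bptt_start))

# e.g. over 6 epochs: 70, 60, 50, 40, 30, 20
```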
Seeking feedback from experts and @jeremy.
That’s awesome! I haven’t had much luck with freezing the encoder and using the last hidden state as input, with or without LM finetuning. Glad to know that end-to-end finetuning works!
Thanks @rudraksh . I’m not sure I follow what you meant.
I’m hoping to try out a few more tricks, see if we can get to 0.11 log loss and then move on to the universal sentence encoder based backbone. Please let me know how we can collaborate.
So, in order to facilitate a fair comparison with USE, I froze the LM backbone and passed both questions individually through the encoder, taking the last hidden state as the question embedding. These embeddings were then concatenated and an MLP was trained to output whether the pair is a duplicate or not. Unfortunately, this did not work well, and I wasn’t able to reduce the negative log loss on the validation set below 0.6. I then tried fine-tuning the LM backbone on all the questions and repeating the above procedure; in this case, the negative log loss plateaued around 0.5.
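In code, the frozen-encoder setup was essentially this (NumPy stand-in; emb1/emb2 play the role of the last hidden states, and the MLP weights here are placeholders, not the trained ones):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def duplicate_prob(emb1, emb2, W1, b1, w2, b2):
    """Concatenate the two question embeddings and run a small MLP
    (one ReLU hidden layer, one output neuron) to get P(duplicate)."""
    x = np.concatenate([emb1, emb2])   # (2d,)
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return sigmoid(w2 @ h + b2)        # scalar probability
```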
In retrospect, assuming that the last hidden state captures all the semantics of the sentence seems kinda naive. Maybe all the hidden states at each timestep need to be combined (via an attention mechanism?) in order to get the question embedding.
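Something like this is what I have in mind, i.e. pooling all the timestep states with attention instead of keeping only the last one (NumPy sketch; the query vector would be learned):

```python
import numpy as np

def attention_pool(hidden_states, query):
    """hidden_states: (seq_len, d), the encoder output at every timestep.
    query: (d,), a learned scoring vector.
    Returns a (d,) question embedding: a weighted sum over all timesteps."""
    scores = hidden_states @ query     # (seq_len,)
    e = np.exp(scores - scores.max())
    weights = e / e.sum()              # softmax over timesteps
    return weights @ hidden_states     # (d,)
```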
Yes…all states could add value here. But I too took only the last hidden state from the FitLam encoder.
Also, I didn’t do cosine similarity. I just fed the last layer’s single output neuron into the BCE log loss.
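In other words, per pair the loss is just this (a from-scratch illustration; in practice a library BCE-with-logits loss does the same thing):

```python
import math

def bce_logloss(logit, label):
    """Binary cross-entropy on a single raw logit: sigmoid it, then take
    the negative log likelihood of the 0/1 duplicate label."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))
```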
Finally, I did unfreeze and train e2e to get good results. Could you please try that?
The whole point of LM fine tuning is the fine tuning. So I’m not sure how useful a comparison it is to freeze the LM backbone!
I did try fine-tuning the LM backbone but unfortunately, that didn’t really help it outperform USE defaults. (0.5 nll vs 0.3). End-to-end finetuning as @narvind2003 suggested definitely gave FitLaM the edge and I was able to reach 0.25. Possibly we can incorporate multi-task learning and try training on datasets other than wt103 to improve upon the pretrained weights of FitLaM, as suggested in your paper.
You mean pretrained weights, correct?
I asked the TensorFlow Hub team what data they used to pretrain their transformer model. I didn’t get a clear response. One of the reasons I think it works so well is the quality and volume of the data they used. That’s the hunch anyway, and we need to benchmark the transformer vs AWD-LSTM backbones to really find out.
More than volume I think diversity in datasets and training objectives is the key to their good out of the box performance.
@narvind2003 Dude, I think there’s some data leakage between your training and validation sets. In order to make the model insensitive to a specific question order in a pair, you basically duplicate your labelled data and switch the question order. Ideally you should split your data into training and validation sets before this duplication business, not after.
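i.e. split first, then mirror only the training pairs. A small sketch (function and names are mine, not from the notebook):

```python
import random

def split_then_mirror(pairs, labels, val_frac=0.2, seed=42):
    """Split into train/val BEFORE adding the order-swapped copies.
    Mirroring first lets (q1, q2) land in train while (q2, q1) lands
    in val: exactly the leakage described above."""
    rng = random.Random(seed)
    idx = list(range(len(pairs)))
    rng.shuffle(idx)
    n_val = int(len(idx) * val_frac)
    val = [(pairs[i], labels[i]) for i in idx[:n_val]]
    train = [(pairs[i], labels[i]) for i in idx[n_val:]]
    # augment ONLY the training set with the swapped question order
    train += [((q2, q1), y) for (q1, q2), y in train]
    return train, val
```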