That’s awesome! I haven’t had much luck with freezing the encoder and using the last hidden state as input, with or without LM finetuning. Glad to know that end-to-end finetuning works!
Thanks @rudraksh . I’m not sure I follow what you meant.
I’m hoping to try out a few more tricks, see if we can get to 0.11 log loss and then move on to the universal sentence encoder based backbone. Please let me know how we can collaborate.
So, in order to facilitate a fair comparison with USE, I froze the LM backbone and passed both questions individually through the encoder, taking the last hidden state as the question embedding. These embeddings were then concatenated and an MLP was trained to output whether they are duplicates or not. Unfortunately, this did not work well, and I wasn’t able to reduce neg log loss on the validation set below 0.6. I then tried fine-tuning the LM backbone on all the questions and repeating the above procedure; in this case, the neg log loss plateaued around 0.5.
In retrospect, assuming that the last hidden state captures all the semantics of the sentence seems kinda naive. Maybe all the hidden states at each timestep need to be combined (via an attention mechanism?) in order to get the question embedding.
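To make the "combine all the hidden states" idea concrete, here is a minimal sketch of concat pooling: the last state joined with the element-wise max and mean over all timesteps. This is a simpler alternative to attention, not the attention mechanism itself, and as far as I recall it is what the FitLaM paper uses for its classifier head. The function name `concat_pool` and the toy numbers are made up for illustration, and plain Python lists stand in for real encoder outputs.

```python
from typing import List

Vector = List[float]


def concat_pool(hidden_states: List[Vector]) -> Vector:
    """Combine per-timestep hidden states into one fixed-size embedding.

    Returns [last state, element-wise max, element-wise mean] concatenated,
    so the output is three times the hidden size.
    """
    if not hidden_states:
        raise ValueError("need at least one timestep")
    dim = len(hidden_states[0])
    steps = len(hidden_states)
    last = list(hidden_states[-1])
    max_pool = [max(h[i] for h in hidden_states) for i in range(dim)]
    mean_pool = [sum(h[i] for h in hidden_states) / steps for i in range(dim)]
    return last + max_pool + mean_pool


# Toy example (hypothetical values): 3 timesteps, hidden size 2.
states = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
embedding = concat_pool(states)
print(len(embedding), embedding)
```

The pooled vector for each question could then replace the last-hidden-state embedding before concatenating the pair and feeding the MLP.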
Yes…all states could add value here. But I also took only the last hidden state from the FitLaM encoder.
Also, I didn’t do cosine similarity. I just spit out the last layer neuron into the BCELogloss.
Finally, I did unfreeze and train e2e to get good results. Could you please try that?
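For anyone comparing the numbers in this thread, here is a rough sketch of the log loss for a single output neuron, assuming that neuron emits a raw logit and labels are 0/1. The helper names `bce_with_logit` and `mean_log_loss` are mine, not from anyone's notebook. It uses the usual numerically stable form of binary cross-entropy.

```python
import math
from typing import Sequence


def bce_with_logit(logit: float, label: int) -> float:
    """Binary cross-entropy for one raw logit and a 0/1 label.

    Stable form: max(z, 0) - z * y + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def mean_log_loss(logits: Sequence[float], labels: Sequence[int]) -> float:
    """Average the per-example loss over a batch."""
    return sum(bce_with_logit(z, y) for z, y in zip(logits, labels)) / len(labels)


# A logit of 0 means a predicted probability of 0.5, whatever the label.
chance = bce_with_logit(0.0, 1)
print(round(chance, 4))
```

A model that always predicts 0.5 scores ln 2, roughly 0.693, so a validation loss around 0.6 is only a little better than guessing.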
The whole point of LM fine tuning is the fine tuning. So I’m not sure how useful a comparison it is to freeze the LM backbone!
I did try fine-tuning the LM backbone but unfortunately, that didn’t really help it outperform USE defaults. (0.5 nll vs 0.3). End-to-end finetuning as @narvind2003 suggested definitely gave FitLaM the edge and I was able to reach 0.25. Possibly we can incorporate multi-task learning and try training on datasets other than wt103 to improve upon the pretrained weights of FitLaM, as suggested in your paper.
You mean pretrained weights, correct?
I asked the TensorFlow Hub team what data they used to pretrain their transformer model. I didn’t get a clear response. One of the reasons I think it works so well is the quality & volume of data they have used. That’s the hunch anyway and we need to benchmark the transformer vs awd-lstm backbones to really find out.
More than volume I think diversity in datasets and training objectives is the key to their good out of the box performance.
@narvind2003 Dude, I think there’s some data leakage between your training and validation set. In order to make the model insensitive to a specific question order in a pair, you basically duplicate your labelled data and switch the question order. Ideally you should split your data into training and validation set before this duplication business and not after.
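A minimal sketch of the order of operations described above: split first, then add the swapped copies. The function name `split_then_augment`, the 20% validation fraction, and the toy data are assumptions for illustration. Here the swapped copies are added to the training set only, which is one possible choice.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, int]  # (question1, question2, is_duplicate)


def split_then_augment(
    pairs: List[Pair], val_frac: float = 0.2, seed: int = 0
) -> Tuple[List[Pair], List[Pair]]:
    """Split labelled pairs first, then add swapped copies to training only."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    val, train = shuffled[:n_val], shuffled[n_val:]
    # Swapping happens after the split, so a pair and its mirror image
    # can never end up on opposite sides of it.
    train_aug = train + [(q2, q1, y) for q1, q2, y in train]
    return train_aug, val


# Toy data (hypothetical): 10 labelled question pairs.
pairs = [(f"q{i}a", f"q{i}b", i % 2) for i in range(10)]
train_aug, val = split_then_augment(pairs)

# Order-insensitive keys make it easy to check for leakage.
train_keys = {frozenset((q1, q2)) for q1, q2, _ in train_aug}
val_keys = {frozenset((q1, q2)) for q1, q2, _ in val}
print(len(train_aug), len(val), train_keys.isdisjoint(val_keys))
```

If the duplication is done before the split instead, a pair can land in training while its mirror lands in validation, and the validation loss ends up looking better than it should.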
Oh ok. Did that change make a difference in your training?
I wasn’t able to get a nll below 0.34 on the validation set, although my implementation is slightly different from yours.
I tweaked the dropouts quite a bit to avoid overfitting. I could get trn nll 0.074 and val nll 0.154. I had to use an extremely low alpha, and the needle barely moved after training overnight on volta100.
Note: these are the exact train-val splits and FitLaM-backed model I’ve been using so far. Kaggle is throwing a submission error (it says the competition is inactive), though I could submit yesterday.
Hope it lets me submit again…in the meantime, let me perform the “human evaluation” of the semantic understanding of these models - the original thing I set out to do.
Yeah, or maybe it’s because I only did one epoch of LM training. I’m also only able to use 30 bptt and batch sizes of 16 due to GPU limitation.
I’ll try increasing batch size and pretraining epochs of lm to see if the performance improves.
I’m confused why Delip Rao says this. Can anyone clarify please?
Check out @deliprao’s Tweet: https://twitter.com/deliprao/status/992583524812115969?s=09
Update: I asked Jeremy who kindly clarified on the same tweet thread. Thanks.
I finally found what I was looking for!!!
In the MERLIN talk by Greg Wayne et al @ DeepMind, he mentions this idea of cutting down BPTT to smaller chunks to assign credit to shorter time intervals.
Link - watch at time 34:36
This is in a reinforcement learning setting, where the agent moves around the world and thinks about which events in the recent past have most effect on what it’s seeing/doing currently. Using truncated BPTT, they don’t have to look too far back in time. And it makes intuitive sense… like if your stomach is upset, it’s probably what you ate a while ago, not what you ate 2 days ago.
This is the first time something which had struck me while I was working on a model became clearer when watching a totally different talk…even though it doesn’t work quite reliably in NLP
Mind = Blown!
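For reference, a rough sketch of what cutting BPTT into chunks looks like on the data side, assuming a next-token prediction setup. The helper name `bptt_chunks` and the toy token ids are invented for illustration. Each window is at most `bptt` steps long (the thread mentions 30), and gradients would only flow within a window.

```python
from typing import Iterator, List, Tuple


def bptt_chunks(
    tokens: List[int], bptt: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (inputs, targets) windows of at most `bptt` steps.

    Targets are the inputs shifted by one token (next-token prediction).
    """
    if bptt < 1:
        raise ValueError("bptt must be at least 1")
    last_input = len(tokens) - 1  # the final token is only ever a target
    for start in range(0, last_input, bptt):
        end = min(start + bptt, last_input)
        yield tokens[start:end], tokens[start + 1:end + 1]


# Toy sequence (hypothetical): 10 token ids, windows of 4 steps.
tokens = list(range(10))
chunks = list(bptt_chunks(tokens, bptt=4))
for inputs, targets in chunks:
    # In a real training loop the hidden state would be carried into the
    # next window but cut from the graph (e.g. hidden.detach() in PyTorch),
    # so credit is only assigned within the current window.
    print(inputs, targets)
```

A smaller `bptt` means less memory per step and credit assigned over shorter spans, at the cost of not learning dependencies longer than one window through gradients.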