Can you share the benchmarks here? I've got the USE benchmarks. Maybe we can make a repo comparing performance of fine-tuned models on a range of commonly used NLP datasets.
Yes, I haven't started the LM-backed classifier yet. I'll share the results as soon as possible.
That's correct.
I'm looking at both cosine similarity on its own and the combination of cosine similarity and CE.
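To make the second option concrete, here's a minimal sketch of mixing cosine similarity with cross-entropy (CE) into one loss. The mixing weight `alpha` and all function names are illustrative assumptions, not the actual implementation being discussed:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cross_entropy(probs, target_idx):
    return float(-np.log(probs[target_idx]))

def combined_loss(emb_a, emb_b, probs, target_idx, alpha=0.5):
    # Pull the pair's embeddings together (1 - cosine) while also
    # training the classifier head with CE; alpha balances the two.
    return alpha * (1.0 - cosine_sim(emb_a, emb_b)) + \
           (1.0 - alpha) * cross_entropy(probs, target_idx)

# Identical embeddings: the cosine term vanishes, leaving only the CE part.
loss = combined_loss(np.array([1.0, 0.0]), np.array([1.0, 0.0]),
                     np.array([0.5, 0.5]), 0)
```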
Exactly. So while the model's perplexity is somewhat lower, the accuracy of word prediction is slightly higher. It's hard to tell whether that results directly from the alternative loss function or not. I'm still working on improving the loss by eliminating the possibility of mode collapse.
So… as I've been reporting, I tried a couple of ways we could improve the results in the Quora notebook. I tried two things:
- Commutating the columns (swapping the question pair) to double the dataset
- Adding an additional uni-directional GRU layer to provide alternate pathways for our gradients to flow (à la ResNets)
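The first bullet, sketched with pandas (column names follow the Quora dataset's usual schema, but are illustrative here): since a duplicate label is symmetric, each (q1, q2) pair is also valid as (q2, q1), so swapping the columns doubles the training data.

```python
import pandas as pd

def commute_pairs(df, col_a="question1", col_b="question2"):
    """Append a copy of df with the two question columns swapped."""
    # rename applies the mapping simultaneously, so this swaps the labels.
    swapped = df.rename(columns={col_a: col_b, col_b: col_a})
    return pd.concat([df, swapped], ignore_index=True)

pairs = pd.DataFrame({
    "question1": ["How do I learn NLP?"],
    "question2": ["What is the best way to study NLP?"],
    "is_duplicate": [1],
})
doubled = commute_pairs(pairs)  # 2 rows: original + commuted
```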
Finally, I compared general semantic evaluation results and checked how we performed specifically on the Quora set (we don't do well).
All results are here:
Though the results are not great, I have a better idea of the problem I'm up against. I hope to work on the real LM-backed STS task next.
I was getting 0.38 with USE + a 3-layer FCN. The vanilla embeddings with FitLaM were obtained by training only on wikitext and aren't expected to have a good understanding of question structure. In your notebook it isn't clear whether you fine-tuned the LM before the classifier backbone. If not, then that should hopefully improve performance.
Wait… did you use the LM backbone or only the weights from the embedding layer?
I'd used the Transformer model from the Universal Sentence Encoder to get the vector representations of sentences. I was relying on you to provide the FitLaM comparison.
Oh, USE = Universal Sentence Encoder!
Got it.
Will post results soon… I've been busy at work…
Can you share your code please? Do you have a Colab notebook?
I took the embedding weights not only after fine-tuning but also after further training the IMDB classifier. I suspect it's the objective I was training for that messed up the Quora-specific log loss.
Sure, just in the process of documenting the notebooks. Will share the repo and data soon.
BTW, did you mean 0.3 validation loss or test loss with the Universal Sentence Encoder?
I thought you made a Kaggle submission with the test set (which is massive compared to the train set) and got a 0.3 NLL.
I meant validation loss. Here's the link to the GitHub repo: https://github.com/RudrakshTuwani/transfer-learning-quora. I'm yet to commit code for neural network baselines on USE. It's really a work in progress as of now, and you should probably look at it after 2–3 days.
In case you're interested in playing with the USE embeddings yourself, the links to the numpy arrays, along with the custom dataset and dataloaders, are available in the "Universal Sentence Encoder" notebook.
Lovely! Thanks man.
Did you try both arccos and cosine similarity, and did you find any improvement with arccos?
Yeah, I did; there wasn't any improvement.
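For reference, a minimal sketch of the two scores being compared: plain cosine similarity versus arccos-based angular similarity. The helper names are illustrative:

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def angular_sim(a, b):
    # arccos converts the cosine into an angle in [0, pi]; dividing by pi
    # and subtracting from 1 maps it to a similarity score in [0, 1].
    cos = np.clip(cosine_sim(a, b), -1.0, 1.0)
    return 1.0 - float(np.arccos(cos)) / np.pi

same = angular_sim(np.array([1.0, 0.0]), np.array([1.0, 0.0]))   # identical -> 1.0
ortho = angular_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # orthogonal -> 0.5
```

The angular form penalizes small angle differences more evenly near cosine 1, which is why it's sometimes preferred for sentence embeddings; evidently it made no difference here.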
Ok… I have some news. After ditching all other methods, I found some time to work on the FitLaM-based model.
The Kaggle Quora duplicates winner got a log loss of 0.11 after ensembling a gazillion models and feature engineering.
Our FitLaM single model, with just a bit of training, gives 0.19 straight out of the box! Holy cow!!
There's still bidir, concat pooling and other stuff to try! So when @jeremy and Sebastian say that FitLaM is akin to AlexNet for NLP… it's not to be taken lightly!
Note: the default methods in the fast.ai Stepper class don't allow input X pairs/lists of unequal lengths, so I had to make some minor edits. Let me know if you need more info.
I'd be interested to hear what you did here.
I actually took care of the bulk of it in my PairDataset; if you see the notebook, it's under the section "Create dataloader for Quora classifier".
So in the Stepper class, I modified the step and evaluate methods where self.m is called:
if len(xs) > 1, pass [xs] to the model, else pass xs.
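As a rough sketch of that tweak (assuming the old fast.ai Stepper interface, where the model is invoked as `self.m(*xs)`; the wrapper name is illustrative):

```python
# At the self.m(*xs) call sites in Stepper.step / Stepper.evaluate:
# when the batch carries more than one input tensor (e.g. a question
# pair of unequal lengths), wrap xs in a list so the model receives
# the whole pair as a single argument.
def call_model(m, xs):
    if len(xs) > 1:
        return m(*[xs])  # i.e. m(xs): the pair arrives as one argument
    return m(*xs)

record = lambda *args: args  # toy "model" that just records its arguments
paired = call_model(record, ["q1_batch", "q2_batch"])
single = call_model(record, ["x_batch"])
```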
BTW, the validation set accuracy is 98.11%, which I didn't include in the notebook.
And a question I've had in the back of my mind for a while now:
Why only 1 backbone?
What's stopping us from having multiple FitLaM backbones and letting a custom head use an attention mechanism to "learn" how to deal with them effectively?
Then you can do:
learn[0][0].load('wikitext103')
learn[0][1].load('imdb')
learn[0][2].load('quora')
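A minimal sketch of what that head might do, with plain softmax attention over per-backbone representations (all names, shapes, and the scoring scheme are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(backbone_states, logits):
    """Combine one fixed-size vector per pretrained backbone into a
    single representation, weighted by learned attention logits."""
    w = softmax(np.asarray(logits, dtype=float))
    return np.sum([wi * h for wi, h in zip(w, backbone_states)], axis=0)

# Equal logits -> the two (toy) backbone outputs are averaged.
pooled = attention_pool([np.ones(3), np.zeros(3)], [0.0, 0.0])
```

The classifier head would then sit on top of `pooled`, and the logits would come from a small learned scoring network rather than being fixed as here.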
It's sort of like how they load the jiu-jitsu program into Neo's head in The Matrix.
Are there any papers/projects where this is shown to work/not-work?
And finally, I was finding that slowly reducing BPTT from 70 (the source model, wikitext103) down to 20 (mean + std. dev. of Quora question lengths) helped with training greatly. It was perhaps just a fluke, but it seemed like an intuitive thing to try: the LM has to work harder to predict the next word from a shorter input sequence, plus I was mainly trying to fit bigger batches onto the GPU. So I am calling this BPTT annealing.
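Sketched as a schedule (a linear ramp is an assumption; only the endpoints 70 and 20 come from the description above):

```python
def bptt_schedule(epoch, n_epochs, start=70, end=20):
    """Linearly anneal the BPTT length from `start` down to `end`."""
    frac = epoch / max(n_epochs - 1, 1)
    return round(start + frac * (end - start))

# Over 6 epochs: 70, 60, 50, 40, 30, 20
lengths = [bptt_schedule(e, 6) for e in range(6)]
```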
Not sure if this is worth pursuing, or whether it already has a proper name in the literature.
Seeking feedback from experts and @jeremy.
Thanks!