As discussed in the class 10 in-class thread, an in-depth evaluation of what our encoders have actually learned could be useful.
So I've started putting together a notebook to run some experiments. Please review and share your thoughts.
We will also compare the quality of our NLP encoder backbones to the TF Hub Universal Sentence Encoder: colab notebook link.
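For reference, scoring a sentence pair with the TF Hub encoder looks roughly like the sketch below. This isn't the notebook code; it assumes the TF2-style `hub.load` API and the `/4` module URL, and the example sentences are made up.

```python
import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub (module URL/version assumed).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

pair = ["How can I learn deep learning quickly?",
        "What is the fastest way to study deep learning?"]
vecs = embed(pair).numpy()           # (2, 512) sentence embeddings
print(cosine_sim(vecs[0], vecs[1]))  # higher => more similar
```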
My gut feeling is that our pre-trained backbones can give us a good head start in training these STS models, rather than training from scratch. I'll start with the Quora Kaggle dataset and post results soon.
The idea that we can "throw away RNNs" and just use attentional models is ridiculous, particularly in light of the fact that SotA solutions are being implemented on RNNs regularly and SotA is changing so frequently. Attention is an awesome mechanism but there's no need to ignore a whole class of solutions just because you like another.
As you can see in my notebook above, I'm betting big on GRUs and RNNs. And the whole point of this exercise is to pit our LM's encoder against the Vaswani Transformer-based encoder (from TF Hub).
I totally echo your sentiment, and I'll keep pushing to see what our encoders can do.
I love how the pre-trained model already detects some similarity like this. Did you try fine-tuning it on the Quora task? It'd be interesting to see if it gets even better results then.
Since creating the above notebook and seeing better results, I've started training it some more using the commutated dataset and a trick I'm calling wide-GRUs… I'll post the updated notebook soon.
No, what I meant was: fine-tune the language model on this subset and then see what the cosine similarities look like. For now you just load the weights of the pre-trained model and adapt them to this specific embedding (or did I miss something?).
Yes, you're right. That's what I plan to do next.
Also, as you can see, I'm treating this as a regression objective as opposed to classification. Since cosine similarity gives a continuous score (in [-1, 1]) and y is given as 0/1, I thought I might take advantage of that and make the model work harder.
Have you come across such a method before? I had a gut feeling, set it up, and kicked off the training. I need to see how well it does.
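To make that concrete, here's a minimal PyTorch sketch of the idea. The encoder, the dimensions, and the fake batch below are all placeholders, not my actual notebook code; in the real thing the encoder would be the pre-trained LM backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in encoder: vocab size, embedding dim and hidden size are placeholder values.
emb = nn.Embedding(10000, 300)
gru = nn.GRU(300, 400, batch_first=True)

def sentence_vec(tokens):
    # tokens: (batch, seq_len) token ids -> (batch, 400) sentence vectors
    _, h = gru(emb(tokens))
    return h[-1]

# Regression setup: train the cosine similarity of the two sentence vectors
# directly against the 0/1 "is duplicate" label with an L1 loss.
criterion = nn.L1Loss()
q1 = torch.randint(0, 10000, (32, 20))    # fake batch of question-1 token ids
q2 = torch.randint(0, 10000, (32, 20))    # fake batch of question-2 token ids
y  = torch.randint(0, 2, (32,)).float()   # 0 = not duplicate, 1 = duplicate

sim  = F.cosine_similarity(sentence_vec(q1), sentence_vec(q2), dim=1)
loss = criterion(sim, y)
loss.backward()
```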
You shouldn't rely too much on my advice: I have exactly five months of experience in deep learning.
It's a bit like when we use a sigmoid to classify between 0 and 1, so using the cosine similarity here doesn't shock me.
Sure, but I do value your knowledge and the quality of your posts/notebooks.
More than the cosine similarity, I'm talking about the L1 loss (regression) as opposed to BCELoss (classification).
The 0 and 1 are not just labels (strings) but are treated as a numeric measure of similarity.
But… just to be sure… I've kicked off a BCELoss classifier as well… will keep an eye on both.
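The classification variant looks roughly like the sketch below. It's a stand-in, not my actual code: the specific way of squashing the cosine similarity into (0, 1) before BCELoss is an assumption about the details.

```python
import torch
import torch.nn as nn

# Sketch of the classification variant: the same cosine similarity score,
# but squashed from [-1, 1] into (0, 1) and trained with binary cross entropy.
bce  = nn.BCELoss()
sim  = torch.rand(32) * 2 - 1              # stand-in for a batch of cosine similarities
y    = torch.randint(0, 2, (32,)).float()  # 0/1 duplicate labels
prob = (sim + 1) / 2                       # map [-1, 1] -> [0, 1]; a scaled sigmoid also works
loss = bce(prob, y)
```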
For me it's the cosine similarity that's really interesting: L1 loss just takes the mean of the differences with the targets, whereas cosine similarity is the thing that makes objects in this high-dimensional space look close or not. Let me know if it works better than the classic BCELoss!
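A tiny illustration of that difference, with toy numbers: two vectors pointing the same way but with different lengths look identical to cosine similarity, while their element-wise (L1-style) difference is large.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = 10 * a                                # same direction, very different magnitude
print(F.cosine_similarity(a, b, dim=0))   # 1.0 -> "identical" by angle
print(torch.mean(torch.abs(a - b)))       # 18.0 -> far apart by mean absolute difference
```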
Alright… so… I totally miscalculated in assuming that the Quora dataset is a good representation of "all" sentence pairs.
So while my custom pair similarity tests gave reasonable accuracy scores, the Quora dataset log loss was terrible.
I have loaded up the model + saved weights, changed to a new loss criterion (log loss), and started training again. At least I'm not starting from scratch.
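Mechanically that's just reloading the saved state and swapping the criterion, roughly as below. The model, checkpoint path, and learning rate are hypothetical placeholders, just to show the resume-and-swap step.

```python
import torch
import torch.nn as nn

# Placeholder STS head and checkpoint path (both hypothetical), to show the
# mechanics of resuming from saved weights with a different loss criterion.
sts_model = nn.Sequential(nn.Linear(400, 100), nn.ReLU(), nn.Linear(100, 1), nn.Sigmoid())

ckpt = "models/sts_l1_best.pth"              # hypothetical path to the saved weights
sts_model.load_state_dict(torch.load(ckpt))  # reload instead of starting from scratch

criterion = nn.BCELoss()                     # new criterion: log loss instead of L1
optimizer = torch.optim.Adam(sts_model.parameters(), lr=1e-4)
# ...then continue the usual training loop with the new criterion.
```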
If this doesn't work, I'll put this piece on hold and work on the LM fine-tune for the STS task.
@narvind2003 @sgugger I've been working on getting cosine similarity working instead of MCE for language models. I've got a working implementation, but I'm still trying to find a loss function that works well. Mixed with cross entropy it works okay, and slightly improves the accuracy of the model, at a slight expense of mce/perplexity.
By itself it fails to converge to a good solution, I think because there's a mode collapse of just pushing all of the vectors closer together. I'm working on an alternative loss function that prevents that. If that doesn't work I was planning on writing up a blog post to celebrate the failure.
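For discussion, a simplified sketch of one way such a mixed objective can be wired up is below. This is not the actual implementation: the 0.1 weight, the use of tied embeddings, and comparing the hidden state to the target word's embedding are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Mixed LM objective sketch: standard cross entropy on next-word logits, plus a
# cosine term pulling the hidden state toward the target word's embedding.
vocab, d = 10000, 400
emb = nn.Embedding(vocab, d)                       # word embeddings (tying is an assumption)
decoder = nn.Linear(d, vocab)                      # output layer producing logits

hidden  = torch.randn(32, d, requires_grad=True)   # stand-in for the LM's hidden states
targets = torch.randint(0, vocab, (32,))           # next-word targets

ce_loss  = F.cross_entropy(decoder(hidden), targets)
cos_loss = 1 - F.cosine_similarity(hidden, emb(targets), dim=1).mean()

# The cosine term alone can be trivially minimized by letting all vectors drift
# together (the mode collapse mentioned above), hence mixing it with CE.
loss = ce_loss + 0.1 * cos_loss
loss.backward()
```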
If I understood correctly, you are talking about the LM and not the LM-backed classifier.
Are you trying to get the LM to predict target words based on cosine similarity + CE?
By increased accuracy, do you mean that the LM somehow predicted the expected target more often with cosine similarity than without?