What does the encoder actually learn? 🤔

(Arvind Nagaraj) #1

As discussed in the class 10 in-class thread, it might be useful to evaluate the quality of our encoders. An in-depth evaluation of what the encoders have actually learned could be useful.

So, I’ve started putting together a notebook to conduct experiments. Please review and share your thoughts.

We will also compare the quality of our NLP encoder backbones to the TF Hub universal encoder: colab notebook link.

(Arvind Nagaraj) #2


If any of you have spent time training LM and classifier backbones, I’d appreciate it if you could run the above experiments and report your results.

This is purely inference and should run in a few seconds.

Sentence similarity
(Arvind Nagaraj) #4

Yes, good point: we will need to benchmark the phrase and sentence similarity scores.

some options to proceed:

My gut feeling is that our pre-trained backbones can give us a good head start in training these STS models, rather than training from scratch. I’ll start with the Quora Kaggle dataset and post results soon.

(Arvind Nagaraj) #8

I’m getting better results on the STS tasks after additional training on the Quora kaggle dataset.
Please see: https://github.com/arvind-cp/fastai/blob/arvind-cp-LM-eval/courses/dl2/Quora.ipynb

Next steps: commutate the Quora dataset (add each question pair again in swapped order) and train on 2x the dataset size.
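For readers unfamiliar with the term, here is a minimal sketch of what commutating a pair dataset could look like. It assumes each row is a `(question1, question2, is_duplicate)` tuple; the `commutate` helper and the toy sentences are hypothetical and not taken from the notebook or the Quora data.

```python
from typing import List, Tuple

# Assumed row layout: (question1, question2, is_duplicate)
Pair = Tuple[str, str, int]


def commutate(pairs: List[Pair]) -> List[Pair]:
    """Return the original pairs followed by the same pairs with the questions swapped."""
    swapped = [(q2, q1, label) for q1, q2, label in pairs]
    return pairs + swapped


# Toy rows for illustration only
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "How do I cook rice?", 0),
]
augmented = commutate(pairs)
print(len(pairs), len(augmented))
```

Similarity is symmetric, so the swapped rows keep their labels, and the model sees each pair in both orders.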

(Arvind Nagaraj) #9

Oh no!!

“The fall of RNN / LSTM” @culurciello https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0

(Even Oldridge) #10

:laughing: The idea that we can ‘throw away RNNs’ and just use attentional models is ridiculous, particularly in light of the fact that SotA solutions are being implemented on RNNs regularly and SotA is changing so frequently. Attention is an awesome mechanism but there’s no need to ignore a whole class of solutions just because you like another.

(Arvind Nagaraj) #11

As you can see in my notebook above, I’m betting big on GRUs and RNNs. And the whole point of this exercise is to pit our LM’s encoder against the Vaswani transformer-based encoder (from TF Hub).

I totally echo your sentiment & I’ll keep pushing and see what our encoders can do.

(Arvind Nagaraj) #12

@sgugger : would appreciate your comments on my experiment above.


I love how the pre-trained model already detects some similarity like this. Did you try fine-tuning it on the Quora task? It’d be interesting to see if it gets even better results then.

(Arvind Nagaraj) #14

Sure. See the results here:

Since creating the above notebook and seeing better results, I’ve started training it some more using the commutated dataset and a trick I’m calling wide-GRUs…will post the updated notebook soon.


No, what I meant was fine-tune the language model on this subset and then see what the cosine similarities look like. Because for now you just load the weights of the pre-trained model and adapt them to this specific embedding (or did I miss something?).

(Arvind Nagaraj) #16

Yes, you’re right. That’s what I plan to do next.

Also, as you can see, I’m treating this as a regression objective as opposed to classification. Since cosine similarity is (0…1) and y is given as (0,1) I thought I might take advantage of it and make the model work harder.

Have you come across such a method before? I had a gut feeling, set it up and kicked off the training. Need to see how well it does.
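A rough sketch of the regression setup described above: score each pair with cosine similarity and compare it to the 0/1 label with an L1 loss. The helper names and toy vectors are hypothetical stand-ins for the encoder outputs, not the notebook's code. One caveat worth keeping in mind: cosine similarity ranges over [-1, 1] in general, and only stays in [0, 1] when the two vectors cannot point in opposing directions (for example, when all components are non-negative).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine of the angle between a and b; eps guards against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def l1_loss(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


# Toy "sentence vectors" standing in for encoder outputs (illustrative values)
enc_q1 = [[1.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
enc_q2 = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
labels = [1.0, 0.0]  # 1 = duplicate, 0 = not duplicate

sims = [cosine_similarity(a, b) for a, b in zip(enc_q1, enc_q2)]
loss = l1_loss(sims, labels)
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])  # vectors pointing in opposite directions
print(sims, loss, opposite)
```

In this toy case the identical pair scores close to 1 and the orthogonal pair scores 0, so the L1 loss is essentially zero; the opposite-direction pair shows the score can also go negative.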


You shouldn’t rely too much on my advice: I have exactly five months of experience in deep learning :wink:
It’s a bit like when we use a sigmoid to classify between 0 and 1, so using the cosine similarity here doesn’t shock me.

(Arvind Nagaraj) #18

Sure, but I do value your knowledge and the quality of your posts/notebooks.

More than the cosine similarity, I’m talking about the L1 loss (regression) as opposed to BCELoss (classification).
0 and 1 are not just labels (string) but treated as a measure (int) of similarity.
But…just to be sure…I’ve kicked off a BCELoss classifier as well…will keep an eye on both


For me it’s the cosine similarity that’s really interesting. L1 loss just takes the mean of the differences with the targets, whereas cosine similarity is the thing that makes objects in this high-dimensional space look close or not. Let me know if it works better than the classic BCELoss!

(Arvind Nagaraj) #20

Yeah…so…the BCELoss was a bad idea…I killed it.

The loss surface was like the Tibetan plateau…what was I thinking?!?

(Arvind Nagaraj) #21

I had a bug in my code…lost a day :frowning:

(Arvind Nagaraj) #22

Alright…so…I was totally wrong to assume that the Quora dataset is a good representation of “all” sentence pairs.
So while my custom pair similarity tests gave reasonable accuracy scores, the Quora dataset log loss was terrible.

I have loaded up the model + saved weights, changed to a new loss criterion (log loss) and started training again. At least I’m not starting from scratch.
If this doesn’t work, I’ll put this piece on hold and work on the LM fine-tune for the STS task.
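As a side note on why accuracy and log loss can disagree, here is a small sketch with made-up numbers. Two sets of predictions make exactly the same decisions at a 0.5 threshold, but the one that is confidently wrong on a single pair pays a much larger log loss. This illustrates one common cause in general; it is not a diagnosis of the run above.

```python
import math
from typing import Sequence


def log_loss(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-15) -> float:
    """Mean binary cross entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def accuracy(probs: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """Fraction of pairs whose thresholded prediction matches the label."""
    correct = sum(int((p >= threshold) == bool(y)) for p, y in zip(probs, labels))
    return correct / len(labels)


labels = [1, 0, 1, 0]
confident = [1.0, 0.0, 1.0, 1.0]  # last pair is confidently wrong
cautious = [0.8, 0.2, 0.8, 0.6]   # same decisions, softer scores

print(accuracy(confident, labels), log_loss(confident, labels))
print(accuracy(cautious, labels), log_loss(cautious, labels))
```

Both prediction sets reach the same accuracy, yet the confident one scores far worse on log loss because a single near-certain mistake dominates the mean.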

(Even Oldridge) #23

@narvind2003 @sgugger I’ve been working on using cosine similarity instead of MCE for language models. I’ve got a working implementation, but I’m still trying to find a loss function that works well. Mixed with cross entropy it works okay, and slightly improves the accuracy of the model, at a slight expense of MCE/perplexity.

By itself it fails to converge to a good solution, I think because there’s a mode collapse of just pushing all of the vectors closer together. I’m working on an alternative loss function that prevents that. If that doesn’t work I was planning on writing up a blog post to celebrate the failure.
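One way to picture the collapse described above, as a toy sketch rather than the actual implementation: if every word embedding ends up as the same vector, a cosine-only loss is near zero for every target, while a cross-entropy term still pays log(vocab size). The function names, the `alpha` weight, and the dot-product logits are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def cosine_loss(predicted: Sequence[float], target_embedding: Sequence[float]) -> float:
    """Zero when the prediction points in the same direction as the target embedding."""
    return 1.0 - cosine_similarity(predicted, target_embedding)


def cross_entropy(logits: Sequence[float], target_index: int) -> float:
    """Negative log-softmax of the target logit, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def mixed_loss(predicted: Sequence[float], embeddings: List[Sequence[float]],
               target_index: int, alpha: float = 0.5) -> float:
    """Cross entropy over dot-product logits plus a weighted cosine term (alpha is hypothetical)."""
    logits = [sum(p * e for p, e in zip(predicted, emb)) for emb in embeddings]
    return cross_entropy(logits, target_index) + alpha * cosine_loss(predicted, embeddings[target_index])


# Collapsed embedding table: every "word" shares the same vector
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
predicted = [1.0, 1.0]

cos_only = [cosine_loss(predicted, emb) for emb in collapsed]
mixed = [mixed_loss(predicted, collapsed, i) for i in range(len(collapsed))]
print(cos_only, mixed)
```

Under these assumptions the cosine-only loss cannot tell the collapsed solution from a good one, whereas the mixed loss stays at about log(3) for every target, which may be part of why mixing in cross entropy helps.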

(Arvind Nagaraj) #24

Very cool! I look forward to hearing more!

If I understood correctly, you are talking about the LM and not the LM-backed classifier.

Are you trying to get the LM to predict target words based on cosine similarity + CE?
By increased accuracy, do you mean that the LM somehow predicted the expected target more often with cosine similarity than without?