What does the encoder actually learn? šŸ¤”

As discussed in the class 10 in-class thread, it would be useful to evaluate the quality of our encoders and take an in-depth look at what they have actually learned.

So, I've started putting together a notebook to conduct experiments. Please review and share your thoughts.

We will also compare the quality of our NLP encoder backbones to the TF Hub Universal Sentence Encoder: colab notebook link.
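Roughly, the comparison harness looks like this (a minimal sketch, not the exact notebook code; `encoder` is a placeholder for our pre-trained LM backbone and is assumed to return hidden states of shape (seq_len, batch, hidden)):

```python
import torch
import torch.nn.functional as F

def sentence_vector(encoder, token_ids):
    """Mean-pool the encoder's hidden states into one fixed-size sentence vector.
    `encoder` is a placeholder for the pre-trained LM/GRU backbone."""
    with torch.no_grad():
        hidden = encoder(token_ids)            # assumed shape: (seq_len, batch=1, hidden)
    return hidden.mean(dim=0).squeeze(0)       # -> (hidden,)

def pair_similarity(encoder, ids_a, ids_b):
    """Cosine similarity between two sentences under a given encoder.
    The same pairs get scored with the TF Hub Universal Sentence Encoder for comparison."""
    va = sentence_vector(encoder, ids_a)
    vb = sentence_vector(encoder, ids_b)
    return F.cosine_similarity(va, vb, dim=0).item()
```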

10 Likes

Friends,

If any of you have spent time training LM and classifier backbones, I'd appreciate it if you could run the above experiments and report your results.

This is purely inference and should run in a few seconds.

Yes, good point, we will need to benchmark the phrase and sentence similarity scores.

Some options to proceed:

My gut feeling is that our pre-trained backbones can give us a good head start in training these STS models, rather than training from scratch. I'll start with the Quora Kaggle dataset and post results soon.

1 Like

I'm getting better results on the STS tasks after additional training on the Quora Kaggle dataset.
Please see: https://github.com/arvind-cp/fastai/blob/arvind-cp-LM-eval/courses/dl2/Quora.ipynb

Next steps: commutate the Quora dataset (swap each question pair) and train on the 2x-sized dataset.
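For the commutation step, something like this should double the data (a sketch; the file path is a placeholder, but `question1`/`question2`/`is_duplicate` are the columns in the Kaggle Quora file):

```python
import pandas as pd

df = pd.read_csv("quora_train.csv")   # placeholder path; Kaggle columns: question1, question2, is_duplicate

# Swap the two questions in every pair; the duplicate/non-duplicate label stays the same.
swapped = df.rename(columns={"question1": "question2", "question2": "question1"})

# Original + swapped pairs = 2x the training rows.
doubled = pd.concat([df, swapped[df.columns]], ignore_index=True)
```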

1 Like

Oh no!!

"The fall of RNN / LSTM" @culurciello https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0

3 Likes

:laughing: The idea that we can 'throw away RNNs' and just use attentional models is ridiculous, particularly in light of the fact that SotA solutions are being implemented on RNNs regularly and SotA is changing so frequently. Attention is an awesome mechanism, but there's no need to ignore a whole class of solutions just because you like another.

4 Likes

As you can see in my notebook above, I'm betting big on GRUs and RNNs. The whole point of this exercise is to pit our LM's encoder against the Vaswani transformer-based encoder (from TF Hub).

I totally echo your sentiment, and I'll keep pushing to see what our encoders can do.

1 Like

@sgugger: I would appreciate your comments on my experiment above.

I love how the pre-trained model already detects some similarity like this. Did you try fine-tuning it on the Quora task? It'd be interesting to see if it gets even better results then.

1 Like

Sure. See the results here:

Since creating the above notebook and seeing better results, I've started training it some more using the commutated dataset and a trick I'm calling wide-GRUs… will post the updated notebook soon.

2 Likes

No, what I meant was to fine-tune the language model on this subset and then see what the cosine similarities look like. Because for now you just load the weights of the pre-trained model and adapt them to this specific embedding (or did I miss something?).
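Something like the usual next-word fine-tuning loop, just run on the Quora question text before looking at the similarities again (a rough sketch; `lm` and `quora_batches` are placeholders for your model and batched data):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(lm.parameters(), lr=1e-4)    # lm = pre-trained language model (placeholder)

for input_ids, target_ids in quora_batches:               # batches built from the Quora questions only
    logits = lm(input_ids)                                 # assumed shape: (seq_len, batch, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After a few epochs, re-extract the encoder and recompute the cosine similarities.
```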

1 Like

Yes, you're right. That's what I plan to do next.

Also, as you can see, I'm treating this as a regression objective as opposed to classification. Since cosine similarity is a continuous score (0…1) and y is given as a binary label (0 or 1), I thought I might take advantage of that and make the model work harder.

Have you come across such a method before? I had a gut feeling, set it up and kicked off the training. Need to see how well it does.
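For concreteness, here's a minimal sketch of the setup I mean (not the exact notebook code; the encoder interface and the mean-pooling are assumptions): the pair's predicted score is just the cosine similarity, trained with L1 loss against the 0/1 duplicate label.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosinePairScorer(nn.Module):
    """Sketch: score a question pair with the cosine similarity of its two sentence vectors."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder                       # pre-trained LM encoder (placeholder)

    def forward(self, ids_a, ids_b):
        va = self.encoder(ids_a).mean(dim=0)         # mean-pooled sentence vectors, (batch, hidden)
        vb = self.encoder(ids_b).mean(dim=0)
        return F.cosine_similarity(va, vb, dim=-1)   # one similarity score per pair

# Regression objective: y is the is_duplicate label as a float (0. or 1.)
# loss = F.l1_loss(scorer(ids_a, ids_b), y)
```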

You shouldn't rely too much on my advice: I have exactly five months of experience in deep learning :wink:
It's a bit like when we use a sigmoid to classify between 0 and 1, so using the cosine similarity here doesn't shock me.

1 Like

Sure, but I do value your knowledge and the quality of your posts/notebooks.

More than the cosine similarity, I'm talking about the L1 loss (regression) as opposed to BCELoss (classification).
The 0 and 1 are not just labels (strings) but are treated as a measure (a number) of similarity.
But… just to be sure… I've kicked off a BCELoss classifier as well… will keep an eye on both.
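Concretely, the two criteria sit on the same cosine score like this (reusing the hypothetical scorer from the sketch above; `labels` stands in for the is_duplicate column; note that BCELoss wants its input in [0, 1], so the cosine score has to be rescaled, or pushed through a sigmoid, first):

```python
import torch.nn.functional as F

sim = scorer(ids_a, ids_b)        # cosine score in [-1, 1], from the sketch above (placeholder)
y = labels.float()                # is_duplicate as 0./1. (placeholder)

loss_l1  = F.l1_loss(sim, y)                            # regression view
loss_bce = F.binary_cross_entropy((sim + 1) / 2, y)     # classification view, after rescaling to [0, 1]
```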

For me it's the cosine similarity that's really interesting: L1 loss just takes the mean of the differences with the targets, whereas cosine similarity is the thing that makes objects in this high-dimensional space look close or not. Let me know if it works better than the classic BCELoss!

1 Like

Yeah… so… the BCELoss was a bad idea… I killed it.

The loss surface was like the Tibetan plateau… what was I thinking?!

I had a bug in my code… lost a day :frowning:

Alright… so… I totally miscalculated in assuming the Quora dataset is a good representation of "all" sentence pairs.
So while my custom pair similarity tests gave reasonable accuracy scores, the Quora dataset log loss was terrible.

I have loaded up the model + saved weights, changed to a new loss criterion (log loss), and started training again. At least I'm not starting from scratch.
If this doesn't work, I'll put this piece on hold and work on the LM fine-tuning for the STS task.

@narvind2003 @sgugger I've been working on getting cosine similarity working instead of MCE for language models. I've got a working implementation, but I'm still trying to find a loss function that works well. Mixed with cross entropy it works okay and slightly improves the accuracy of the model, at a slight expense of MCE/perplexity.

By itself it fails to converge to a good solution, I think because there's a mode collapse where all the vectors just get pushed closer together. I'm working on an alternative loss function that prevents that. If that doesn't work, I was planning on writing up a blog post to celebrate the failure.
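For what it's worth, the "mixed with cross entropy" version I'm describing looks roughly like this (a sketch only, not my exact implementation; it assumes the decoder's hidden size matches the embedding size, as with tied weights, and all names are placeholders):

```python
import torch
import torch.nn.functional as F

def mixed_lm_loss(logits, hidden, targets, emb_weight, alpha=0.5):
    """Sketch of a cross-entropy + cosine mix: the usual next-word cross entropy,
    plus a term pulling the decoder's hidden state toward the target word's embedding.
    Assumes hidden size == embedding size (e.g. tied weights)."""
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    target_vecs = emb_weight[targets.reshape(-1)]                       # embeddings of the true next words
    cos = F.cosine_similarity(hidden.reshape(-1, hidden.size(-1)), target_vecs, dim=-1)
    # Used on its own, the cosine term can collapse everything together; weighting it
    # against cross entropy (alpha) is what keeps training stable in this sketch.
    return ce + alpha * (1.0 - cos).mean()
```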

1 Like

Very cool! I look forward to hearing more!

If I understood correctly, you are talking about the LM and not the LM-backed classifier.

Are you trying to get the LM to predict target words based on cosine similarity + CE?
By increased accuracy, do you mean that the LM somehow predicted the expected target more often with cosine similarity than without?