What does the encoder actually learn? 🤔

narvind2003 · April 14, 2018, 5:57pm

As discussed in the class 10 in-class thread, it might be useful to evaluate the quality of our encoders. An in-depth evaluation of what the encoders have actually learned could be useful.

So, I’ve started putting together a notebook to conduct experiments. Please review and share your thoughts.

We will also compare the quality of our NLP encoder backbones to the TF Hub universal encoder: colab notebook link.

narvind2003 · April 15, 2018, 1:00am

Friends,

If any of you have spent time training LM and classifier backbones, I’d appreciate it if you could run the above experiments and report your results.

This is purely inference and should run in a few seconds.

narvind2003 · April 15, 2018, 4:32pm

Yes, good point. We will need to benchmark the phrase and sentence similarity scores.

Some options to proceed:

My gut feeling is that our pre-trained backbones can give us a good head start in training these STS models, compared with training from scratch. I’ll start with the Quora Kaggle dataset and post results soon.

narvind2003 · April 18, 2018, 9:05pm

I’m getting better results on the STS tasks after additional training on the Quora Kaggle dataset.
Please see: https://github.com/arvind-cp/fastai/blob/arvind-cp-LM-eval/courses/dl2/Quora.ipynb

Next steps: commutate the Quora dataset (add each pair again with the two questions swapped) and train on the resulting 2x dataset.
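One reading of commutating the dataset is to add every question pair a second time in swapped order, since similarity should not depend on which question comes first. A minimal sketch under that assumption (the rows and the `commutate` helper are made up for illustration and are not taken from the notebook):

```python
# Hypothetical rows in the shape (question1, question2, is_duplicate).
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How do I learn Python?", "How do I bake bread?", 0),
]


def commutate(rows):
    """Return the original rows followed by a copy of each row with the questions swapped."""
    swapped = [(q2, q1, label) for q1, q2, label in rows]
    return rows + swapped


augmented = commutate(pairs)
print(len(pairs), len(augmented))
```

The label is carried over unchanged, so the augmented set is exactly twice the size of the original.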

narvind2003 · April 19, 2018, 3:04pm

Oh no!!

“The fall of RNN / LSTM” @culurciello https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0

Even · April 19, 2018, 5:29pm

The idea that we can ‘throw away RNNs’ and just use attentional models is ridiculous, particularly in light of the fact that SotA solutions are being implemented on RNNs regularly and SotA is changing so frequently. Attention is an awesome mechanism but there’s no need to ignore a whole class of solutions just because you like another.

narvind2003 · April 19, 2018, 5:43pm

As you can see in my notebook above, I’m betting big on GRUs and RNNs. And the whole point of this exercise is to pit our LM’s encoder against the Vaswani Transformer-based encoder (from TF Hub).

I totally echo your sentiment & I’ll keep pushing and see what our encoders can do.

narvind2003 · April 20, 2018, 7:37pm

@sgugger : would appreciate your comments on my experiment above.

sgugger · April 20, 2018, 9:16pm

I love how the pre-trained model already detects some similarity like this. Did you try fine-tuning it on the Quora task? It’d be interesting to see if it gets even better results then.

narvind2003 · April 20, 2018, 10:14pm

Sure. See the results here:

https://github.com/arvind-cp/fastai/blob/arvind-cp-LM-eval/courses/dl2/Quora.ipynb

Since creating the above notebook and seeing better results, I’ve started training it some more using the commutated dataset and a trick I’m calling wide-GRUs…will post the updated notebook soon.

sgugger · April 21, 2018, 1:02am

No, what I meant was finetune the Language Model to this subset and then see what the cosine similarities looked like. Because for now you just load the weights of the pre-trained model and adapt them to this specific embedding (or did I miss something?).
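As a rough illustration of the comparison being suggested, the sketch below turns per-token encoder outputs into one sentence vector and compares two sentences by cosine similarity. Mean pooling, the tiny 3-dimensional vectors, and the helper names are all assumptions made for the example; the notebook may pool differently.

```python
import math


def mean_pool(hidden_states):
    """Average per-token vectors (one list per token) into a single sentence vector."""
    n = len(hidden_states)
    return [sum(col) / n for col in zip(*hidden_states)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up encoder outputs: one 3-dimensional vector per token.
sent_a = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
sent_b = [[0.8, 0.2, 0.1], [0.6, 0.2, 0.0]]
sent_c = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]

vec_a, vec_b, vec_c = (mean_pool(s) for s in (sent_a, sent_b, sent_c))
print(cosine_similarity(vec_a, vec_b))  # similar pair
print(cosine_similarity(vec_a, vec_c))  # dissimilar pair
```

Running the same pairs through the encoder before and after fine-tuning the language model would show whether the gap between similar and dissimilar pairs widens.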

narvind2003 · April 21, 2018, 1:17am

Yes, you’re right. That’s what I plan to do next.

Also, as you can see, I’m treating this as a regression objective as opposed to classification. Since cosine similarity is (0…1) and y is given as (0,1) I thought I might take advantage of it and make the model work harder.

Have you come across such a method before? I had a gut feeling, set it up and kicked off the training. Need to see how well it does.
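A minimal sketch of that setup, assuming the model's prediction for a pair is simply the cosine similarity of the two sentence vectors and the loss is the mean absolute error against the 0/1 labels. The vectors and helper names are invented for illustration. Note that cosine similarity in general ranges over -1 to 1; it stays within 0 to 1 only when the vectors have no negative components.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l1_loss(preds, targets):
    """Mean absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


# Made-up (vector_1, vector_2, label) triples; label 1 means "similar".
batch = [
    ([1.0, 0.0], [0.9, 0.1], 1),
    ([1.0, 0.0], [0.0, 1.0], 0),
]

preds = [cosine_similarity(u, v) for u, v, _ in batch]
targets = [y for _, _, y in batch]
loss = l1_loss(preds, targets)
print(preds, loss)
```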

sgugger · April 21, 2018, 1:43am

You shouldn’t rely too much on my advice: I have exactly five months of experience in deep learning.
It’s a bit like when we use a sigmoid to classify between 0 and 1, so using the cosine similarity here doesn’t shock me.

narvind2003 · April 21, 2018, 2:01am

Sure, but I do value your knowledge and the quality of your posts/notebooks.

More than the cosine similarity, I’m talking about the L1 loss (regression) as opposed to BCELoss (classification).
0 and 1 are not just labels (strings) but are treated as a measure (int) of similarity.
But…just to be sure…I’ve kicked off a BCELoss classifier as well…will keep an eye on both.
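To make the difference between the two criteria concrete, here is a small per-example comparison for a target of 1. These are plain-Python stand-ins for the two losses, not the PyTorch classes themselves, and the `eps` clamp is an assumption added so the logarithm stays finite at the boundaries.

```python
import math


def l1(pred, target):
    """Absolute error: grows linearly with the distance from the target."""
    return abs(pred - target)


def bce(pred, target, eps=1e-7):
    """Binary cross entropy for one example, with pred clamped into (0, 1)."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


# Same target (1), increasingly wrong predictions.
for pred in (0.9, 0.5, 0.1, 0.01):
    print(pred, round(l1(pred, 1), 3), round(bce(pred, 1), 3))
```

L1 never exceeds 1 for predictions in the 0 to 1 range, while the cross entropy keeps growing as a wrong prediction becomes more confident.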

sgugger · April 21, 2018, 2:13am

For me it’s the cosine similarity that’s really interesting. L1 loss just takes the mean of the absolute differences from the targets, whereas the cosine similarity is the thing that makes objects in this high-dimensional space look close or not. Let me know if it works better than the classic BCELoss!

narvind2003 · April 21, 2018, 1:34pm

Yeah…so…the BCELoss was a bad idea…I killed it.

The loss surface was like the Tibetan Plateau…what was I thinking?!?

narvind2003 · April 22, 2018, 4:33pm

I had a bug in my code…lost a day

narvind2003 · April 23, 2018, 1:01am

Alright…so…I was wrong to assume that the Quora dataset is a good representation of “all” sentence pairs.
So while my custom pair similarity tests gave reasonable accuracy scores, the log loss on the Quora dataset was terrible.

I have loaded up the model + saved weights, changed to a new loss criterion (log loss) and started training again. At least I’m not starting from scratch.
If this doesn’t work, I’ll put this piece on hold and work on the LM fine-tune for the STS task.
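Accuracy and log loss can disagree because accuracy only looks at which side of the threshold a prediction falls on, while log loss also punishes confidence in a wrong answer. The sketch below uses invented probabilities to show two sets of predictions with identical accuracy and very different log loss; it is an illustration of the metric, not a diagnosis of the model above.

```python
import math


def accuracy(probs, labels, threshold=0.5):
    """Fraction of predictions that land on the correct side of the threshold."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def log_loss(probs, labels, eps=1e-15):
    """Mean binary cross entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


labels = [1, 1, 0, 0, 1]
moderate = [0.8, 0.7, 0.2, 0.3, 0.4]             # one miss, mild confidence
overconfident = [0.99, 0.99, 0.01, 0.01, 0.001]  # same miss, extreme confidence

for name, probs in (("moderate", moderate), ("overconfident", overconfident)):
    print(name, accuracy(probs, labels), round(log_loss(probs, labels), 3))
```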

Even · April 23, 2018, 3:02am

@narvind2003 @sgugger I’ve been working on getting cosine similarity working instead of MCE for language models. I’ve got a working implementation, but I’m still trying to find a loss function that works well. Mixed with cross entropy it works okay, and slightly improves the accuracy of the model, at a slight expense of MCE/perplexity.

By itself it fails to converge to a good solution, I think because there’s a mode collapse of just pushing all of the vectors closer together. I’m working on an alternative loss function that prevents that. If that doesn’t work I was planning on writing up a blog post to celebrate the failure.
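One way to picture that collapse: if the cosine term is defined as 1 minus the cosine similarity between the predicted vector and the target word's embedding, then making every embedding identical drives it to zero for every target, so the loss is minimized without learning anything. The sketch below assumes that definition, dot-product logits, and a made-up weight `alpha`; it is not the implementation described in the post.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cosine_loss(pred_vec, target_vec):
    """Assumed cosine term: 0 when the vectors point the same way, 2 when opposite."""
    return 1.0 - cosine_similarity(pred_vec, target_vec)


def cross_entropy(logits, target_idx):
    """Negative log-softmax of the target logit (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_idx]


def mixed_loss(pred_vec, embeddings, target_idx, alpha=0.5):
    """Cross entropy over dot-product logits plus a weighted cosine term."""
    logits = [sum(p * e for p, e in zip(pred_vec, emb)) for emb in embeddings]
    return cross_entropy(logits, target_idx) + alpha * cosine_loss(pred_vec, embeddings[target_idx])


spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]     # distinct word embeddings
collapsed = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # every word pushed to one point

# Cosine term alone: effectively zero for every target once the embeddings collapse.
print([cosine_loss([1.0, 1.0], e) for e in collapsed])

# Mixed loss: the collapsed vocabulary cannot get below log(3), the spread one can.
print(mixed_loss([1.0, 1.0], collapsed, 0), mixed_loss([1.0, 0.0], spread, 0))
```

Under these assumptions the cross entropy term is what penalizes the collapsed solution, which is consistent with the mixed loss behaving better than the cosine term alone.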

narvind2003 · April 23, 2018, 4:03am

Very cool! I look forward to hearing more!

If I understood correctly, you are talking about the LM and not the LM-backed classifier.

Are you trying to get the LM to predict target words based on cosine similarity + CE?
By increased accuracy, do you mean that the LM somehow predicted the expected target more often with cosine similarity than without?