Extracting word embeddings from ULMFiT

Hi,

Is it possible to extract embedding vectors from a ULMFiT language model? If so, how? Has anyone tried it before? Is it really more efficient than BERT?

Thanks!

This thread might be of interest:

That was… me who asked that question. But thanks anyway!

Maybe I'm oversimplifying, but at the end of training, if you just take embedding.weight.data, you should get the embeddings as an array with one row per word in the vocab.
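
To illustrate with a toy example (plain PyTorch, not the ULMFiT model itself; the names here are made up for illustration):

import torch.nn as nn

emb = nn.Embedding(10, 4)     # toy vocab of 10 tokens, 4-dimensional vectors
vectors = emb.weight.data     # tensor of shape (vocab_size, emb_dim) = (10, 4)
print(vectors[3])             # the embedding vector for token index 3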

Hi, thanks but it’s not working. Any other thoughts?

Are you saying you're having issues accessing that weight matrix? If so, here is how I got to those weights (assuming I created an AWD_LSTM model and did a learn.load_encoder()):

net = learn.model                    # SequentialRNN: encoder + classifier head
encoder = net[0]                     # the encoder half of the model (wraps the AWD_LSTM)
enc = list(encoder.children())[0]    # the AWD_LSTM module inside it
w = enc.encoder.weight               # its embedding weight matrix, shape (vocab_size, emb_sz)

And then, if you want to map them back to tokens, you can check that your DataBunch's vocab has the same length.

E.g. (assuming IMDB_SAMPLE):

len(learn.data.vocab[0])
7080

len(w)
7080

(There may be a simpler way to index them; I did this on the fly.)
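
As a rough sketch of the mapping itself (assuming the w matrix from above, and that the vocab exposes its token list as .itos, as it usually does in fastai v1):

itos = learn.data.vocab.itos                       # index -> token string
word_vectors = {tok: w[i].detach().cpu().numpy()   # one row of w per token
                for i, tok in enumerate(itos)}
print(word_vectors['movie'].shape)                 # e.g. (400,) with the default AWD_LSTM emb_sz,
                                                   # assuming 'movie' is in the vocab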

If we need to work on sentence similarity, which embeddings would you choose: the embeddings from the language model or the embeddings from the classifier?

The embedding layer is the first layer of the entire network, and it gets updated through backpropagation. So you would take the weights of the embedding layer.

To add to this, they'd both be extremely similar. If you don't have a classifier, grab the language model's weights (which you can access in much the same way as I described above).
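
For the sentence-similarity question above, one very simple baseline (just a sketch, using the hypothetical word_vectors lookup from earlier, not anything ULMFiT-specific) is to average the word vectors of a sentence and compare with cosine similarity:

import numpy as np

def sentence_vector(tokens, word_vectors):
  # mean of the word vectors for the tokens we actually have
  vecs = [word_vectors[t] for t in tokens if t in word_vectors]
  return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
  return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = sentence_vector(['a', 'great', 'movie'], word_vectors)
v2 = sentence_vector(['a', 'fantastic', 'film'], word_vectors)
print(cosine_similarity(v1, v2))   # closer to 1.0 means more similar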

Thanks. Can those weights be used for similarity search, the same way as Image2Vec, for instance?

I'm not very familiar with similarity search, so I can't give a good answer on that (I'm getting to Image2Vec in the next few weeks), but the two sound quite similar, so possibly? There's a fastai 1.0 version of Image2Vec out there that I've seen; perhaps you could start with that? 🙂

Oh there is? I haven’t seen it at all. Do you have any good links?

Here is a 1.0: MultiFloatList - reimplementation of DeViSe in fastai v1

And here is after part 2 of the course:

Eventually I’ll be redoing it in fastai 2.0

BTW, the weights extracted with the few lines of code you provided are not word embeddings (I've tried them before). I honestly don't know what they are.

@youcefjd - I… er… well, that’s embarrassing. Sorry about that, mate!

Here's a Colab notebook[1] that gets normalized embeddings from the encoder:

from torch.nn import functional as F
import numpy as np

def get_normalized_embeddings():
  # L2-normalize each row of the embedding matrix so dot products become cosine similarities
  return F.normalize(lang_mod.model[0].encoder.weight)

def most_similar(token, embs):
  idx = data_lm.vocab.itos.index(token)                  # row index of the query token
  sims = (embs[idx] @ embs.t()).cpu().detach().numpy()   # cosine similarity to every token in the vocab

  print(f'Similar to: {token}')
  for sim_idx in np.argsort(sims)[::-1][1:11]:           # top 10 neighbours, skipping the token itself
    print(f'{data_lm.vocab.itos[sim_idx]:<30}{sims[sim_idx]:.02f}')
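
Usage would then look roughly like this (the query token is just an example and has to be in data_lm's vocab):

embs = get_normalized_embeddings()
most_similar('good', embs)   # prints the 10 nearest tokens by cosine similarity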

[1] https://blog.datascienceheroes.com/spam-detection-using-fastai-ulmfit-part-1-language-model/

@msivanes Your answer is really helpful. Thank you.