Extracting word embeddings from ULMFiT

Hi,

Is it possible to extract embedding vectors from a ULMFiT language model? If so, how? Has anyone tried it before? Is it really more efficient than BERT?

Thanks!

This thread might be of interest:

That was… me who asked that question. But thanks anyway!

Maybe I'm oversimplifying, but at the end of training, if you just take embedding.weight.data, you should get the embeddings as an array with one row per word in the vocab.
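
To illustrate with a toy example (plain PyTorch, not the ULMFiT model itself; the names here are made up for illustration):

import torch.nn as nn

emb = nn.Embedding(10, 4)     # toy vocab of 10 tokens, 4-dimensional vectors
vectors = emb.weight.data     # tensor of shape (vocab_size, emb_dim) = (10, 4)
print(vectors[3])             # the embedding vector for token index 3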

Hi, thanks but it’s not working. Any other thoughts?

Are you saying you're having issues accessing that weight matrix? If so, here is how I got to those weights (assuming I created an AWD_LSTM model and did a learn.load_encoder()):

net = learn.model                    # SequentialRNN: encoder + classifier head
encoder = net[0]                     # the encoder half of the model (wraps the AWD_LSTM)
enc = list(encoder.children())[0]    # the AWD_LSTM module inside it
w = enc.encoder.weight               # its embedding weight matrix, shape (vocab_size, emb_sz)

And then, if you want to map them back to tokens, you can check that your DataBunch's vocab has the same length.

E.g. (assuming IMDB_SAMPLE):

len(learn.data.vocab[0])
7080

len(w)
7080

(There may be a simpler way to index them; I did this on the fly.)
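
As a rough sketch of the mapping itself (assuming the w matrix from above, and that the vocab exposes its token list as .itos, as it usually does in fastai v1):

itos = learn.data.vocab.itos                       # index -> token string
word_vectors = {tok: w[i].detach().cpu().numpy()   # one row of w per token
                for i, tok in enumerate(itos)}
print(word_vectors['movie'].shape)                 # e.g. (400,) with the default AWD_LSTM emb_sz,
                                                   # assuming 'movie' is in the vocab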

If we need to work on sentence similarity, which embeddings would you choose: the embeddings from the language model or the embeddings from the classifier?

The embedding layer is the first layer of the entire network, and it gets updated through backpropagation. So you would take the weights of the embedding layer.

To add to this, they'd both be extremely similar. If you don't have a classifier, grab the language model's weights (which you can access in much the same way as I described above).
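
For the sentence-similarity question above, one very simple baseline (just a sketch, using the hypothetical word_vectors lookup from earlier, not anything ULMFiT-specific) is to average the word vectors of a sentence and compare with cosine similarity:

import numpy as np

def sentence_vector(tokens, word_vectors):
  # mean of the word vectors for the tokens we actually have
  vecs = [word_vectors[t] for t in tokens if t in word_vectors]
  return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
  return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = sentence_vector(['a', 'great', 'movie'], word_vectors)
v2 = sentence_vector(['a', 'fantastic', 'film'], word_vectors)
print(cosine_similarity(v1, v2))   # closer to 1.0 means more similar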

Thanks. Can those weights be used for similarity search, the same way as Image2Vec, for instance?

I'm not very familiar with similarity search, so I can't give a good answer on that (I'm getting to Image2Vec in the next few weeks), but the two sound quite similar, so possibly? There's a fastai 1.0 version of Image2Vec out there that I've seen; perhaps you could start with that? 🙂

Oh there is? I haven’t seen it at all. Do you have any good links?

Here is a 1.0: MultiFloatList - reimplementation of DeViSe in fastai v1

And here is after part 2 of the course:

Eventually I’ll be redoing it in fastai 2.0

BTW, the weights extracted with the few lines of code you provided are not word embeddings (I've tried them before). I honestly don't know what they are.

@youcefjd - I… er… well, that’s embarrassing. Sorry about that, mate!

Here's a Colab notebook[1] that gets normalized embeddings from the encoder:

from torch.nn import functional as F
import numpy as np

def get_normalized_embeddings():
  # L2-normalize each row of the embedding matrix so dot products become cosine similarities
  return F.normalize(lang_mod.model[0].encoder.weight)

def most_similar(token, embs):
  idx = data_lm.vocab.itos.index(token)                  # row index of the query token
  sims = (embs[idx] @ embs.t()).cpu().detach().numpy()   # cosine similarity to every token in the vocab

  print(f'Similar to: {token}')
  for sim_idx in np.argsort(sims)[::-1][1:11]:           # top 10 neighbours, skipping the token itself
    print(f'{data_lm.vocab.itos[sim_idx]:<30}{sims[sim_idx]:.02f}')
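
Usage would then look roughly like this (the query token is just an example and has to be in data_lm's vocab):

embs = get_normalized_embeddings()
most_similar('good', embs)   # prints the 10 nearest tokens by cosine similarity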

[1] https://blog.datascienceheroes.com/spam-detection-using-fastai-ulmfit-part-1-language-model/

@msivanes Your answer is really helpful. Thank you.