Hi,
Is it possible to extract embedding vectors from a ULMFiT language model? If so, how? Has anyone tried it before? Is it truly more efficient than BERT?
Thanks!
This thread might be of interest:
That was…me who asked the question. But thanks anyways
Maybe I’m oversimplifying, but at the end of training, if you just take embedding.weight.data you should get the embeddings as an array with one row mapped to every word.
Hi, thanks but it’s not working. Any other thoughts?
Are you saying you’re having issues accessing that weight matrix? If so, here is how I got to those weights (assuming I created an AWD_LSTM model and did a learn.load_encoder()):
net = learn.model                    # the full model
encoder = net[0]                     # the encoder half
enc = list(encoder.children())[0]    # the inner AWD_LSTM module
w = enc.encoder.weight               # embedding matrix, shape (vocab_size, emb_dim)
And then if you wanted to map them back, you can see that your databunch’s vocab has the same length.
E.g. (assuming IMDB_SAMPLE):
len(learn.data.vocab[0])  # 7080
len(w)                    # 7080
(There could be a simpler way than that to index them, did it on the fly)
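To make the mapping concrete, here is a minimal, self-contained sketch of the same idea using a toy plain-PyTorch embedding in place of the actual AWD_LSTM encoder (the vocab and dimensions here are made up for illustration):

```python
import torch.nn as nn

# Toy stand-ins: itos for learn.data.vocab's index-to-string list,
# emb for the enc.encoder embedding layer extracted above.
itos = ["xxunk", "xxpad", "the", "movie", "great"]
emb = nn.Embedding(len(itos), 8)

w = emb.weight.data                             # (vocab_size, emb_dim)
token2vec = {tok: w[i] for i, tok in enumerate(itos)}

print(token2vec["movie"].shape)                 # one 8-dim vector per token
```

Because the vocab list and the weight matrix share the same row order, indexing by a token’s position in the vocab gives you that token’s embedding vector.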
If we need to work on sentence similarity, which embeddings would you choose: the embeddings from the language model or the embeddings from the classifier?
The embedding layer is the first layer of the entire network, and it gets updated through backpropagation. So you would take the weights of the embedding layer.
To add to this, they’d both be extremely similar. If you don’t have a classifier, then grab the language model’s weights (which is exactly what I described above).
Thanks. Can those weights be used for similarity search? Same way as Image2vec for instance.
I’m not very familiar with similarity search so I would not be able to give a good answer on that (I’m getting to Image2Vec in the next few weeks) but I’d imagine they sound quite similar so possibly? There’s a fastai 1.0 version of Image2Vec out there that I have seen that perhaps you could start with?
Oh there is? I haven’t seen it at all. Do you have any good links?
Here is a 1.0: MultiFloatList - reimplementation of DeViSe in fastai v1
And here is after part 2 of the course:
Eventually I’ll be redoing it in fastai 2.0
BTW the weights extracted from the few lines of code that you’ve provided are not word embeddings (I’ve tried them before). I honestly don’t know what they are.
Here’s a colab[1] that gets normalized embeddings from the encoder.
import numpy as np
from torch.nn import functional as F

def get_normalized_embeddings():
    # L2-normalize each row so dot products become cosine similarities
    return F.normalize(lang_mod.model[0].encoder.weight)

def most_similar(token, embs):
    idx = data_lm.vocab.itos.index(token)
    sims = (embs[idx] @ embs.t()).cpu().detach().numpy()
    print(f'Similar to: {token}')
    # skip index 0 of the sort (the token itself), show the next 10
    for sim_idx in np.argsort(sims)[::-1][1:11]:
        print(f'{data_lm.vocab.itos[sim_idx]:<30}{sims[sim_idx]:.02f}')
[1] https://blog.datascienceheroes.com/spam-detection-using-fastai-ulmfit-part-1-language-model/
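For anyone who wants to try the idea without a trained fastai model, here is a self-contained NumPy version of the same nearest-neighbour lookup, with a small random matrix standing in for the encoder weights (the vocab and vectors are toy assumptions, not from the colab):

```python
import numpy as np

def normalize(w):
    # L2-normalize rows so that dot products are cosine similarities
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def most_similar(token, itos, embs, topk=3):
    idx = itos.index(token)
    sims = embs[idx] @ embs.T
    order = np.argsort(sims)[::-1]
    # order[0] is the token itself; return the next topk neighbours
    return [(itos[i], float(sims[i])) for i in order[1:topk + 1]]

itos = ["the", "movie", "film", "great", "terrible"]
rng = np.random.default_rng(0)
w = rng.normal(size=(len(itos), 16))
# nudge "film" toward "movie" so the toy example has a clear neighbour
w[2] = w[1] + 0.1 * rng.normal(size=16)

embs = normalize(w)
print(most_similar("movie", itos, embs))  # "film" should rank first
```

Swapping the toy matrix for the normalized encoder weights from the snippet above gives the same ranking behaviour on a real vocab.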