I think this would be a nice application of the “custom head” procedure that is one of fastai’s basic features. We should have a “custom-head” zoo.
Regarding effectiveness, I wouldn’t be able to tell you. Why don’t you just go ahead and try it? I will watch this thread.
Thanks for your reply @seb0 . I guess in absence of any more information I might just give this a go and see what happens as you suggest!
I’m already familiar with PyTorch so I had planned to just extract the underlying model from the fastai library and use this, but if you happen to know a way already then please let me know. I guess it’s not so much a custom head, but no head! Just using the bare encoding before any fully connected layer is added is my plan.
Have you got any results for this? I currently want to compare a short text against a lookup table and find the most closely related entries via their document embeddings.
However, I’m not sure where to get these embeddings from in order to compute the cosine similarity.
Hey. You can’t swap and change embeddings I’m afraid. Bear in mind that embeddings are just (a part of) model weights. So in order to compare your short sentence, you need to compare it with weights from the same model.
AFAIK, document vectors can be arrived at in a couple of different ways. It may be that you can just generate document vectors for your short text, but I’m not sure TBH. If it’s gensim doc2vec then try this:
I thought one can get the embeddings from the language model somehow…?!? But I’m not really sure…
My idea was: for each dataset, get the embeddings (I think in the link they average them) and store them in a dict. Then later, calculate the embeddings for a given text sample and finally look up the closest one in the stored dict…
But maybe I don’t get something …
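The store-then-look-up idea described above could be sketched roughly like this. It is only a minimal sketch, assuming every text has already been turned into a fixed-size vector by the same model; `doc_vectors`, `closest_doc` and the toy 3-dimensional vectors are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_doc(query_vec, doc_vectors):
    """Return the key of the stored vector most similar to query_vec."""
    return max(doc_vectors, key=lambda k: cosine_similarity(query_vec, doc_vectors[k]))

# Hypothetical lookup table: in practice these vectors would come from the model.
doc_vectors = {
    "doc_a": np.array([1.0, 0.0, 0.0]),
    "doc_b": np.array([0.0, 1.0, 0.0]),
    "doc_c": np.array([0.7, 0.7, 0.0]),
}

query = np.array([0.9, 0.1, 0.0])
best = closest_doc(query, doc_vectors)  # key of the nearest stored document
```

The query and the stored vectors must come from the same model for the similarity to mean anything.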
I think you’re mixing up embeddings and encodings. Embeddings in this scenario only work at word level, and you won’t have much luck averaging the words of a long document. The article above and what I’m talking about both take the representation of the sentence or document from the model by treating the model as an auto-encoder, i.e. taking the vector for each document from the last hidden layer after the activation.
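One way to see why averaging word vectors loses information: the mean ignores word order. A small sketch, assuming a made-up embedding table of random 4-dimensional vectors (not taken from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "bites", "man"]
# Hypothetical word-embedding table: one random 4-d vector per word.
emb = {w: rng.normal(size=4) for w in vocab}

def average_embedding(sentence):
    """Mean of the word vectors; word order plays no role."""
    return np.mean([emb[w] for w in sentence.split()], axis=0)

v1 = average_embedding("dog bites man")
v2 = average_embedding("man bites dog")
same = np.allclose(v1, v2)  # both sentences collapse to the same vector
```

An encoder that reads the tokens in sequence, such as an RNN, can in principle tell these two sentences apart.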
If that’s what you want to do, then we’re in the same boat. It shouldn’t be too hard. In native PyTorch it’s a doddle, so it’ll just be a case of finding out the supported way in the fastai library without having to hack the source code. I think someone has mentioned hooks. It really shouldn’t be that hard. Really the hard bit is all laid out for us which is creating and fine-tuning a language model (fine tuning will be necessary to make this work well).
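For reference, this is roughly what the hook approach looks like in plain PyTorch. It is a sketch on a toy model, not the fastai AWD-LSTM: `ToyLM`, its layer sizes and the `captured` dict are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ToyLM(nn.Module):
    """Stand-in for a language model: embedding -> LSTM encoder -> decoder head."""
    def __init__(self, vocab_sz=100, emb_sz=16, hid_sz=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.encoder = nn.LSTM(emb_sz, hid_sz, batch_first=True)
        self.decoder = nn.Linear(hid_sz, vocab_sz)

    def forward(self, x):
        out, _ = self.encoder(self.emb(x))
        return self.decoder(out)

captured = {}

def save_encoding(module, inputs, output):
    # nn.LSTM returns (all_outputs, (h_n, c_n)); keep the last time step.
    all_outputs, _ = output
    captured["doc_vec"] = all_outputs[:, -1].detach()

model = ToyLM().eval()
handle = model.encoder.register_forward_hook(save_encoding)

tokens = torch.randint(0, 100, (1, 7))  # one "document" of 7 token ids
with torch.no_grad():
    logits = model(tokens)              # the hook fires during this call
handle.remove()

doc_vec = captured["doc_vec"]           # encoder output at the last token
```

The hook leaves the model's own forward pass untouched, so the head can stay in place while the encoding is read off on the side.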
I’ve got side-tracked with other projects but when I get round to it I’ll report back with my findings and source code. Alternatively why not try and follow along with the medium article above? I haven’t read it so I don’t know if it’s any good.
Thanks. Yes, I think I have my terminology mixed up here… I will try to have a look if I can make sense of it…
Thanks, and let me know if you make any advances in this direction.
I’m trying to convert a sentence into a vector using ULMFiT’s encoder. Did you do it? I’m a noob to this and any guidance is much appreciated.
I have to use ULMFiT as an auto-encoder to compare reference sentences with machine-generated sentences. Has anyone worked on this?
hey, Nick. Just seeing if you ever tried this.
I’m afraid I did not. Though were I to do it now I’d probably go with BERT.
PyTorch Hub or the Hugging Face repo would be a good place to start…
Fascinating thread. Did anyone successfully extract embedding vectors from a ULMFiT language model?
Also, what is the difference between model weights and embedding vectors?
Short answer is:
import torch

def process_doc(learn, doc):
    xb, yb = learn.data.one_item(doc)
    return xb

def encode_doc(learn, doc):
    xb = process_doc(learn, doc)
    # Reset initializes the hidden state
    awd_lstm = learn.model
    awd_lstm.reset()
    with torch.no_grad():
        out = awd_lstm.eval()(xb)
    # Return final output, for last RNN, on last token in sequence
    return out[-1].detach().numpy()

@Alden Which version of fastai are you using? I can’t find learn.data.one_item in the LanguageModelData class.
@najaf I’m using 1.0.58.
learn.data should be a TextLMDataBunch; one_item is inherited from DataBunch.
How are you creating your data and learner? Ideally you’d use the TextLMDataBunch constructor class methods (e.g. TextLMDataBunch.from_csv) to create your dataset. If you’re creating it from a TextList, I believe calling data.label_for_lm().databunch() after splitting should also give you one.
If it helps, here’s how I’m creating my learner:
from fastai.text.data import load_data
from fastai.text.models.awd_lstm import AWD_LSTM
from fastai.text.learner import language_model_learner

db = load_data(path, 'my_databunch.pkl', bs=64, bptt=80)
learner = language_model_learner(db, AWD_LSTM)
learner = learner.load('my_saved_learner', with_opt=True)
This is a really interesting idea.
Something similar has been explored in DSSM.
The difference is how the embeddings are being generated.
However, there seems to be some confusion regarding the terminology, which I would like to attempt to clarify.
FastAI LM vs Classifier
To make this happen we’re neither gonna be using the language model nor the classifier.
I’m a little unclear whether calling this an autoencoder is correct (in my opinion it isn’t), so I’m not gonna comment on that.
In the fastai context:
First, what’s the difference between a language model and classifier?
Ans: Only the last set of layers.
If you try comparing the two models, the part that’s the same is the MultiBatchEncoder; the LM then uses the LinearDecoder and the classifier uses the PoolingLinearClassifier. Now, just out of curiosity, if you want to understand the essential difference between the two, try looking at the last layer of each. In the classifier:
(6): Linear(in_features=50, out_features=2, bias=True)
This means it takes in a vector of size 50 and gives an output of size 2. In the language model:
(decoder): Linear(in_features=400, out_features=60000, bias=True)
Taking input of size 400 and outputting a vector of size 60000.
This is where softmax is being applied. In the former, the output will either be True or False. In the latter, the output is the next word. One word out of a vocab of 60K.
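To make the two output shapes concrete, here is a rough sketch in plain NumPy. Random weights stand in for the trained final layers and the biases are left out, so the numbers themselves are meaningless; only the shapes and the softmax step mirror the two layers quoted above.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Random weights stand in for the trained final layers (biases omitted for brevity).
clas_w = rng.standard_normal((50, 2), dtype=np.float32)      # like Linear(50 -> 2)
lm_w = rng.standard_normal((400, 60000), dtype=np.float32)   # like Linear(400 -> 60000)

clas_in = rng.standard_normal((1, 50), dtype=np.float32)
lm_in = rng.standard_normal((1, 400), dtype=np.float32)

clas_probs = softmax(clas_in @ clas_w)  # one probability per class
lm_probs = softmax(lm_in @ lm_w)        # one probability per vocabulary word

predicted_class = int(clas_probs.argmax())  # index of the more likely class
next_word_id = int(lm_probs.argmax())       # index of the most likely next word
```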
Coming back to how I think this kind of thing can be implemented.
Inspiration from fastai.collab can be used here.
First, let’s check out the kind of data we could be dealing with.
For a set of queries and documents, there’s a score for how relevant each document is to each query. It’s a bit like the collab module example from the course: how much does a single user like each movie on a scale of 1-5?
Now, what is the model going to look like?
We just need the MultiBatchEncoder. This means we can feed a document or a query to the model and get a vector of size 400 for each.
Now what will the forward pass look like:
We take a query-document pair and pass both through the model, getting two vectors of size 400. Here, taking inspiration from the collab module, we can take the dot product of the vectors.
The dot product will give us one score.
Now, given that the actual and predicted scores are normalized, we want the predicted score to be close to the actual score. The collab module uses MSELoss, which is the way to go in my opinion.
So what does the big picture look like (TL;DR)?
This is exactly like the collab module, but we’ve replaced the randomly initialized embeddings in collab with the MultiBatchEncoder from the text module. And of course we first train the language model on all of the documents.
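The big picture could look something like the sketch below in plain PyTorch. Everything here is an assumption made for illustration: `ToyEncoder` is a small randomly initialized LSTM standing in for the fine-tuned text encoder, `DotProductScorer` is a made-up name, and the token ids and relevance scores are random.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the pretrained text encoder: token ids -> one 400-d vector."""
    def __init__(self, vocab_sz=100, emb_sz=32, hid_sz=400):
        super().__init__()
        self.emb = nn.Embedding(vocab_sz, emb_sz)
        self.rnn = nn.LSTM(emb_sz, hid_sz, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(self.emb(x))
        return out[:, -1]  # hidden state at the last token

class DotProductScorer(nn.Module):
    """Encode query and document with the same encoder, score by dot product."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, query, doc):
        q, d = self.encoder(query), self.encoder(doc)
        return (q * d).sum(dim=1)  # one score per (query, document) pair

torch.manual_seed(0)
model = DotProductScorer(ToyEncoder())
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

queries = torch.randint(0, 100, (4, 5))   # 4 queries, 5 token ids each
docs = torch.randint(0, 100, (4, 12))     # 4 documents, 12 token ids each
relevance = torch.rand(4)                 # made-up normalized relevance scores

pred = model(queries, docs)               # predicted scores, one per pair
loss = loss_fn(pred, relevance)
loss.backward()                           # gradients flow into the shared encoder
opt.step()
```

Sharing one encoder between queries and documents keeps both in the same vector space, which is what makes the dot product comparable across pairs.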
It should be interesting work.
I don’t have enough time to implement this on my own. So if anyone’s looking to collab-orate, I’d like to.
As far as I understand, the trickiest part is gonna be the databunch. Working with fastai, somehow the databunch is really the hardest part to get through.
Thanks. I’ve tried using both the functions you created but they didn’t quite work in my context:
I want to extract embedding vectors from a dataset of n items. When I apply encode_doc to the dataset I get a vector size of 400 (the sum of n vectors encoded as one single item), whereas I want a matrix of shape (n,400). Is there a way to do it?
I don’t know for sure, but process_doc is intended for a single string because of its use of the one_item method. If speed isn’t a constraint in your use case, you could just do a list comprehension like:
[encode_doc(learn, doc) for doc in my_dataset]
I’m sure there’s a much more efficient way to do it if you can get all of your docs into one batch and pass that through the AWD LSTM; you’d have to change encode_doc to return the entire batch rather than the first item (in my case I was assuming the batch had a single item).
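For the simple one-document-at-a-time route, stacking the per-document vectors gives the (n, 400) matrix asked about above. A minimal sketch: `fake_encode_doc` is a hypothetical stand-in that returns a random 400-dimensional vector per string, so the example runs without a trained learner; in practice it would be replaced by a call to encode_doc(learn, doc).

```python
import numpy as np

def fake_encode_doc(doc, dim=400):
    """Hypothetical stand-in for encode_doc(learn, doc): one 1-D vector per string."""
    rng = np.random.default_rng(len(doc))
    return rng.normal(size=dim)

my_dataset = ["first document", "a second, longer document", "third"]

# Encode each document separately, then stack the n vectors into one matrix.
vectors = [fake_encode_doc(doc) for doc in my_dataset]
matrix = np.stack(vectors)  # one row per document
```

This runs one forward pass per document, so it is the slow option; a true batched version would need padding so that all documents in a batch have the same length.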