Using the language model from ULMFiT as an auto-encoder

I think this would be a nice application of the “custom head” procedure that is one of fastai’s basic features. We should have a “custom-head” zoo :)

Regarding effectiveness, I wouldn’t be able to tell you. Why don’t you just go ahead and try it? I will watch this thread :)

Thanks for your reply @seb0. I guess in the absence of any more information I might just give this a go and see what happens, as you suggest!

I’m already familiar with PyTorch so I had planned to just extract the underlying model from the fastai library and use this, but if you happen to know a way already then please let me know. I guess it’s not so much a custom head, but no head! Just using the bare encoding before any fully connected layer is added is my plan.

Hi @safekidda.

Have you got any results for this? I currently want to compare a short text against a lookup table and find the most closely related entries via their document embeddings.

However, I’m not sure where to get these embeddings from in order to compute the cosine similarity.

Hey. You can’t swap and change embeddings I’m afraid. Bear in mind that embeddings are just (a part of) model weights. So in order to compare your short sentence, you need to compare it with weights from the same model.
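
To make the point about embeddings being model weights concrete, here is a minimal sketch (not fastai code). The toy vocabulary, the matrix size and the random values are all made up for illustration. An embedding vector is simply one row of the embedding weight matrix, which is why vectors taken from two different models are not comparable.

```python
import numpy as np

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
vocab = {"the": 0, "cat": 1, "sat": 2}

# The embedding layer's weights: one row per token (random stand-in values).
rng = np.random.default_rng(0)
embedding_weights = rng.normal(size=(len(vocab), 4))

def embed(token):
    """Look up a token's embedding, i.e. index one row of the weight matrix."""
    return embedding_weights[vocab[token]]

cat_vector = embed("cat")
```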

AFAIK, document vectors can be arrived at in a couple of different ways. It may be that you can just generate document vectors for your short text, but I’m not sure TBH. If it’s gensim doc2vec then try this:

https://groups.google.com/forum/m/#!topic/gensim/Fujja7aOH6E

Hi

I thought one can get the embeddings from the language model somehow…?!? But I’m not really sure…

My idea was: for each dataset, get the embeddings (I think in the link they average them) and store them in a dict. Then later, calculate the embeddings for a given text sample and finally look up the closest one in the stored dict…
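
The lookup part of this idea could look roughly like the sketch below. It assumes the document vectors have already been computed somehow; the 3-dimensional toy vectors and the names `stored` and `closest_document` are made up for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_document(query_vec, stored):
    """Return (key, similarity) of the stored vector most similar to query_vec."""
    best_key = max(stored, key=lambda k: cosine_similarity(query_vec, stored[k]))
    return best_key, cosine_similarity(query_vec, stored[best_key])

# Toy vectors stand in for real document encodings.
stored = {
    "doc_a": np.array([1.0, 0.0, 0.0]),
    "doc_b": np.array([0.0, 1.0, 0.0]),
}
key, score = closest_document(np.array([0.9, 0.1, 0.0]), stored)
```

For a large lookup table it would probably be faster to stack the stored vectors into one matrix and compute all similarities in a single matrix product.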

But maybe I don’t get something …

Cheers

I think you’re getting mixed up with embeddings and encodings. Embeddings in this scenario only work at word level, and you won’t have much luck averaging the words of a long document. The article above and what I’m talking about is taking the representation of the sentence or document from the model by treating the model as an auto-encoder, i.e. taking the vector for each document from the last hidden layer after the activation.
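
A toy illustration of that distinction, under loud assumptions: the word vectors and weights are random, and a plain tanh RNN stands in for the real AWD-LSTM encoder. Averaging word embeddings ignores word order, while the final hidden state of a recurrent encoder does not.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 4, 5

# Made-up word embeddings and RNN weights, purely for illustration.
embeddings = {w: rng.normal(size=emb_dim) for w in ["dog", "bites", "man"]}
W_in = 0.5 * rng.normal(size=(hid_dim, emb_dim))
W_hid = 0.5 * rng.normal(size=(hid_dim, hid_dim))

def average_embedding(words):
    """Word-level approach: mean of the word vectors (ignores word order)."""
    return np.mean([embeddings[w] for w in words], axis=0)

def encode(words):
    """Encoder approach: run a toy tanh RNN and keep the final hidden state."""
    h = np.zeros(hid_dim)
    for w in words:
        h = np.tanh(W_in @ embeddings[w] + W_hid @ h)
    return h

a = ["dog", "bites", "man"]
b = ["man", "bites", "dog"]
same_average = np.allclose(average_embedding(a), average_embedding(b))
same_encoding = np.allclose(encode(a), encode(b))
```

Here `same_average` comes out true because the mean does not depend on order, whereas the two encodings differ.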

If that’s what you want to do, then we’re in the same boat. It shouldn’t be too hard. In native PyTorch it’s a doddle, so it’ll just be a case of finding out the supported way in the fastai library without having to hack the source code. I think someone has mentioned hooks. Really, the hard bit, which is creating and fine-tuning a language model, is already laid out for us (fine-tuning will be necessary to make this work well).

I’ve got side-tracked with other projects but when I get round to it I’ll report back with my findings and source code. Alternatively why not try and follow along with the medium article above? I haven’t read it so I don’t know if it’s any good.

Thanks. Yes, I think I have my terminology mixed up here… I will try to have a look if I can make sense of it…

Thanks, and let me know if you make any advances in this direction.

Cheers,
C

Hi,
I’m trying to convert a sentence into a vector using ULMFiT’s encoder. Did you do it? I’m a noob to this and any guidance is much appreciated.
Thanks.

Hi,

I have to use ULMFiT as an auto-encoder to compare reference sentences with machine-generated sentences. Has anyone worked on this?

hey, Nick. Just seeing if you ever tried this.

I’m afraid I did not. Though were I to do it now I’d probably go with BERT.

PyTorch Hub or the Hugging Face repo would be a good place to start…

Fascinating thread. Did anyone successfully extract embedding vectors from ULMFiT Language Model?

Also, what is the difference between model weights and embedding vectors?

@youcefjd I ran into the same issue, was able to figure it out eventually. Wrote up the solution here.

Short answer is:

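import torch

def process_doc(learn, doc):
    # one_item turns a single string into a batch of size 1
    xb, yb = learn.data.one_item(doc)
    return xb

def encode_doc(learn, doc):
    xb = process_doc(learn, doc)
    # learn.model[0] is the AWD-LSTM encoder; reset() initializes its hidden state
    awd_lstm = learn.model[0]
    awd_lstm.reset()
    with torch.no_grad():
        out = awd_lstm.eval()(xb)
    # Return final output, for last RNN ([2]), for the only item in the
    # batch ([0]), on last token in sequence ([-1]); .cpu() allows a GPU model
    return out[0][2][0][-1].detach().cpu().numpy()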

@Alden Which version of fastai are you using? I can’t find learn.data.one_item in the LanguageModelData class.

@najaf I’m using 1.0.58.

learn.data should be a TextLMDataBunch; one_item is inherited from DataBunch, see here.

How are you creating your data and learner? Ideally you’d use TextLMDataBunch constructor class methods (e.g. TextLMDataBunch.from_csv) to create your dataset. If you’re creating it from a TextList, I believe calling data.label_for_lm().databunch() after splitting should also give you a TextLMDataBunch.

If it helps, here’s how I’m creating my learner:

from fastai.text.data import load_data
from fastai.text.models.awd_lstm import AWD_LSTM
from fastai.text.learner import language_model_learner

db = load_data(path, 'my_databunch.pkl', bs=64, bptt=80)
learner = language_model_learner(db, AWD_LSTM)
learner = learner.load('my_saved_learner', with_opt=True)

This is a really interesting idea.
Something similar has been explored in DSSM.
The difference is how the embeddings are being generated.

However, there seems to be some confusion regarding the terminology, which I would like to attempt to clarify.

FastAI LM vs Classifier
To make this happen we’re neither gonna be using the language model nor the classifier.
I’m a little unclear whether calling this an autoencoder is correct (in my opinion it is not), so I’m not gonna comment on that.

In the fastai context:
First, what’s the difference between a language model and classifier?
Ans: Only the last set of layers.
You can see this by comparing the two models (learn_lm.model and learn.model).

The shared part is the encoder: in the classifier it is the MultiBatchEncoder, which wraps the same AWD_LSTM that the lm uses directly. On top of that, the lm uses the LinearDecoder and the classifier uses the PoolingLinearClassifier. If you’re curious about the essential difference between the two, try looking at the last entry in the LinearDecoder and the PoolingLinearClassifier.

PoolingLinearClassifier:
(6): Linear(in_features=50, out_features=2, bias=True)

What this means is it’s taking in a vector of size 50 and giving an output of size two.

LinearDecoder:
(decoder): Linear(in_features=400, out_features=60000, bias=True)

Taking input of size 400 and outputting a vector of size 60000.
Softmax is applied to these outputs. In the former, the output is one of two classes (e.g. True or False). In the latter, the output is the next word: one word out of a vocab of 60K.
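
A rough sketch of the two heads, not the fastai implementation: the weights and inputs are random, biases are left out, and the vocab is shrunk to 1000 words (instead of 60000) to keep the example small. Both heads are a linear map followed by softmax; only the sizes differ.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size = 1000  # assumption: stand-in for the 60000-word vocab

# Language-model head: 400 features in, one logit per vocabulary word out.
lm_weights = rng.normal(scale=0.01, size=(400, vocab_size))
lm_probs = softmax(rng.normal(size=400) @ lm_weights)
next_word_id = int(lm_probs.argmax())

# Classifier head: 50 features in, one logit per class out.
clf_weights = rng.normal(scale=0.1, size=(50, 2))
clf_probs = softmax(rng.normal(size=50) @ clf_weights)
predicted_class = int(clf_probs.argmax())
```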

Coming back
Coming back to how I think this kind of thing can be implemented.
We can take inspiration from fastai.collab here.

First, let’s check out the kind of data we could be dealing with.
For a set of queries and documents, there’s a score for how relevant each document is to each query. It’s a bit like the collab module example from the course: how much does a single user like each movie on a scale of 1-5?

Now, what is the model going to look like?
We just need the MultiBatchEncoder. This means we can feed a document or a query to the model and get a vector of size 400 for each.

Now what will the forward pass look like:
We take a query-document pair and pass both through the model, getting two vectors of size 400. Then, taking inspiration from the collab module, we take the dot product of the two vectors.

The Loss
The dot product will give us one score.
Assuming the actual and predicted scores are normalized, we want the predicted score to be close to the actual score. The collab module uses MSELoss, which is the way to go in my opinion.
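
The forward pass and loss described above can be sketched as follows. This is plain NumPy rather than fastai, and the 2-dimensional vectors and target scores are made up (real encodings would have size 400).

```python
import numpy as np

def predict_scores(query_vecs, doc_vecs):
    """Row-wise dot product: one relevance score per (query, document) pair."""
    return np.sum(query_vecs * doc_vecs, axis=1)

def mse_loss(predicted, actual):
    """Mean squared error between predicted and actual scores."""
    return float(np.mean((predicted - actual) ** 2))

# Two made-up query/document pairs and their (normalized) target scores.
query_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_vecs = np.array([[0.5, 0.5], [0.0, 1.0]])
actual = np.array([1.0, 1.0])

predicted = predict_scores(query_vecs, doc_vecs)  # array([0.5, 1.0])
loss = mse_loss(predicted, actual)                # 0.125
```

In a real training loop the vectors would come from the encoder and the loss would be backpropagated through it, which this sketch does not attempt.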

So what does the big picture look like (TL;DR)?
This is exactly like the collab module, but we’ve replaced the randomly initialized embeddings in collab with the MultiBatchEncoder from the text module. And of course we first train the language model on all of the documents.

Conclusion
It should be interesting work.
I don’t have enough time to implement this on my own. So if anyone’s looking to collab-orate I’d like to.
As far as I understand the trickiest part is gonna be the DataBunch.
Somehow, when working with fastai, the DataBunch is really the hardest part to get through.

Hi Alden,
Thanks. I’ve tried using both the functions you created, but they didn’t quite work in my context:
I want to extract embedding vectors from a dataset of n items. When I apply encode_doc to the dataset I get a vector size of 400 (the sum of n vectors encoded as one single item), whereas I want a matrix of shape (n,400). Is there a way to do it?
Thanks again

Hey @youcefjd,

I don’t know for sure, but process_doc is intended for a single string because of its use of the one_item method. If speed isn’t a constraint in your use case, you could just do a list comprehension like:
[encode_doc(learn, doc) for doc in my_dataset]

I’m sure there’s a much more efficient way to do it if you can get all of your docs into one batch and pass that through the AWD LSTM - you’d have to change encode_doc to return the entire batch rather than the first item (in my case I was assuming the batch had a single item).
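
The indexing change this implies can be sketched with a random array standing in for the last layer's output on a whole batch. The shape (n_docs, seq_len, 400) is an assumption based on the indexing used in encode_doc.

```python
import numpy as np

# Stand-in for the last layer's output on a batch: (n_docs, seq_len, hidden_size).
rng = np.random.default_rng(0)
batch_output = rng.normal(size=(3, 7, 400))

# Single-item version: first document, last token -> shape (400,)
single = batch_output[0][-1]

# Batch version: last token of every document -> shape (3, 400)
all_docs = batch_output[:, -1]
```

If the documents in a batch have different lengths and are padded, the last position may hold padding for some of them (depending on which side the padding is on), so the indexing would need adjusting in that case.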

Not sure if someone is still struggling with this; I have been and couldn’t find what I was looking for on the forums.

This is what I’ve come up with, and it made processing a fairly large dataset pretty painless (<3 minutes when sending data in batches vs ~2 hours when sending one document at a time).

I load my classification model, then add the data I want scored as a test set:

from fastai.text import *  # provides load_learner, TextList and DatasetType in fastai v1

learn = load_learner('classification_model')
full_TL = TextList.from_df(df = newUtt_PD, path=path, cols=['Word'])
learn.data.add_test(full_TL)

I then pull out the data loader again (this feels very ‘hacky’, I’m sure there is a better way):

dl1 = learn.data.dl(ds_type=DatasetType.Test)

I then altered the functions above to now just be:

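import torch

def getembs(mod, btch):
    # Last layer's raw outputs, max-pooled over the sequence dimension:
    # (batch, seq_len, 400) -> (batch, 400). Returned in a one-element list.
    with torch.no_grad():
        return [mod(btch)[0][2].max(1).values.cpu().detach().numpy()]

awd_lstm = learn.model[0]  # the encoder of the classification model
awd_lstm.eval()            # turn dropout off so the encodings are repeatable
awd_lstm.reset()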

Now I can just call all of it and then stack the results:

import numpy as np

batches = [getembs(awd_lstm, i[0])[0] for i in dl1]  # i[0] is the input part of each batch
encodings = np.vstack(batches)  # one row per item
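
For anyone unsure what the max-pool and stack steps do to the shapes, here is a standalone NumPy sketch. Random arrays stand in for the encoder outputs, and the batch sizes and sequence lengths are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool(batch_output):
    """Max over the sequence axis: (batch, seq_len, hidden) -> (batch, hidden)."""
    return batch_output.max(axis=1)

# Two made-up batches with different batch sizes and sequence lengths.
fake_batches = [rng.normal(size=(4, 10, 400)), rng.normal(size=(2, 6, 400))]

pooled = [max_pool(b) for b in fake_batches]  # shapes (4, 400) and (2, 400)
stacked = np.vstack(pooled)                   # shape (6, 400)
```

Because pooling removes the sequence axis, batches of different sequence lengths can be stacked into a single (n, 400) matrix.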

I then ran the encodings through t-SNE and got some pretty pictures.
