Using Language model from ULMFit as an auto-encoder

(Nick) #1


I’m just wondering how effective it would be to use a fine-tuned language model as an auto-encoder instead of turning it into a classifier?

Specifically, I have some search terms (1-10 words) and I want to match them to some longer articles (<500 words). So I was thinking about using the final hidden layer of a language model to produce a vector for each search term or article, then measuring the distances between the vectors to find relevance.
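
Roughly, the matching step I have in mind would look something like this (encode is a stand-in for whatever ends up returning the hidden-state vector; nothing here is fastai-specific):

import numpy as np

def rank_articles(encode, query, articles):
    # encode() is hypothetical: text -> fixed-length vector from the LM
    q = encode(query)
    vecs = [encode(a) for a in articles]
    # Cosine similarity between the query vector and each article vector
    scores = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in vecs]
    # Indices of the articles, most relevant first
    return sorted(range(len(articles)), key=lambda i: -scores[i])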

It’s probably a sledgehammer to crack a nut, but alternative methods such as document vectors strike me as a little dumb. They don’t average well, and usually aren’t fine-tuned to my overall corpus. So I was hoping this would be a more effective way.


(Nick) #2



(Sebastian Fleck) #3

I think this would be a nice application of the “custom head” procedure that is one of fastai’s basic features. We should have a “custom-head” zoo 🙂

Regarding effectiveness, I wouldn’t be able to tell you; why don’t you just go ahead and try it? I will watch this thread 🙂


(Nick) #4

Thanks for your reply @seb0. I guess in the absence of any more information I might just give this a go and see what happens, as you suggest!

I’m already familiar with PyTorch, so I had planned to just extract the underlying model from the fastai library and use that, but if you happen to know a way already, please let me know. I guess it’s not so much a custom head as no head at all! My plan is to use the bare encoding before any fully connected layer is added.


(Christian Werner) #5

Hi @safekidda.

Have you got any results for this? I currently want to take a short text and find the most closely related entries in a lookup table via their document embeddings.

However, I’m not sure where to get these embeddings from in order to compute the cosine similarity.


(Nick) #6

Hey. You can’t mix and match embeddings from different models, I’m afraid. Bear in mind that embeddings are just (a part of the) model weights. So to compare your short sentence with the documents, both need to be encoded by the same model.

AFAIK, document vectors can be arrived at in a couple of different ways. It may be that you can just generate document vectors for your short text, but I’m not sure TBH. If it’s gensim doc2vec then try this: https://groups.google.com/forum/#!topic/gensim/Fujja7aOH6E
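
For example, with gensim’s doc2vec you can infer a vector for unseen text; a minimal sketch (the model path is hypothetical):

from gensim.models.doc2vec import Doc2Vec

# Load a previously trained doc2vec model (path is just an example)
model = Doc2Vec.load('my_doc2vec.model')

# infer_vector takes a list of tokens and returns a document vector
vec = model.infer_vector('my short search text'.split())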


(Christian Werner) #7


I thought one could get the embeddings from the language model somehow… but I’m not really sure.

My idea was: for each document, get the embeddings (I think in the link they average them) and store them in a dict. Then later, calculate the embeddings for a given text sample and finally look up the closest one in the stored dict…
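
In code, roughly this (a minimal sketch; the stored vectors would come from whatever produces the embeddings):

import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# stored maps document id -> embedding vector, built once from the corpus
def closest_doc(sample_vec, stored):
    return max(stored, key=lambda doc_id: cosine(sample_vec, stored[doc_id]))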

But maybe I’m missing something…



(Nick) #8

I think you’re mixing up embeddings and encodings. Embeddings in this scenario only work at word level, and you won’t have much luck averaging the words of a long document. What the article above describes, and what I’m talking about, is taking the representation of the sentence or document from the model by treating it as an auto-encoder, i.e. taking the vector for each document from the last hidden layer after the activation.

If that’s what you want to do, then we’re in the same boat. In native PyTorch it’s a doddle, so it’ll just be a case of finding the supported way to do it in the fastai library without having to hack the source code. I think someone has mentioned hooks. Really, the hard bit is already laid out for us: creating and fine-tuning a language model (fine-tuning will be necessary to make this work well).
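
For the hooks route, a minimal sketch might look like this (assuming fastai’s AWD_LSTM encoder, which keeps its stacked LSTMs in .rnns; the encoder variable is illustrative, e.g. learn.model[0]):

import torch

activations = {}

def grab_output(module, inp, out):
    # For an LSTM module, out is (output, (h, c)); keep just the output tensor
    activations['enc'] = out[0] if isinstance(out, tuple) else out

# Attach the hook to the last LSTM in the encoder
handle = encoder.rnns[-1].register_forward_hook(grab_output)

# After a forward pass, activations['enc'] holds the last layer's outputs;
# call handle.remove() when done.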

I’ve got side-tracked with other projects, but when I get round to it I’ll report back with my findings and source code. Alternatively, why not try to follow along with the Medium article above? I haven’t read it, so I don’t know if it’s any good.


(Christian Werner) #9

Thanks. Yes, I think I have my terminology mixed up here… I will have a look and see if I can make sense of it…

Thanks, and please let me know if you make any advances in this direction.



(Thomas Paul) #10

I’m trying to convert a sentence into a vector using ULMFiT’s encoder. Did you manage to do it? I’m a noob at this, and any guidance is much appreciated.


(Sanjita Suresh) #11


I have to use ULMFiT as an auto-encoder to compare reference sentences with machine-generated sentences. Has anyone worked on this?


(p) #12

Hey, Nick. Just checking to see if you ever tried this.


(Nick) #13

I’m afraid I did not. Though were I to do it now I’d probably go with BERT.

PyTorch Hub or the Hugging Face repo would be a good place to start…
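
For instance, with recent versions of the Hugging Face transformers library, something along these lines gives you a sentence vector (mean-pooling over tokens is just one common choice):

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased').eval()

inputs = tok('my short search query', return_tensors='pt')
with torch.no_grad():
    out = model(**inputs)

# Mean-pool the token vectors into a single sentence vector
vec = out.last_hidden_state.mean(dim=1).squeeze(0)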


(Youcef Djeddar) #14

Fascinating thread. Did anyone successfully extract embedding vectors from the ULMFiT language model?


(Youcef Djeddar) #15

Also, what is the difference between model weights and embedding vectors?



(Alden) #16

@youcefjd I ran into the same issue and was able to figure it out eventually. I wrote up the solution here.

Short answer is:

import torch

def process_doc(learn, doc):
    # one_item numericalizes a single document into a batch of one
    xb, yb = learn.data.one_item(doc)
    return xb

def encode_doc(learn, doc):
    xb = process_doc(learn, doc)
    awd_lstm = learn.model[0]
    # Reset initializes the hidden state
    awd_lstm.reset()
    with torch.no_grad():
        out = awd_lstm.eval()(xb)
    # Return final output, for last RNN, on last token in sequence
    return out[0][2][0][-1].detach().numpy()
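
To compare two documents, you can then take, for example, the cosine similarity of the returned vectors:

import numpy as np

v1 = encode_doc(learn, 'first bit of text')
v2 = encode_doc(learn, 'second bit of text')
similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))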

(Najaf Murtaza) #17

@Alden Which version of fastai are you using? I can’t find one_item in the LanguageModelData class.



(Alden) #18

@najaf I’m using 1.0.58. Your data should be a TextLMDataBunch; one_item is inherited from DataBunch, see here.

How are you creating your data and learner? Ideally you’d use TextLMDataBunch constructor class methods (e.g. TextLMDataBunch.from_csv) to create your dataset. If you’re creating it from a TextList, I believe calling data.label_for_lm().databunch() after splitting should also give you a TextLMDataBunch.
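
For example, both of these routes should give you a TextLMDataBunch (file and column names are illustrative):

from fastai.text import TextLMDataBunch, TextList

# Directly via a constructor class method
data = TextLMDataBunch.from_csv(path, 'texts.csv')

# Or via the data block API
data = (TextList.from_csv(path, 'texts.csv', cols='text')
        .split_by_rand_pct(0.1)
        .label_for_lm()
        .databunch(bs=64))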

If it helps, here’s how I’m creating my learner:

from fastai.text import load_data
from fastai.text.models.awd_lstm import AWD_LSTM
from fastai.text.learner import language_model_learner

db = load_data(path, 'my_databunch.pkl', bs=64, bptt=80)
learner = language_model_learner(db, AWD_LSTM)
learner = learner.load('my_saved_learner', with_opt=True)