Using a language model for document embeddings

We tried to use a language model to find similar documents. It didn’t work.

Hi everyone! I work for the Quartz AI Studio where we experiment with AI techniques for news reporting.

We’d been thinking about how to help investigative journalists search through hundreds of thousands of leaked documents – like we did for this project using doc2vec. This time, though, we wanted to use a more modern technique for calculating document similarity than doc2vec.

We had a lot of hope for this experiment, but it didn’t seem to work well. We wanted to share it for discussion – and to see if you all have any ideas why it failed.

Here’s what we tried:

We took apart a Fastai classifier model and put it back together again to generate a 1,200-dimensional embedding of the input document (instead of a binary classification). (See notebook). That 1,200-dimensional embedding is identical to the input to the linear portion of the classification model; we’ve just lopped off the final few layers.

Once we have that embedding, we can compare different documents’ embeddings with cosine similarity, giving us a measure of document similarity that we can use for unsupervised clustering of documents.
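To make the comparison step concrete, here’s a minimal, dependency-free sketch of cosine similarity between two document embeddings. The toy 3-dimensional vectors are made up for illustration; the real embeddings would be 1,200-dimensional.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    dot(a, b) / (|a| * |b|). Ranges from -1 to 1; higher = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "document embeddings": doc_a and doc_b point in a similar
# direction, doc_c in a very different one.
doc_a = [0.9, 0.1, 0.3]
doc_b = [0.8, 0.2, 0.25]
doc_c = [-0.7, 0.9, -0.2]

print(cosine_similarity(doc_a, doc_b))  # close to 1
print(cosine_similarity(doc_a, doc_c))  # negative
```

Because cosine similarity ignores vector magnitude and looks only at direction, two documents of very different lengths can still score as similar if their embeddings point the same way.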

Practically, if we imagine that we’re investigative journalists with a pile of documents too big to read, we can find an interesting document manually, then ask the computer to find us more similar documents.

But it didn’t work very well.

We trained that language model (with default settings) on a 4,000-page set of documents from New York City mayor Bill de Blasio’s administration (PDF), treating each page as a separate document. Then, rather than handing the language model to a text classification learner, we wrote custom code to get the 400-element vector that the language model outputs for each input document, and concatenated it with an average-pooled vector and a max-pooled vector. We stored each document’s 1,200-element vector in an ordinary Python list.
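The pooling step described above – last hidden state, plus mean-pool and max-pool over all tokens – can be sketched in plain Python. The toy 2-dimensional hidden states are hypothetical; the real LSTM emits 400-dimensional states per token, so concatenating three of them yields the 1,200-element document vector.

```python
def concat_pool(token_vectors):
    """Concatenate the last token's hidden state with the element-wise
    mean and max over all tokens. Turns an (num_tokens x d) sequence
    of hidden states into a single 3*d document vector."""
    d = len(token_vectors[0])
    last = token_vectors[-1]
    mean = [sum(v[i] for v in token_vectors) / len(token_vectors) for i in range(d)]
    mx = [max(v[i] for v in token_vectors) for i in range(d)]
    return last + mean + mx

# Toy example: three tokens, each with a 2-d hidden state,
# produce one 6-element document vector (last + mean + max).
hidden = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
print(concat_pool(hidden))
```

Mean-pooling captures the average tone of the whole document while max-pooling keeps the strongest activation of each feature anywhere in it; keeping both (plus the final state) is the same trick fastai’s pooled classifier head uses before its linear layers.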

To see how well the model was doing, I got the list of pages that contained the words “homeless” or “affordable housing” – to approximate labeled validation data. Then I picked three of those pages, ran them through the language model, and averaged their output vectors; my hope was that this averaged vector would land in the neighborhood of the model’s concept of the homelessness/affordable-housing issue. Finally, I ranked every other document in the dataset by its cosine similarity to that averaged vector.
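A minimal sketch of that query step – average a few seed embeddings, then rank the rest of the corpus by cosine similarity to the average. The document names and 3-d vectors are invented for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Hypothetical embeddings of three hand-picked seed pages on one topic.
seed_docs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [1.0, 0.0, 0.05]]

# Hypothetical embeddings of the rest of the corpus.
corpus = {
    "memo_12": [0.85, 0.15, 0.05],
    "article_7": [0.0, 1.0, 0.3],
    "email_3": [0.7, 0.3, 0.0],
}

query = average(seed_docs)  # the "topic neighborhood" vector
ranked = sorted(corpus, key=lambda doc: cosine(corpus[doc], query), reverse=True)
print(ranked)  # most similar document first
```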

I hoped that most of the top 30 or so most-similar documents would be in the list of pages containing either “homeless” or “affordable housing”. But they weren’t. Only about 30% were, and the ones that weren’t were truly irrelevant – like a Politico article about Sen. Mary Landrieu’s re-election prospects in Louisiana, or a document about the mayor’s universal pre-kindergarten plans. (Theoretically, a document that isn’t in that list could still be a true positive if it’s about the topic without using either phrase.)
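The eyeball check above is essentially precision@k: the fraction of the top-k retrieved documents that fall in the (approximate) relevant set. A small sketch, with hypothetical page IDs:

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=30):
    """Fraction of the top-k retrieved documents that appear in the
    relevant set (here, pages containing one of the keywords)."""
    top_k = ranked_doc_ids[:k]
    hits = sum(1 for doc in top_k if doc in relevant_ids)
    return hits / len(top_k)

# Toy check: 3 of the top 10 results are keyword matches -> 0.3,
# roughly the hit rate reported above.
ranked = [f"page_{i}" for i in range(10)]
relevant = {"page_0", "page_4", "page_7"}
print(precision_at_k(ranked, relevant, k=10))  # 0.3
```

Because the keyword list is only a proxy for relevance, this metric is a lower bound: a genuinely on-topic page that avoids both phrases counts as a miss.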

So what went wrong?

I don’t know. Do you have any ideas?

One hunch of mine is that the documents are too long (hundreds of words) and the LSTMs in the language model only take into account context going back a limited number of tokens. So maybe it would work better with shorter documents. Does that make sense?


I can’t get that link you posted about your latest work to open in my browser. I get a 404 – on GitHub that usually means no permissions. Can you share it publicly?

I want to try your method first before posting too many comments. There are some things that don’t quite seem right, but they could all be red herrings. I would rather try out your code and then come back with concrete thoughts.

@bfarzin. Ha, oops. Sorry about that. I had forgotten to make the repo public. Thanks for offering to take a look.

Hi @jbfm, sounds like an interesting project! I’m working on something similar - just getting started so nothing to share yet, and I’d love to see your code if you make it public.

Indeed, the length of the documents might be a problem – I think the default setting is that an LM learner looks at sequences of only 70 tokens.

Another intuition is that maybe cosine similarity can’t capture the kind of document properties you’re looking for. I don’t know what’s happening in the multidimensional space represented by the document embedding vector, but even if “homelessness” is one of the features it covers, and it keeps the “homelessness” documents close along that dimension, there are hundreds of other features that may keep them far apart… When we train a classifier, the model learns which features are relevant to the problem we’re trying to solve, but cosine similarity has no clue which features are important. Good luck with the project – please share when you find out how to solve this!


Thanks @darek.kleczek. The code is here:

That’s a very helpful point about 70 token sequences. Do you remember where you saw that? (I believe you! I just wasn’t able to find it myself in the docs and want to try adjusting it.) I want to try this same methodology with another set of smaller documents (political Facebook ads from the USA).

The reason I’m optimistic about cosine similarity working is that it has worked on similar document sets in the past. It’s what I used with doc2vec in the Mauritius project linked above; I’ve also used it with Universal Sentence Encoder vectors for this same NYC document set, and it works well.

I believe this is the bptt argument that you pass to TextDataBunch - Thanks for the code, I’ll review it later during the week!
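To illustrate why `bptt` matters for long pages: backpropagation through time processes the token stream in fixed-length windows, so gradients never flow across a window boundary. A simplified sketch (the tokenization is assumed; fastai’s actual batching also concatenates and shuffles texts):

```python
def bptt_chunks(tokens, bptt=70):
    """Split a token stream into fixed-length windows, mimicking how
    an LSTM language model truncates backpropagation at `bptt` tokens."""
    return [tokens[i:i + bptt] for i in range(0, len(tokens), bptt)]

# A ~300-token "page" becomes several windows; with the default
# bptt of 70, no gradient spans more than 70 tokens.
page = ["tok"] * 300
chunks = bptt_chunks(page, bptt=70)
print(len(chunks), [len(c) for c in chunks])  # 5 windows: 70x4 + 20
```

So even if the hidden state carries some information forward, the model is never *trained* to use context longer than one window – which supports the hunch that multi-hundred-word pages are a poor fit.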

What does the language model learn? Without training it on a classification task, the hidden state of the language model encodes information that might be useful for predicting the subsequent word. It would probably be a bit of a stretch to hope that this information can be used for producing document embeddings.

I am not sure what relevant literature I could point you to, but a quick Google search turned this up: Following the links from that page could lead you to relevant research and suggest what tasks you could train a modified model on to produce more useful document embeddings!


I’m working on something very similar, but using TransformerXL instead of LSTM.

My problem is that, for some reason, the learner seems to accumulate state and grow in memory every time I ask it to encode a string, even though I used torch.no_grad and .model.eval(). After about 30 documents it has grown by gigabytes and I have to kill it.

I’m also not sure how to go from word embeddings to document embeddings. I adapted the pooled linear classifier as well, but I’m confused about how faithfully it can represent a document.

Hey @jbfm, did you make any progress on this project? Do you know what the best pre-trained model for encoding documents into vectors is nowadays? My intent is to check how well document embeddings perform for detecting plagiarism.

I gave up on this approach, but I’ve had success with using Google’s Universal Sentence Encoder for finding similar documents:

I’d be interested to hear if it works for you for plagiarism detection.

I split documents into sentences, indexed the USE vectors, then found vectors similar to a query sentence. So there was no training involved. There’s code linked from the blog post above.