We tried to use a language model to find similar documents. It didn’t work.
Hi everyone! I work for the Quartz AI Studio, where we experiment with AI techniques for news reporting.
We’d been thinking about how to help investigative journalists search through hundreds of thousands of leaked documents – like we did for this project using doc2vec – except this time we wanted to use a more modern, Fast.ai-based technique for calculating document similarity instead of doc2vec.
We had a lot of hope for this experiment, but it didn’t seem to work well. We wanted to share it for discussion – and to see if you all have any ideas why it failed.
Here’s what we tried:
We took apart a Fast.ai classifier model and put it back together again to generate a 1,200-dimensional embedding of the input document (instead of a binary classification). (See notebook.) That 1,200-dimensional embedding is identical to the input to the linear portion of the classification model; we’ve just lopped off the bottom few layers.
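To make that pooling step concrete, here’s a minimal sketch of the concat-pooling we mean. It assumes you’ve already run one document through the encoder and have the last LSTM layer’s activations as a `(seq_len, 400)` tensor – the names are illustrative, not our exact notebook code:

```python
import torch

def concat_pool(outputs: torch.Tensor) -> torch.Tensor:
    """outputs: (seq_len, 400) activations of the encoder's last LSTM layer
    for one document. Returns the 1,200-dimensional concat-pooled embedding."""
    last = outputs[-1]             # hidden state at the final time step -> (400,)
    mx   = outputs.max(dim=0)[0]   # max over the sequence               -> (400,)
    avg  = outputs.mean(dim=0)     # mean over the sequence              -> (400,)
    return torch.cat([last, mx, avg])   # -> (1200,)
```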
Once we have those embeddings, we can compare documents with cosine similarity, which gives us a measure of document similarity we can use for unsupervised clustering.
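For reference, cosine similarity is just the normalized dot product of two embeddings; a quick sketch in plain numpy (the function name is ours, for illustration):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """1.0 means the embeddings point in the same direction; 0.0 means orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```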
Practically, if we imagine that we’re investigative journalists with a pile of documents too big to read, we can find an interesting document manually, then ask the computer to find us more similar documents.
But it didn’t work very well.
We trained that language model (with default settings) on a 4,000-page set of documents from New York City Mayor Bill de Blasio’s administration (PDF), treating each page as a separate document. Then, rather than providing the language model to a text classification learner, we wrote custom code to get the 400-element vector that the language model outputs for each input document and concatenated it with an average-pooled vector and a max-pooled vector. We stored each document’s 1,200-element vector in an ordinary Python list.
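The training itself was roughly the standard Fast.ai recipe – something like the sketch below, assuming the fastai v1 text API and a dataframe of page texts (the dataframe and column names are placeholders, not our actual code):

```python
from fastai.text import TextLMDataBunch, language_model_learner, AWD_LSTM

# pages_df / valid_df: dataframes with one page of text per row (placeholder names)
data_lm = TextLMDataBunch.from_df('.', train_df=pages_df, valid_df=valid_df,
                                  text_cols='text')

# default settings: AWD-LSTM architecture with pretrained WikiText-103 weights
learn = language_model_learner(data_lm, AWD_LSTM)
learn.fit_one_cycle(1, 1e-2)   # fine-tune the new head first
learn.unfreeze()
learn.fit_one_cycle(2, 1e-3)   # then fine-tune the whole model
```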
To see how well the model was doing, I got the list of pages that contained the word “homeless” or the phrase “affordable housing” – to approximate labeled validation data. Then I picked three of those pages, ran them through the language model and averaged their output vectors; my hope was that this averaged vector would land in the neighborhood of the model’s concept of the homelessness/affordable-housing issue. Finally, we ranked every other document in the dataset by its cosine similarity to that averaged vector.
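In code, that ranking step looked roughly like this – a sketch with made-up variable names, not the notebook verbatim:

```python
import numpy as np

# doc_vectors: {page_id: 1,200-d numpy array} for every page (placeholder name)
# seed_ids: the three hand-picked homelessness / affordable-housing pages
def rank_by_similarity(doc_vectors, seed_ids):
    seed = np.mean([doc_vectors[pid] for pid in seed_ids], axis=0)
    scores = {}
    for pid, vec in doc_vectors.items():
        if pid in seed_ids:
            continue
        scores[pid] = float(np.dot(vec, seed) /
                            (np.linalg.norm(vec) * np.linalg.norm(seed)))
    # most-similar pages first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```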
I hoped that most of the top 30 or so most-similar documents would be in the list of documents containing either the word “homeless” or the phrase “affordable housing”. But they weren’t. Only about 30% were in that list, and the ones that weren’t were truly irrelevant, like a Politico article about Sen. Mary Landrieu’s re-election prospects in Louisiana or a document about the mayor’s universal pre-kindergarten plans. (Theoretically, documents that aren’t in that list might still be true positives if they’re about the topic but don’t use either phrase.)
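The check itself was simple – something along these lines, with illustrative names (`ranked` is the output of the ranking step above, `page_texts` maps page ids to raw text):

```python
keyword_pages = {pid for pid, text in page_texts.items()
                 if 'homeless' in text.lower() or 'affordable housing' in text.lower()}

top_30 = [pid for pid, score in ranked[:30]]
hits = sum(pid in keyword_pages for pid in top_30)
print(f'{hits} of the top 30 most-similar pages contain the keywords')
```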
So what went wrong?
I don’t know. Do you have any ideas?
One hunch of mine is that the documents are too long (hundreds of words) and the LSTMs in the Fast.ai language model only retain context over a much smaller number of words. So maybe it would work better with shorter documents. Does that make sense?
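If that hunch is right, one cheap test would be to split each page into shorter chunks before embedding – something like this (the 100-word chunk size is an arbitrary guess):

```python
def chunk_words(text: str, size: int = 100):
    """Split a page into consecutive chunks of roughly `size` words."""
    words = text.split()
    return [' '.join(words[i:i + size]) for i in range(0, len(words), size)]
```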