NLP: Using fastai word embeddings to cluster unlabeled documents


(Nathan) #1

I have a question on the best way to use fastai word embeddings to cluster unlabeled documents.

Say, for example, I have 100 documents that produce a vocabulary of 500 words. Using the excellent fastai LM learner I can create word embeddings for that 500-word vocab, so that I have 500 vectors of length 400, one for each distinct word found in the corpus.
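(For concreteness, the setup I have in mind is roughly the following, using the fastai v1 text API; `data_lm` is just a placeholder name for a TextLMDataBunch built from the 100 documents.)

```python
from fastai.text import *

# data_lm: a TextLMDataBunch built from the 100 unlabeled documents (placeholder name)
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn.fit_one_cycle(1, 1e-2)  # fine-tune the language model on the corpus
```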

```python
# embedding matrix of the fine-tuned AWD_LSTM encoder: (vocab size, embedding dim)
learn.model[0].encoder.weight.shape
# -> torch.Size([500, 400])
```

I can then take each document and replace each word with its word vector. This gives me a document represented as a sequence of word embeddings produced by the fastai LM learner. I then want to pass the collection of documents to a clustering algorithm such as scikit-learn's K-means.

The question is: what is the best way to combine/condense/aggregate the collection of word vectors that make up each document, so that a single vector representing each document can be passed to K-means for processing?
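To make the question concrete, the simplest aggregation I can think of is mean-pooling each document's word vectors and handing the result to K-means. A rough sketch of that (with `doc_ids` as a placeholder for the numericalized documents, using the same vocab as the LM):

```python
import numpy as np
from sklearn.cluster import KMeans

# embedding matrix pulled out of the fine-tuned language model: (500, 400)
emb = learn.model[0].encoder.weight.detach().cpu().numpy()

# doc_ids: list of 100 documents, each a list of vocab indices (placeholder name)
doc_vectors = np.stack([emb[ids].mean(axis=0) for ids in doc_ids])  # (100, 400)

kmeans = KMeans(n_clusters=5, random_state=0).fit(doc_vectors)
print(kmeans.labels_)  # one cluster id per document
```

But a plain average throws away word order and weights every token equally, which is exactly the part I'm unsure about.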

I’ve done this successfully before with Doc2Vec, but I think fastai will be a better, more powerful solution thanks to the way the language model is built and fine-tuned.

Any advice is welcome, and if I’m barking up the wrong tree I’d be happy to hear that too. For example, in this post Jeremy mentions that fastai creates a whole model and not just embeddings. However, it isn’t readily apparent to me how the model itself could be used to represent a whole document numerically, so the tuned word vectors seemed like the best way to take advantage of fastai here.
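If it helps to show where I get stuck: my rough guess at "using the model" would be to run each numericalized document through the LM's encoder and pool the hidden states of its last layer, something like the sketch below, but I don't know whether that's the intended approach. (The `(1, seq_len)` input shape, the `reset()` calls, and `doc_ids` are my assumptions about the fastai v1 AWD_LSTM, not something I've seen in the docs.)

```python
import torch

encoder = learn.model[0]  # the AWD_LSTM encoder of the language model
encoder.reset()           # clear the LSTM hidden state
encoder.eval()
device = next(encoder.parameters()).device

doc_vectors = []
with torch.no_grad():
    for ids in doc_ids:                              # ids: list of vocab indices
        xb = torch.tensor(ids, device=device)[None]  # shape (1, seq_len)
        raw_outputs, outputs = encoder(xb)           # per-layer activations
        doc_vectors.append(outputs[-1].mean(dim=1)[0].cpu())  # mean-pool last layer -> (400,)
        encoder.reset()
```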

Thanks!