NLP: Using fastai word embeddings to cluster unlabeled documents

I have a question on the best way to use fastai word embeddings to cluster unlabeled documents.

Say for example I have 100 documents that result in a vocabulary of 500 words. Using the excellent fastai LM learner I can create word embeddings for the 500 word vocab such that I have 500 vectors of length 400 representing the 500 distinct words found in the corpus.

-> torch.Size([500, 400])

I can then take each document and substitute the word vector for each word in the sentence. This gives me a document of word embeddings created by the fastai LM learner. I then want to pass the collection of documents to a clustering algorithm such as Scikit-learn’s K-means clustering algorithm.

The question is what is the best way to combine/condense/aggregate the collection of word vectors that make up each document, so that a single vector representing each document can be passed to the K-means algorithm for processing?

I’ve done this before successfully utilizing Doc2Vec, but I think that fastai will be a better, more powerful solution due to the development and tuning of the language model.

Any advice is welcome, or if I’m barking up the wrong tree I’d be happy to hear about that too. For example, in this post Jeremy mentions that fastai creates a whole model and not just embeddings. However, it isn’t readily apparent to me how the model itself could be used to represent a whole document numerically, and so the tuned word vectors seemed like the best way to take advantage of the benefits of fastai in this instance.




I wanted to follow up on this, since I’ve been very busy and hadn’t looped back on it. I’m going to provide a high level overview of how I approached this project, and if anyone wants more details I’d be happy to provide them.

I was working on unsupervised classification of over 250K parole complaint revocation narratives written in natural language for the Colorado Dept. of Corrections (DOC) over a several year span. I had previously worked on this task utilizing Doc2Vec, and I wanted to see how Fast AI’s language model would perform.

I started by examining and pre-processing the data (i.e. removing null records and so forth). Next I cleaned the text. This was a bit different from the IMDb sentiment workflow, as I didn’t need a language model that “understood” the connections between words and meanings. Since this was more of a classification task, I found that the model actually suffered from too much filler text, stop words, and non-categorical verbiage.

Ex: There were three records for offender X, and two of those records mentioned controlled substance charges. The third record only made brief mention of a deadly weapon, but included X’s name. The algorithm wanted to cluster the record on offender X’s name instead of grouping the record in with the other weapons related documents. Removing all names from the data set for example allowed the model to correctly identify the last record and group it correctly.

The cleaning process I found to be effective was as follows:

  1. Removed line breaks and other formatting items

  2. Removed slashes as there were many references to items such as "drug/alcohol"

  3. Removed non-alpha characters

  4. Removed English stop words except for the word "not", as this was found in meaningful descriptions such as "… controlled substance possession charges not weapons related …"

  5. Fixed words incorrectly joined by lack of spaces. (Ex: "… while\nincustody" should be "while in custody")

  6. Fixed spelling mistakes

  7. I also ran everything through a tailored list of custom stop words specific to the DOC in order to help the model focus on the important text only and de-noise the data
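
For anyone who wants something concrete, the cleaning steps above can be sketched roughly as follows. This is a minimal illustration and not the code used on the DOC project: the stop-word sets are tiny placeholders, `CUSTOM_STOP_WORDS` is hypothetical, lowercasing is an added assumption, and steps 5 and 6 (splitting joined words and fixing spelling) are left out because they need a dictionary or spell-checker.

```python
import re

# Tiny illustrative stop-word list; a real run would use a full English list.
ENGLISH_STOP_WORDS = {"the", "a", "an", "and", "for", "of", "to", "was", "were", "in", "on", "with"}
KEEP_WORDS = {"not"}                         # negation carries meaning, so keep it
CUSTOM_STOP_WORDS = {"offender", "parolee"}  # hypothetical domain-specific noise words


def clean_narrative(text: str) -> str:
    """Apply steps 1-4 and 7 of the cleaning list to one narrative."""
    text = text.lower()                      # assumption: case is not meaningful
    text = re.sub(r"[\r\n\t]+", " ", text)   # 1. line breaks and formatting
    text = text.replace("/", " ")            # 2. slashes, e.g. "drug/alcohol"
    text = re.sub(r"[^a-z\s]", " ", text)    # 3. non-alpha characters
    drop = (ENGLISH_STOP_WORDS | CUSTOM_STOP_WORDS) - KEEP_WORDS
    tokens = [tok for tok in text.split() if tok not in drop]  # 4. and 7.
    return " ".join(tokens)


cleaned = clean_narrative("Charges for drug/alcohol use,\nnot weapons related (2 counts)")
```

Replacing slashes and other non-alpha characters with a space, rather than deleting them, keeps "drug/alcohol" from collapsing into a single joined word.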

I took the outputs of the steps above and performed tokenization/numericalization, created an 80/20 train/validation split, and then built and trained the language model.

This gave me a collection of tokenized vocabulary words mapped to numerical IDs, which were in turn mapped to 200-dimension matrices created by the language model. I took the collection of documents I wanted to cluster and replaced each vocabulary word found within them with its matching word matrix.
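
The word-to-ID-to-vector lookup described here might look something like the sketch below. The `vocab` dictionary, the `<unk>` entry, and the random `embeddings` array are all stand-ins made up for illustration; in practice the weights would come from the trained language model's embedding layer, and how you extract them depends on the fastai version.

```python
import numpy as np

# Hypothetical toy vocabulary: token -> numerical ID, with ID 0 for unknown words.
vocab = {"<unk>": 0, "controlled": 1, "substance": 2, "weapon": 3, "not": 4}

# Stand-in for the trained embedding weights: one 200-dimensional row per ID.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 200))


def document_to_vectors(tokens, vocab, embeddings):
    """Map each token to its ID, then to the matching embedding row."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return embeddings[ids]  # shape: (number of tokens, embedding size)


doc_vectors = document_to_vectors(
    ["controlled", "substance", "not", "weapon"], vocab, embeddings
)
```

Each document ends up as an array with one row per token, so documents of different lengths produce arrays of different shapes.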

At this point I knew I needed to aggregate all the word matrices that made up a particular document, so that the resulting structure could be fed into the Scikit-Learn K-Means clustering algorithm for clustering. I tried both stacking the matrices and averaging the values in order to achieve this. I found that stacking the matrices resulted in much more accurate and sensible results, and so I settled on this method.
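
A rough sketch of the two aggregation options is below. It rests on an assumption about what "stacking" means: here the word vectors are concatenated into one long vector, zero-padded or truncated to a fixed number of words (`max_tokens`, a hypothetical parameter), since K-means needs every document to have the same length. The documents are random stand-ins, and `n_clusters=2` is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans


def average_vectors(doc_vectors):
    """Collapse a (tokens, dim) array into one dim-length vector by averaging."""
    return doc_vectors.mean(axis=0)


def stack_vectors(doc_vectors, max_tokens):
    """Concatenate word vectors into one long vector, zero-padded or
    truncated to max_tokens words so every document has the same length."""
    dim = doc_vectors.shape[1]
    padded = np.zeros((max_tokens, dim))
    n = min(max_tokens, doc_vectors.shape[0])
    padded[:n] = doc_vectors[:n]
    return padded.reshape(-1)


# Random stand-ins for six documents of varying length with 200-dim embeddings.
rng = np.random.default_rng(0)
documents = [rng.normal(size=(n_tokens, 200)) for n_tokens in (5, 8, 3, 6, 4, 7)]

averaged = np.stack([average_vectors(d) for d in documents])             # (6, 200)
stacked = np.stack([stack_vectors(d, max_tokens=8) for d in documents])  # (6, 1600)

# Either matrix can be passed to K-means; one row per document.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(stacked)
```

Averaging discards word order and document length, while the stacked form keeps both at the cost of a much wider input, which may be part of why the two behaved differently.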

After clustering the documents I created word clouds and graphs showing the most commonly utilized words in the clustered documents, and I was able to perform meaningful analysis on the results as well as provide recommendations for improvement. I’m happy to say the results were well received within the DOC. This was the first time machine learning was being used in a visible way, and so I wanted to work overtime to ensure the first foray was a positive experience for the organization. Also, a number of changes to the parole revocation system are being planned as a result of this project which is very satisfying professionally.
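
Counting the most common words per cluster, which is the input a word cloud or bar chart needs, can be sketched with the standard library alone. The documents and labels below are made up for illustration, and `top_words_per_cluster` is a hypothetical helper, not part of fastai or scikit-learn.

```python
from collections import Counter, defaultdict


def top_words_per_cluster(documents, labels, n=3):
    """Return the n most frequent tokens in each cluster."""
    counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        counts[label].update(tokens)
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Hypothetical cleaned, tokenized documents and their cluster labels.
documents = [
    ["controlled", "substance", "possession"],
    ["substance", "possession", "substance"],
    ["deadly", "weapon", "possession"],
]
labels = [0, 0, 1]

summary = top_words_per_cluster(documents, labels, n=2)
```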

As I said before if anyone has any questions or needs more details please let me know.

Thank you.


I read this just now, but I wanted to thank you for sharing this anyway.

You are very welcome; thank you for taking the time to read it. :slight_smile:

Hi Nathan,
Thank you for your post. I am currently working on a similar project where I want to cluster articles using the embeddings from the language model. The approach I have in mind is similar to what you suggested initially: take the 400-dim word vectors and get a condensed vector representation of the article. Could you let me know if this approach has any drawbacks? I am interested to know more about your approach and implementation.

Super helpful, thank you! Do you have any code you’re able to share that shows how to go about gathering all the document matrices and joining them together, along with the tokenization and numericalization data for the documents, on data bunch creation? Basically, I’m not sure how to get the tokenization/numericalization and other relevant data into a form that is ready to be fed into scikit-learn for a k-means operation.

Sample code would help me as I’m very new to fastai and data science in Python.

Thank you.