Hi!
I am working on a proof-of-concept application to cluster short sentences, in English for now, but the idea is to be able to expand the model to be language-independent. As a fallback, I’m weighing a specialized solution per language against a single distinct solution for all the other languages, perhaps based on subword tokenization.
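Just to make the subword-tokenization idea concrete, this is roughly what I have in mind, sketched with the Hugging Face tokenizers library purely for illustration (the tiny corpus and the vocab size are made up):

# Illustration only: train a small BPE (subword) tokenizer on a handful of
# sentences and look at the pieces it produces. The corpus and vocab_size
# are made up for this sketch.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = [
    "Support book reservation online",
    "Support more search options",
    "Allow user to save the search result",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, trainer=BpeTrainer(special_tokens=["[UNK]"], vocab_size=200))

print(tokenizer.encode("Support book reservation").tokens)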
Here’s an example of what I’m trying to achieve. Given the following “feature requests” from hypothetical users:
- Support book reservation online
- Support more search options
- Support book reservation and search improvement in two phases
- Allow user to save the search result
- Support book reservation
- Reserve a librarian as supervisor
- Improve book collection
- Buy new books
- Add books written in other languages
- Repair damaged books
- Review all the books to find out those damaged
- …
I would like to extract clusters of sentences, something along the lines of:
- Support book reservation online
- Support book reservation and search improvement in two phases
- Support book reservation
- Support more search options
- Allow user to save the search result
- Improve book collection
- Buy new books
- Add books written in other languages
- Repair damaged books
- Review all the books to find out those damaged
- Referral program
- Reserve a librarian as supervisor
My current approach uses spaCy for sentence text cleanup and vectorization, and then sklearn.cluster.KMeans to cluster the vectors. Here’s a summary of the code I’m using:
"""
Basic text clustering of short sentences
Inspired by https://www.kaggle.com/joehalliwell/document-clustering
"""
import numpy as np
import spacy
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
nlp: spacy = spacy.load("en_core_web_lg")
def is_pronoun(lemma: str) -> bool:
return lemma == "-PRON-"
def process_text(text: str) -> str:
doc = nlp(text.lower())
result = []
for token in doc:
if token.text in nlp.Defaults.stop_words:
continue
if token.is_punct:
continue
if is_pronoun(token.lemma_):
continue
result.append(token.lemma_)
return " ".join(result)
def vectorize(text):
return nlp(text, disable=['parser', 'tagger', 'ner']).vector
def get_clusters(sentences, clusters=24):
X = normalize(np.stack(vectorize(process_text(t)) for t in sentences))
kmeans = KMeans(n_clusters=clusters, max_iter=250, init='kmeans++', random_state=1)
kmeans.fit(X)
# Predictions are a number, in sentence order, where each number corresponds to a cluster
predictions = kmeans.predict(X)
return predictions
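To make the output format concrete, here is a small usage sketch (the sentence list and clusters=2 are made up for the example) that groups the sentences back by their predicted label:

from collections import defaultdict

# Made-up inputs for this sketch; the real call uses the full list of
# feature requests and a larger cluster count.
sample_sentences = [
    "Support book reservation online",
    "Support more search options",
    "Support book reservation",
    "Repair damaged books",
]

predictions = get_clusters(sample_sentences, clusters=2)

grouped = defaultdict(list)
for sentence, label in zip(sample_sentences, predictions):
    grouped[int(label)].append(sentence)

for label, members in sorted(grouped.items()):
    print(label, members)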
This approach works decently, but I’d like to evaluate a deep-learning model as well, based on fastai. I haven’t found anything specifically about clustering in the fastai book, and while I found a few posts here in the forums, I’m still not sure how to apply a language model to the clustering problem. What would the outputs of such a model be? How would I train it? A cluster number doesn’t encode any value by itself; it’s the “co-occurrence” of the same cluster number for related sentences that makes it a “good” value, if that makes sense.
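To clarify the kind of pipeline I’m picturing: keep the clustering itself unsupervised, but replace the spaCy vectors with sentence embeddings from a learned encoder, and run KMeans on those. The sketch below uses sentence-transformers only as a stand-in for whatever fastai-based encoder I would end up with (the model name and the cluster count are just examples):

# Sketch of the pipeline I'm imagining: a pretrained/fine-tuned encoder
# produces one embedding per sentence, and KMeans clusters the embeddings.
# sentence-transformers is a stand-in here; the model name is an example.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

sentences = [
    "Support book reservation online",
    "Support more search options",
    "Repair damaged books",
    "Review all the books to find out those damaged",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = normalize(encoder.encode(sentences))  # one vector per sentence

labels = KMeans(n_clusters=2, init="k-means++", random_state=1).fit_predict(X)
print(list(zip(labels.tolist(), sentences)))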
Any suggestions on how to proceed?
Thank you!