Unsupervised clustering of short sentences with fastai

Hi!

I am working on a proof-of-concept application to cluster short sentences, in English for now, but the idea is to eventually extend the model to be language-independent. As a fallback, I’m thinking of a specialized solution for some languages and a separate, generic solution for the others, perhaps using subword tokenization.

Here’s an example of what I’m trying to achieve. Given the following “feature requests” from hypothetical users:

  • Support book reservation online
  • Support more search options
  • Support book reservation and search improvement in two phases
  • Allow user to save the search result
  • Support book reservation
  • Reserve a librarian as supervisor
  • Improve book collection
  • Buy new books
  • Add books written in other languages
  • Repair damaged books
  • Review all the books to find out those damaged

I would like to extract clusters of sentences, something along the lines of:

  • Support book reservation online
  • Support book reservation and search improvement in two phases
  • Support book reservation

  • Support more search options
  • Allow user to save the search result

  • Improve book collection
  • Buy new books
  • Add books written in other languages

  • Repair damaged books
  • Review all the books to find out those damaged

  • Referral program

  • Reserve a librarian as supervisor

My current approach uses spaCy for sentence cleanup and vectorization, and then sklearn.cluster.KMeans to cluster the resulting vectors. Here’s a summary of the code I’m using:

"""
Basic text clustering of short sentences
Inspired by https://www.kaggle.com/joehalliwell/document-clustering
"""

import numpy as np
import spacy
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Requires the model to be installed: python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

# spaCy 2.x lemmatizes every pronoun to "-PRON-"; spaCy 3.x no longer does this
def is_pronoun(lemma: str) -> bool:
    return lemma == "-PRON-"

def process_text(text: str) -> str:
    doc = nlp(text.lower())
    result = []
    for token in doc:
        if token.text in nlp.Defaults.stop_words:
            continue
        if token.is_punct:
            continue
        if is_pronoun(token.lemma_):
            continue
        result.append(token.lemma_)
    return " ".join(result)

def vectorize(text):
    return nlp(text, disable=['parser', 'tagger', 'ner']).vector

def get_clusters(sentences, clusters=24):
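    # n_clusters must not exceed the number of sentences
    # np.stack needs a sequence (not a generator) of equally shaped vectors
    X = normalize(np.stack([vectorize(process_text(t)) for t in sentences]))
    # 'k-means++' is the valid name of scikit-learn's default seeding strategy
    kmeans = KMeans(n_clusters=clusters, max_iter=250, init='k-means++', random_state=1)
    kmeans.fit(X)
    # Predictions are integers, in sentence order; each integer identifies a cluster
    predictions = kmeans.predict(X)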
    return predictions
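
To read the result, the predictions can be turned back into groups of sentences. This is only a minimal sketch: the helper name group_by_cluster and the labels in the example are made up for illustration, they are not output from the code above.

```python
from collections import defaultdict


def group_by_cluster(sentences, predictions):
    """Map each cluster number to the sentences assigned to it."""
    groups = defaultdict(list)
    for sentence, label in zip(sentences, predictions):
        # int() also accepts the NumPy integers returned by KMeans
        groups[int(label)].append(sentence)
    return dict(groups)


# Made-up labels, only to show the shape of the result
example = group_by_cluster(
    ["Buy new books", "Repair damaged books", "Improve book collection"],
    [0, 1, 0],
)
```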

This approach works decently, but I’d like to evaluate a deep learning model as well, based on fastai. I haven’t found anything specifically about clustering in the fastai book. I found a few posts here in the forums, but I’m not sure how to apply a language model to the clustering problem. What would the outputs of such a model be? How would I train it? A cluster number doesn’t encode any value in itself; it’s just the “co-occurrence” of the same cluster number for related sentences that makes it a “good” value, if that makes sense.
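
To make that last point concrete: a score that only looks at which sentences end up together, such as scikit-learn’s adjusted Rand index, does not change when the cluster numbers are renamed. A small sketch with made-up labels for five sentences:

```python
from sklearn.metrics import adjusted_rand_score

# Same grouping of five sentences, different cluster numbers
run_a = [0, 0, 1, 1, 2]
run_b = [2, 2, 0, 0, 1]
# A genuinely different grouping: the third sentence joined the first cluster
run_c = [0, 0, 0, 1, 2]

same = adjusted_rand_score(run_a, run_b)       # identical groupings
different = adjusted_rand_score(run_a, run_c)  # groupings disagree
```

Here same comes out as 1.0 because only the co-occurrence pattern matters, while different is lower.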

Any suggestions on how to proceed?

Thank you!

Hey Cosimo,

Unfortunately fastai does not provide any out-of-the-box support for clustering. You could try training a language model and then using the embeddings for encoding sentences as vectors if you want to experiment with that approach.
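
A minimal sketch of the idea, independent of fastai: once a language model is trained, each token has an embedding vector, and one simple option (an assumption here, not the only one) is to average the token vectors of a sentence. The vocabulary, the embedding matrix and the helper name sentence_vector below are hypothetical toy values standing in for whatever the trained model provides.

```python
import numpy as np

# Hypothetical toy vocabulary and embedding matrix (one row per token)
vocab = {"support": 0, "book": 1, "reservation": 2, "search": 3}
embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])


def sentence_vector(sentence, vocab, embeddings):
    """Average the embeddings of the known tokens; zeros if none is known."""
    ids = [vocab[tok] for tok in sentence.lower().split() if tok in vocab]
    if not ids:
        return np.zeros(embeddings.shape[1])
    return embeddings[ids].mean(axis=0)


vec = sentence_vector("Support book reservation", vocab, embeddings)
```

The resulting vectors can then go through the same normalize and clustering steps as the spaCy vectors.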

If you want a more effective solution for clustering generic text, then in my opinion a good approach would be to encode the sentences using sentence-transformers, and then perform clustering using either hdbscan for small-scale data or FAISS for larger-scale data.
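
A rough sketch of that pipeline, under a few assumptions: the model name "all-MiniLM-L6-v2" is just one pretrained sentence-transformers model picked for illustration, and the clustering step uses scikit-learn’s AgglomerativeClustering with a distance threshold as a stand-in for hdbscan, swapped in only to keep the dependencies small. Like hdbscan, it does not need the number of clusters up front. The toy vectors stand in for real embeddings.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize


def embed(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode sentences with sentence-transformers (imported lazily).

    The model name is an assumption; other pretrained models can be used.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name).encode(sentences)


def cluster_vectors(vectors, max_distance=1.0):
    """Cluster row vectors without fixing the number of clusters.

    Rows are L2-normalized first, so a Euclidean distance of 1.0
    corresponds to a cosine similarity of 0.5.
    """
    X = normalize(np.asarray(vectors, dtype=float))
    model = AgglomerativeClustering(
        n_clusters=None, distance_threshold=max_distance, linkage="average"
    )
    return model.fit_predict(X)


# Toy vectors standing in for real sentence embeddings: two tight groups
toy = [[1.0, 0.0, 0.0], [0.99, 0.05, 0.0], [0.98, 0.0, 0.05],
       [0.0, 1.0, 0.0], [0.05, 0.99, 0.0], [0.0, 0.98, 0.05]]
labels = cluster_vectors(toy)
```

With real data the call would be cluster_vectors(embed(sentences)); max_distance is a tuning knob, and lower values give more, tighter clusters.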

Hi Oren,

thanks for the useful pointers.
I’ve had some time to try what you suggested and it seems to work well.
I will try to set up a comprehensive benchmark to evaluate the performance of each variant.


Cosimo
