Hi!
I am working on a proof-of-concept application to cluster short sentences, in English for now, but the idea is to be able to expand the model to be language-independent. As a fallback, I’m weighing a specialized solution per language against a single distinct solution for all the other languages, perhaps based on subword tokenization.
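Just to make the subword-tokenization idea concrete, this is roughly what I have in mind, sketched with the Hugging Face tokenizers library purely for illustration (the tiny corpus and the vocab size are made up):

# Illustration only: train a small BPE (subword) tokenizer on a handful of
# sentences and look at the pieces it produces. The corpus and vocab_size
# are made up for this sketch.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

corpus = [
    "Support book reservation online",
    "Support more search options",
    "Allow user to save the search result",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(corpus, trainer=BpeTrainer(special_tokens=["[UNK]"], vocab_size=200))

print(tokenizer.encode("Support book reservation").tokens)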
Here’s an example of what I’m trying to achieve. Given the following “feature requests” from hypothetical users:
- Support book reservation online
- Support more search options
- Support book reservation and search improvement in two phases
- Allow user to save the search result
- Support book reservation
- Reserve a librarian as supervisor
- Improve book collection
- Buy new books
- Add books written in other languages
- Repair damaged books
- Review all the books to find out those damaged
- …
I would like to extract clusters of sentences, something along the lines of:
- Support book reservation online
- Support book reservation and search improvement in two phases
- Support book reservation
- Support more search options
- Allow user to save the search result
- Improve book collection
- Buy new books
- Add books written in other languages
- Repair damaged books
- Review all the books to find out those damaged
- Referral program
- Reserve a librarian as supervisor
My current approach uses spaCy for sentence text cleanup and vectorization, and then sklearn.cluster.KMeans to cluster the vectors. Here’s a summary of the code I’m using:
"""
Basic text clustering of short sentences
Inspired by https://www.kaggle.com/joehalliwell/document-clustering
"""
import numpy as np
import spacy
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans
nlp: spacy = spacy.load("en_core_web_lg")
def is_pronoun(lemma: str) -> bool:
return lemma == "-PRON-"
def process_text(text: str) -> str:
doc = nlp(text.lower())
result = []
for token in doc:
if token.text in nlp.Defaults.stop_words:
continue
if token.is_punct:
continue
if is_pronoun(token.lemma_):
continue
result.append(token.lemma_)
return " ".join(result)
def vectorize(text):
return nlp(text, disable=['parser', 'tagger', 'ner']).vector
def get_clusters(sentences, clusters=24):
X = normalize(np.stack(vectorize(process_text(t)) for t in sentences))
kmeans = KMeans(n_clusters=clusters, max_iter=250, init='kmeans++', random_state=1)
kmeans.fit(X)
# Predictions are a number, in sentence order, where each number corresponds to a cluster
predictions = kmeans.predict(X)
return predictions
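To make the output format concrete, here is a small usage sketch (the sentence list and clusters=2 are made up for the example) that groups the sentences back by their predicted label:

from collections import defaultdict

# Made-up inputs for this sketch; the real call uses the full list of
# feature requests and a larger cluster count.
sample_sentences = [
    "Support book reservation online",
    "Support more search options",
    "Support book reservation",
    "Repair damaged books",
]

predictions = get_clusters(sample_sentences, clusters=2)

grouped = defaultdict(list)
for sentence, label in zip(sample_sentences, predictions):
    grouped[int(label)].append(sentence)

for label, members in sorted(grouped.items()):
    print(label, members)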
This approach works decently, but I’d like to evaluate a deep-learning model as well, based on fastai. I haven’t found anything specifically about clustering in the fastai book, and while I found a few posts here in the forums, I’m still not sure how to apply a language model to the clustering problem. What would the outputs of such a model be? How would I train it? A cluster number doesn’t encode any value by itself; it’s the “co-occurrence” of the same cluster number for related sentences that makes it a “good” value, if that makes sense.
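To clarify the kind of pipeline I’m picturing: keep the clustering itself unsupervised, but replace the spaCy vectors with sentence embeddings from a learned encoder, and run KMeans on those. The sketch below uses sentence-transformers only as a stand-in for whatever fastai-based encoder I would end up with (the model name and the cluster count are just examples):

# Sketch of the pipeline I'm imagining: a pretrained/fine-tuned encoder
# produces one embedding per sentence, and KMeans clusters the embeddings.
# sentence-transformers is a stand-in here; the model name is an example.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

sentences = [
    "Support book reservation online",
    "Support more search options",
    "Repair damaged books",
    "Review all the books to find out those damaged",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = normalize(encoder.encode(sentences))  # one vector per sentence

labels = KMeans(n_clusters=2, init="k-means++", random_state=1).fit_predict(X)
print(list(zip(labels.tolist(), sentences)))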
Any suggestions on how to proceed?
Thank you!