Clustering categorical variables

botkop · April 7, 2020, 3:06pm

I’ve written a notebook that shows a way to train an autoencoder on categorical variables, and use the features of the encoder as the basis for a clustering algorithm.
Comments appreciated.
Thank you.

github.com/botkop/mnist-embedding/blob/master/notebooks/mnist-embedding-autoencoder.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering of categorical variables with an autoencoder, UMAP and HDBSCAN\n",
    "\n",
    "This notebook shows how to train an autoencoder on categorical variables, then use its inference for clustering.\n",
    "\n",
    "We use MNIST as our training dataset, and treat every pixel as a categorical variable. \n",
    "The labels of MNIST are only used to verify the correctness of the clustering.\n",
    "\n",
    "Steps taken:\n",
    "  * transform training data to integers\n",
    "  * create a TabularDataBunch (fastai) of the training data\n",
    "  * create an autoencoder with embeddings for all categorical variables, a few fully connected layers, and a bottleneck.\n",
    "  * train the autoencoder\n",
    "  * remove the decoder from the autoencoder, and keep the encoder\n",
    "  * push the training set through the encoder, and collect the encoded features\n",

This file has been truncated.

waydegg · April 11, 2020, 4:24am

Cool notebook! Just curious (and maybe there’s an obvious answer I’m missing here): why do you train an autoencoder and then do further dimensionality reduction with t-SNE? Why not go straight to t-SNE, or have the autoencoder output 2-dimensional encodings and cluster those directly? Pros/cons?

botkop · April 11, 2020, 5:15am

Hello,
Thank you for your question.

t-SNE is only used for visualization in the notebook.
UMAP is used for dimensionality reduction, and HDBSCAN is used for clustering.
I found that UMAP does an incredible job of preparing the dataset for density-based clustering, much better than the autoencoder alone, or t-SNE for that matter.

I’m using an autoencoder because fastai lets me build one for categorical variables, which is the main point of this notebook: clustering of categorical variables.
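
For readers wondering what "treat every pixel as a categorical variable" can look like in practice, here is a minimal plain-Python sketch of the integer-encoding step. It is an illustration under assumptions, not code from the notebook: the function name `encode_columns` and the toy rows are hypothetical. Each column gets its own vocabulary, and the per-column cardinality is the number of entries an embedding table for that column would need.

```python
from typing import Hashable, List, Sequence, Tuple


def encode_columns(
    rows: Sequence[Sequence[Hashable]],
) -> Tuple[List[List[int]], List[int]]:
    """Map each column's distinct values to integer codes 0..k-1.

    Returns (encoded_rows, cardinalities), where cardinalities[j] is the
    number of distinct values seen in column j.
    Hypothetical helper for illustration; not taken from the notebook.
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    vocabs = [{} for _ in range(n_cols)]  # one value -> code dict per column
    encoded = []
    for row in rows:
        if len(row) != n_cols:
            raise ValueError("all rows must have the same number of columns")
        codes = []
        for vocab, value in zip(vocabs, row):
            # setdefault hands out the next free code the first time a value appears
            codes.append(vocab.setdefault(value, len(vocab)))
        encoded.append(codes)
    return encoded, [len(vocab) for vocab in vocabs]


# Toy stand-in for flattened images: 3 "pixels" per row (assumed data)
rows = [[0, 255, 17], [0, 0, 17], [128, 255, 17]]
encoded, cardinalities = encode_columns(rows)
print(encoded)        # integer codes per row
print(cardinalities)  # distinct values per column
```

Codes are assigned in order of first appearance, so they carry no ordinal meaning, which is the point of treating the values as categories rather than intensities.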

This is just a proof of concept. For work I need a similar approach with many categorical variables, so I thought I’d try it on MNIST first: it comes with labels, so I can verify the correctness of the clustering. I was actually very surprised that this works (97% accuracy with minimal training).
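
One common way to check a clustering against known labels, sketched below, is to name each cluster after its most frequent true label and count how many points agree. This is an assumption about how such a score can be computed, not necessarily how the 97% figure was obtained; `cluster_accuracy` and the toy data are hypothetical. HDBSCAN marks noise points with the label -1, and this sketch counts them as errors.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence, Tuple

NOISE = -1  # HDBSCAN assigns -1 to points it leaves unclustered


def cluster_accuracy(
    cluster_ids: Sequence[int],
    true_labels: Sequence[Hashable],
) -> Tuple[float, Dict[int, Hashable]]:
    """Score a clustering by majority-label mapping.

    Each cluster is mapped to its most common true label; a point counts
    as correct when its cluster's mapped label equals its own label.
    Noise points are counted as incorrect.
    """
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have equal length")
    if not true_labels:
        return 0.0, {}

    # Tally the true labels that fall inside each (non-noise) cluster
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        if cluster != NOISE:
            votes[cluster][label] += 1

    # Name every cluster after its majority label
    mapping = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

    correct = sum(
        1
        for cluster, label in zip(cluster_ids, true_labels)
        if cluster != NOISE and mapping[cluster] == label
    )
    return correct / len(true_labels), mapping


# Toy example (assumed data): two clusters plus one noise point
clusters = [0, 0, 0, 1, 1, -1]
labels = [7, 7, 3, 3, 3, 3]
accuracy, mapping = cluster_accuracy(clusters, labels)
print(f"accuracy={accuracy:.3f}, mapping={mapping}")
```

Note that this score can be inflated by over-segmentation (many tiny, pure clusters), so it is worth reading alongside the number of clusters found.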

waydegg · April 11, 2020, 5:46am

I see, thanks for the answer. I’m planning a similar approach with tabular data, and then, for each category (as in your example, one category per cluster), finding the ‘key features’ that most influenced the autoencoder in assigning each item to that category (basically feature importance). I’ll definitely be using your notebook as a resource.

botkop · April 11, 2020, 5:56am

Nice to hear.