I’ve written a notebook that shows a way to train an autoencoder on categorical variables, and use the features of the encoder as the basis for a clustering algorithm.

Comments appreciated.

Thank you.

Cool notebook! Just curious (and maybe there’s an obvious answer I’m missing here): why do you train an autoencoder and then do further dimensionality reduction with t-SNE? Why not go straight to t-SNE, or have your autoencoder output a 2-dimensional encoding and *then* do clustering? Pros/cons?

Hello,

Thank you for your question.

t-SNE is only used for visualization in the notebook.

UMAP is used for dimensionality reduction, and HDBSCAN is used for clustering.

I found that UMAP does an incredible job of preparing the dataset for density-based clustering, much better than the autoencoder alone can do, or t-SNE for that matter.

I’m using an autoencoder because fastai lets me build one for categorical variables, which is the main point of this notebook: clustering of categorical variables.
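For readers curious what an autoencoder over categorical variables looks like: fastai builds this on top of entity embeddings per column. Here is a minimal plain-PyTorch sketch of the same idea; all names, layer sizes, and cardinalities are my own illustration, not from the notebook:

```python
import torch
import torch.nn as nn

class CatAutoEncoder(nn.Module):
    """Each categorical column gets an embedding; the concatenated
    embeddings are compressed to a small latent vector, and the decoder
    reconstructs one logit per category per column."""
    def __init__(self, cardinalities, emb_dim=8, latent_dim=4):
        super().__init__()
        self.cardinalities = cardinalities
        self.embs = nn.ModuleList(nn.Embedding(c, emb_dim) for c in cardinalities)
        in_dim = emb_dim * len(cardinalities)
        self.encoder = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, sum(cardinalities)))

    def forward(self, x):  # x: (batch, n_cols) of integer category ids
        e = torch.cat([emb(x[:, i]) for i, emb in enumerate(self.embs)], dim=1)
        z = self.encoder(e)        # latent features, later fed to UMAP/HDBSCAN
        logits = self.decoder(z)   # per-column reconstruction logits
        return z, logits.split(self.cardinalities, dim=1)

# Reconstruction loss: cross-entropy per column against the input ids.
model = CatAutoEncoder([10, 5, 3])
x = torch.randint(0, 3, (4, 3))
z, logits = model(x)
loss = sum(nn.functional.cross_entropy(l, x[:, i]) for i, l in enumerate(logits))
```

After training, only the encoder output `z` is kept as the feature vector for clustering.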

This is just a proof of concept. For work I need a similar approach with many categorical variables, so I thought I’d try it first on MNIST, because it provides labels that let me verify the correctness of the clustering. I was actually very surprised that this works (97% accuracy with minimal training).
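One way to score clustering correctness against known labels, as described above, is majority voting: assign each cluster its most common true label and measure agreement. This is my own sketch of such a check, not necessarily how the notebook computes its 97%:

```python
import numpy as np

def cluster_accuracy(true_labels, cluster_ids):
    """Map each cluster to its majority true label, then score agreement.
    Noise points (cluster id -1, as HDBSCAN emits) count as errors."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    correct = 0
    for c in np.unique(cluster_ids):
        if c == -1:
            continue
        members = true_labels[cluster_ids == c]
        correct += np.bincount(members).max()  # size of the majority class
    return correct / len(true_labels)

# Two clusters, one point assigned to the "wrong" cluster -> 5/6 correct.
print(cluster_accuracy([0, 0, 0, 1, 1, 1], [7, 7, 7, 2, 2, 7]))
# prints 0.8333333333333334
```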

I see, thanks for the answer. I’m planning to take a similar approach but with tabular data, and then for each cluster (like your example, one category per cluster) find the ‘key features’ that best told the autoencoder to assign each item to that cluster (basically feature importance). I’ll definitely be using your notebook as a resource.
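One simple way to get that kind of feature importance (a sketch under my own assumptions, not from the notebook): treat the cluster ids as labels, fit a classifier on the original features, and read its importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy stand-in: 200 rows, 6 features; feature 0 alone determines the cluster.
X = rng.normal(size=(200, 6))
clusters = (X[:, 0] > 0).astype(int)

# Fit a classifier to predict cluster membership from the raw features,
# then use its impurity-based importances as per-feature relevance scores.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, clusters)
importances = clf.feature_importances_
```

For per-cluster key features, the same idea works one-vs-rest: fit a binary classifier per cluster (this cluster vs. everything else) and inspect each model’s importances separately.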

Nice to hear.