Identifying which class/category labels are most similar and merging them?

Hi everyone! This class and forum has been great.

I have a specific ML algorithm question. I have a dataset with hundreds of classes/categories and a test set where those categories are unknown (or are new categories not in the training set).

I am trying to find a way to encode the categories so that I can then come up with “predicted” categories for each entry in the test set and use that prediction to help increase accuracy.

Unfortunately, there are so many classes/categories in my training set that I overfit. Is there a good way to identify which classes are most similar so that I can merge similar ones together? Or is there another good way to consolidate so many classes/categories?

Please let me know and thanks!

Hey Charlie,

I’m not sure what your categories represent, but do take a look at lessons 9 & 10 from part 2, more specifically the approach based on the DeViSE paper (http://papers.nips.cc/paper/5204-devise-a-deep-visual-semantic-embedding-model.pdf).
The method should allow you to do a lot of cool stuff, including merging similar categories (although, spoiler alert :slight_smile:, you might not want to do that once you switch from the discrete classes/categories space to continuous embedding-space representations).
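
To make the idea a bit more concrete, here is a rough sketch of the DeViSE ingredient applied to label similarity: represent each label by a pre-trained word vector and compare labels by cosine similarity. `label_vectors` is a hypothetical dict mapping each label to its vector (e.g. built from word2vec or fastText), not something from the lesson notebooks:

```python
# Sketch: find the most similar label pairs given word vectors for each label.
# `label_vectors` is a hypothetical {label: np.ndarray} dict you would build
# yourself from pre-trained word vectors.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar_labels(label_vectors, top_k=5):
    labels = list(label_vectors)
    mat = np.stack([label_vectors[l] for l in labels])
    sims = cosine_similarity(mat)        # (n_labels, n_labels) similarity matrix
    np.fill_diagonal(sims, -1.0)         # ignore self-similarity
    pairs = []
    for i, label in enumerate(labels):
        j = sims[i].argmax()             # closest other label
        pairs.append((label, labels[j], float(sims[i, j])))
    return sorted(pairs, key=lambda p: -p[2])[:top_k]
```

The pairs at the top of that list would be the natural merge candidates, if you still wanted to merge anything after moving to the embedding representation.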

It would actually be quite interesting to see the results of pseudo-labeling in this situation, as well as a comparison between using the raw predicted embedding values as pseudo-labels vs. selecting the nearest neighbor of each prediction among your pre-existing training embeddings and using that as the pseudo-label.
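
For illustration only, here is a minimal sketch of those two pseudo-label variants, assuming you already have a matrix of class embeddings from training and a matrix of predicted embeddings for the test set (both hypothetical numpy arrays here):

```python
# Sketch of the two pseudo-labeling variants discussed above.
# `train_embs`: (n_classes, d) embeddings of the known categories.
# `test_preds`: (n_test, d) embeddings predicted for the test items.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def pseudo_labels(train_embs, test_preds, snap_to_nearest=True):
    if not snap_to_nearest:
        # Variant 1: use the raw predicted embedding as the pseudo-label target.
        return test_preds
    # Variant 2: replace each prediction with its nearest existing class
    # embedding before using it as a pseudo-label.
    nn = NearestNeighbors(n_neighbors=1).fit(train_embs)
    idx = nn.kneighbors(test_preds, return_distance=False)[:, 0]
    return train_embs[idx]
```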

Did anyone try something like this already?

I’d also suggest the approach from lesson 14: creating vector representations of the categories instead of treating them as one-hot encoded levels.
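
Something along these lines, sketched in PyTorch with made-up layer names and sizes (this is just an illustration of the learned-embedding idea, not the lesson 14 notebook):

```python
# Sketch: learn a dense embedding per category instead of one-hot levels.
# The sizes below are placeholders.
import torch
import torch.nn as nn

n_categories, emb_dim, n_out = 500, 32, 10

class CatEmbeddingModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(n_categories, emb_dim)   # one learned row per category
        self.head = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_out))

    def forward(self, cat_idx):
        return self.head(self.emb(cat_idx))

model = CatEmbeddingModel()
out = model(torch.tensor([3, 17, 42]))   # a batch of category indices
```

After training, the rows of `model.emb.weight` are the learned category vectors, so you can compare distances between them to see which categories the model treats as similar.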