I have a specific ML algorithm question. I have a dataset with hundreds of classes/categories and a test set where those categories are unknown (or are new categories not in the training set).
I am trying to find a way to encode the categories so that I can then produce "predicted" categories for each entry in the test set and use those predictions to improve accuracy.
Unfortunately, there are too many classes/categories in my training set, so my model overfits. Is there a good way to identify which classes are most similar so that I can merge them? Or is there another good way to consolidate this many classes/categories?
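To make the question concrete, here is a hypothetical sketch of the kind of merging I have in mind: treat classes the model frequently confuses as similar, and hierarchically cluster them into a smaller number of super-classes. The confusion matrix here is random toy data standing in for one computed on a held-out set.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy stand-in for a confusion matrix from a held-out set; in practice,
# classes the model confuses often would be treated as similar.
rng = np.random.default_rng(0)
n_classes = 6
conf = rng.random((n_classes, n_classes))
conf = (conf + conf.T) / 2            # symmetrise the similarity
np.fill_diagonal(conf, conf.max())    # a class is maximally similar to itself

# Convert similarity to distance and cluster hierarchically
dist = conf.max() - conf
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist, checks=False)
Z = linkage(condensed, method="average")
merged = fcluster(Z, t=3, criterion="maxclust")  # merge into <= 3 super-classes
print(merged)  # one super-class label per original class
```

The merged labels could then replace the original ones for training a coarser classifier.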
I'm not sure what your categories represent, but take a look at lessons 9 and 10 from part 2, specifically the approach based on the DeViSE paper (http://papers.nips.cc/paper/5204-devise-a-deep-visual-semantic-embedding-model.pdf).
The method should let you do a lot of cool stuff, including merging similar categories (although, spoiler alert, you might no longer want to once you switch from the discrete class/category space to continuous embedding-space representations).
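The core DeViSE idea can be sketched very minimally: instead of a softmax over hundreds of classes, the model regresses to a continuous embedding of the class name (a word vector), and prediction becomes a nearest-neighbour lookup in embedding space. The embeddings below are random toy stand-ins for pretrained word vectors (e.g. word2vec/GloVe), and the class names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim = 4
class_names = ["cat", "dog", "car", "truck"]
# Hypothetical stand-in for pretrained word vectors of the class names
class_emb = {c: rng.normal(size=emb_dim) for c in class_names}

def predict_class(model_output: np.ndarray) -> str:
    """Snap a predicted embedding to the nearest class embedding (cosine)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(class_names, key=lambda c: cos(model_output, class_emb[c]))

# A model trained against embedding targets would output a vector like this:
noisy_pred = class_emb["dog"] + 0.1 * rng.normal(size=emb_dim)
print(predict_class(noisy_pred))
```

One nice property: unseen categories can be handled at test time just by adding their word vectors to `class_emb`, with no retraining.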
It would actually be quite interesting to see the results of pseudo-labeling in this situation, and to compare two variants: using the raw predicted embedding values as pseudo-labels, versus selecting the nearest neighbour of each prediction from your pre-existing training embeddings and using that as the pseudo-label.