1000s of categories

Hello everyone,
I’m trying to solve a classification problem, but it includes more than 1000 different classification options.

I’m trying to approach this like a cats/dogs problem or a sentiment analysis problem, but instead of two classes I have 1000.
data.class = “Class 1” … “Class 1000”

Is this the correct way to do it?
It is taking a long time to get above 50% accuracy.

Is there any other way to do it?

Basically, yes it’s the same. Here are some pointers.

Re 50%. Often we use a top-3 or top-5 metric rather than top-1. It depends on the real life use case. You often find another 10% in the next few results.
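
For example, a rough top-k check in plain PyTorch (a minimal sketch; the tensor names are just illustrative) could look like this:

import torch

def topk_accuracy(logits, targets, k=3):
    # logits: (batch, n_classes) raw model outputs; targets: (batch,) true class indices
    topk = logits.topk(k, dim=1).indices               # the k highest-scoring classes per row
    hits = (topk == targets.unsqueeze(1)).any(dim=1)   # True if the true label is among them
    return hits.float().mean().item()

# e.g. report topk_accuracy(preds, y, k=5) alongside the usual top-1 accuracy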

What network architecture are you using? You probably need to try something big like dn201, res 152 or wrn. Try adding an xtra_fc=[x] where x is higher than your numcats, or as large as the last model layer. And if your use case allows, ensemble multiple models and do cross-validation.
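
The xtra_fc idea is just extra fully-connected layers in the classifier head. A generic PyTorch sketch of the same thing on a torchvision backbone (sizes here are illustrative, not a recommendation):

import torch.nn as nn
from torchvision import models

n_classes = 1000                      # illustrative: your number of categories
hidden = 2048                         # extra FC layer, about as wide as the old final layer

backbone = models.resnet152(pretrained=True)
in_features = backbone.fc.in_features
# replace the single final layer with a head that has one extra fully-connected layer
backbone.fc = nn.Sequential(
    nn.Linear(in_features, hidden),
    nn.ReLU(inplace=True),
    nn.Dropout(0.5),
    nn.Linear(hidden, n_classes),
)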

Be very careful with your train/val/test stratification. With many cats you can find poor representation after splitting. You need to address the weighting in some way, eg under/over-sampling, data augmentation, weighting the loss function, or post-training probability adjustment.
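
As a sketch of the split/weighting piece (assuming a labels array and n_classes already exist; inverse-frequency weighting is just one of the options above):

import numpy as np
import torch
from sklearn.model_selection import train_test_split

# stratified split keeps every class represented in train and val
# (each class needs at least 2 examples for this to work)
train_idx, val_idx = train_test_split(
    np.arange(len(labels)), test_size=0.2, stratify=labels, random_state=42)

# inverse-frequency class weights for the loss function
counts = np.bincount(labels, minlength=n_classes)
weights = torch.tensor(counts.sum() / (n_classes * np.maximum(counts, 1)), dtype=torch.float)
criterion = torch.nn.CrossEntropyLoss(weight=weights)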

If it is taking a long time, start with a much smaller sample. This is a quick way to equalise the sampling, too.
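
For instance, with a pandas DataFrame df that has a label column (names are illustrative), you can take a small, roughly balanced sample per class:

import pandas as pd

n_per_class = 50
small = (df.groupby('label', group_keys=False)
           .apply(lambda g: g.sample(min(len(g), n_per_class), random_state=42)))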

Also take a close look at your confusion matrix, you’ll see where the model is struggling. Sometimes it is worth running a separate model to preclassify troublesome cats. Some people go so far as to classify into coarse grained cats (eg water animals, land animals, air animals) before running more fine grained models, but it’s never worked for me. Use cases matter a lot with fine grained problems.
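
A quick way to pull the most-confused pairs out of the matrix (assuming y_true, y_pred and a classes list already exist):

import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred)     # rows = true class, columns = predicted class
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)             # ignore correct predictions

# print the ten most frequent (true, predicted) confusions
pairs = np.dstack(np.unravel_index(np.argsort(off_diag, axis=None)[::-1], cm.shape))[0]
for true_c, pred_c in pairs[:10]:
    print(classes[true_c], '->', classes[pred_c], off_diag[true_c, pred_c])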

Taking these approaches can together often reduce your error by half or more.

I’m trying this with text NLP.
I’m using the Wikipedia and sentiment analysis example and expanding it to include multiple categories.

I think this is what I’m going to try, a two-step process (sketched below):

  1. Model 1 for tier-1 categories
  2. Smaller models (trained on subsets) that will point to the secondary categories.
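
At inference time that could look something like this (all names here are hypothetical placeholders, not trained models):

from typing import Callable, Dict

def make_two_stage(predict_tier1: Callable[[str], str],
                   sub_models: Dict[str, Callable[[str], str]]):
    def classify(text: str):
        tier1 = predict_tier1(text)      # step 1: coarse, tier-1 category
        fine = sub_models[tier1](text)   # step 2: model trained only on that subset
        return tier1, fine
    return classify

# dummy callables standing in for the two trained models
classify = make_two_stage(lambda t: 'animals',
                          {'animals': lambda t: 'cat', 'plants': lambda t: 'oak'})
print(classify('a small furry pet'))     # ('animals', 'cat')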

Any update on how this went? I am also interested in performing classification with ~1200 labels…ideally as a multi-label classification.

I’m using the ULMFit model to do this.

The model with 1000s of categories was unable to get more than 43% accuracy, with a large loss on both validation and training.
I decided to move to fewer categories (50) and got to 75% accuracy, but still with erratic behavior between validation and training.
Not a single prediction on the 50 categories reached more than 50% confidence.
When I was trying to predict, I was getting the same results in the same order every time.

I think the reason for that is that the model we have is set up for sentiment analysis (0, 1).

c=55 # classes

m = get_rnn_classifier(bptt, 20 * 70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

I think the 50 and the c will need to be modified for it to converge correctly.
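
For example (purely a guess on my part, not something I have tested), widening the hidden layer of the head:

em_sz = 400                     # embedding size from the language model
c = 55                          # number of classes
layers = [em_sz * 3, 512, c]    # head: input, wider hidden layer (was 50), output
drops = [0.4, 0.1]              # one dropout value per non-input layer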

Any suggestions???

Any updates on your work? I am trying to use ULMFit on a data set with around 400 classes.

I abandoned the quest.
Did you figure this out?

No, not yet. I am fairly new to this and I jumped into ULMFit straight away. Could you please take a look at this thread and give me any advice?

“Try adding an xtra_fc=[x] where x is higher than your numcats, or as large as the last model layer.”

Could you talk a little bit more about that line?