Classifying a large number of classes (~5000)

I am trying to classify images into around 5000 classes. I am using fastai v1 with resnet50.
The training loss is going down, but the validation loss is all over the place and the overall error rate is 99-100%.

Is resnet50 (or even resnet101 or resnet152) the right architecture for this kind of classification?
As I understand it, resnets are pre-trained on 1000 classes, while I need 5000 outputs in the last layer.
Do I need to add more FC layers, or is there an alternative network I should consider? (A Google search did not help much.)

What’s the size of your training and validation sets? How many examples are in each class?

@bjack913 thank you for your reply.
I split the data 80/20 into training/validation. I have around 40 images per class, for a total of around 200k images.

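    from fastai.vision import *  # fastai v1

    # Fairly aggressive augmentation, always applied (p_affine=1, p_lighting=1)
    tfms = get_transforms(max_rotate=20, max_zoom=1.3, max_lighting=0.4,
                          max_warp=0.4, p_affine=1., p_lighting=1.)
    # file_names and labels are parallel lists; 20% is held out at random for validation
    data = ImageDataBunch.from_lists(model_path, fnames=file_names, labels=labels,
                                     valid_pct=0.2, ds_tfms=tfms, size=224,
                                     num_workers=8).normalize(imagenet_stats)
    learn = create_cnn(data, models.resnet50, metrics=error_rate)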
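
With only about 40 images per class, a purely random 80/20 split can leave some classes with very few validation (or training) examples. Below is a minimal sketch of a per-class split, assuming `labels` is a list parallel to `file_names` as in the snippet above. `stratified_split` is a hypothetical helper, not a fastai function, and the toy data only stands in for the real lists.

```python
import random
from collections import defaultdict

def stratified_split(labels, valid_pct=0.2, seed=42):
    """Return (train_idx, valid_idx) so each class keeps roughly valid_pct in validation."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, valid_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Hold out at least one image per class but never the whole class
        # (a class with a single image stays entirely in training).
        n_valid = max(1, int(round(len(idxs) * valid_pct)))
        n_valid = min(n_valid, len(idxs) - 1)
        valid_idx.extend(idxs[:n_valid])
        train_idx.extend(idxs[n_valid:])
    return sorted(train_idx), sorted(valid_idx)

# Toy data: 3 classes with 40 images each (stand-in for the real label list).
labels = [c for c in ("a", "b", "c") for _ in range(40)]
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

The resulting validation indices could then be handed to fastai v1's data block API (for example via `split_by_idx`) instead of relying on `valid_pct`.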

Thanks for the information. How many epochs have you trained for? Keep in mind that if you randomly guessed a class out of 5000 classes, then you’d expect an accuracy of 0.02% (or an error rate of 99.98%). If your model has an error rate of <99.98% then it is in fact learning and may need to be trained for longer.
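
The baseline arithmetic can be written out as a small sketch. It assumes the classes are roughly balanced, so that uniform random guessing is right 1 time in `n_classes`. `times_chance` is a hypothetical helper and the 0.999 error rate is only an illustrative value.

```python
n_classes = 5000

chance_accuracy = 1 / n_classes      # probability of a correct random guess
chance_error = 1 - chance_accuracy   # error rate of random guessing

def times_chance(error_rate, n_classes):
    """How many times better than random guessing a given error rate is."""
    return (1 - error_rate) / (1 / n_classes)

print(f"chance accuracy: {chance_accuracy:.2%}")
print(f"chance error:    {chance_error:.2%}")
# Illustrative value only: an error rate of 0.999 looks terrible but is above chance.
print(f"error 0.999 is {times_chance(0.999, n_classes):.1f}x chance")
```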

@bjack913 thank you for your continued support and help.
Here’s what I am doing. As you pointed out, the training loss goes down from one cycle to the next, but the validation loss stays very high and the error rate stays around 99.9%.

    learn.load(f'{model_path}/stage-2_unfreezed')
    learn.unfreeze()
    for i in range(10):
        learn.fit_one_cycle(20, max_lr=slice(1e-6, 1e-3))
        learn.save(f'{model_path}/stage-3-epoch_{i}_unfreezed')

epoch train_loss valid_loss error_rate
1 2.887602 13.034577 0.999333
2 2.922368 13.079821 0.999333
3 3.048307 13.346167 0.999281
4 3.147970 13.503422 0.999256
5 3.242375 13.832283 0.999384
6 3.228010 13.872521 0.999230
7 3.215050 13.705756 0.999102
8 3.147623 13.871474 0.999307
9 3.179110 13.856788 0.999281
10 3.086861 13.711730 0.999050
11 3.123843 13.350389 0.999230
12 3.106609 13.591689 0.999204
13 2.992076 13.637679 0.999358
14 2.918751 13.508119 0.999230
15 2.928656 13.523503 0.999256
16 2.845991 13.388854 0.999102
17 2.854365 13.312586 0.999153
18 2.867545 13.408681 0.999179
19 2.828079 13.210934 0.999204
20 2.767628 13.331071 0.999230
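
One more reference point for reading this table, as a hedged sketch: assuming the reported losses are ordinary cross-entropy with the natural logarithm, a model that assigns the same probability to every class would score ln(5000) on any dataset. The final-epoch numbers are copied from the table above.

```python
import math

n_classes = 5000
# Cross-entropy of predicting probability 1/n_classes for every class.
uniform_loss = math.log(n_classes)

# Final-epoch values from the table above.
train_loss, valid_loss = 2.767628, 13.331071

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"train_loss below uniform-guess loss: {train_loss < uniform_loss}")
print(f"valid_loss above uniform-guess loss: {valid_loss > uniform_loss}")
```

A validation loss well above the uniform-guess value means the model is confidently wrong on validation images while fitting the training images. That pattern may point to overfitting or to a mismatch between how the training and validation data are labelled or processed, rather than to a model that has learned nothing.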

Where are the weights you’re loading coming from?

If you comment out that line and train again from the pretrained ImageNet weights with the body frozen, what does the output look like?

———————————

Have you tried inspecting your batches to make sure your data is getting loaded and decoded properly?

Imagine a hypothetical scenario where you tried to train a network on random noise: there’s nothing for it to learn, because the static carries no information. The results might look a lot like what you posted above, where the model is basically just randomly guessing.
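
A quick way to rule this out is to look at a few decoded batches (in fastai v1, `data.show_batch(rows=3)` displays images with their labels) and to sanity-check the label lists before building the DataBunch. The sketch below covers only the second part. `check_labels` is a hypothetical helper, and the toy lists stand in for the real `file_names` and `labels`.

```python
from collections import Counter

def check_labels(file_names, labels):
    """Basic sanity checks on parallel lists of file names and labels."""
    if len(file_names) != len(labels):
        raise ValueError(f"{len(file_names)} files but {len(labels)} labels")
    counts = Counter(labels)
    return {
        "n_items": len(labels),
        "n_classes": len(counts),
        "min_per_class": min(counts.values()),
        "max_per_class": max(counts.values()),
        # The same file listed twice could end up in both train and validation.
        "duplicate_files": len(file_names) - len(set(file_names)),
    }

# Toy stand-in for the real lists.
file_names = ["a1.jpg", "a2.jpg", "b1.jpg", "b2.jpg", "b3.jpg"]
labels = ["a", "a", "b", "b", "b"]
report = check_labels(file_names, labels)
print(report)
```

For the dataset described in this thread, one would expect roughly 200k items, about 5000 classes, and around 40 images per class. Large deviations would be worth investigating.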

What do your classes represent and where did the labels come from?

Hello @yeldarb, yes, I initially trained with all layers frozen except the last one, then unfroze all the layers and looked for a good learning rate using lr_find.
The code above is what I ran after those steps. The labels and classes were prepared and verified before I started training.

Yes and what were the results?

5000 classes… seems like a facial recognition problem. I previously saw someone use Siamese networks to solve this. I’m not sure whether they are implemented in fastai.
https://www.quora.com/What-are-Siamese-neural-networks-what-applications-are-they-good-for-and-why

In my own experience, Siamese networks may offer 3 distinct advantages over traditional classification.

These advantages are somewhat true for any kind of data, and not just for Images (where these are currently most popularly used).

  1. Can be more robust to extreme class imbalance.
  2. Can be good to ensemble with a classifier.
  3. Can yield better embeddings.

Let’s say we want to learn to predict what animal is in a given image.

  • Case 1: if there are just 2 animal classes to predict (cats vs. dogs) and millions of images of each class, one could train a deep CNN classifier. Easy!
  • Case 2: but what if we have tens of thousands of animal classes and, for most of these, only a few dozen example images? Learning each animal as a class with a deep CNN seems less feasible now. Such a classifier can perform poorly on rarely seen training classes, e.g. if there were only 4 training images of ‘eels’.
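
To make the Siamese idea more concrete: instead of predicting one of N classes, the network embeds two images and is trained so that same-class pairs end up close together and different-class pairs end up at least a margin apart. The sketch below shows one common formulation of that objective (a contrastive loss) on hand-written embeddings in plain Python. The function names, the margin, and the vectors are illustrative assumptions. In practice the embeddings would come from a shared CNN, and the quoted answer does not say which loss it has in mind.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs beyond the margin."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

anchor = [0.0, 0.0]
near = [0.3, 0.4]  # close to the anchor
far = [3.0, 4.0]   # far from the anchor

print(contrastive_loss(anchor, near, same_class=True))   # small penalty
print(contrastive_loss(anchor, near, same_class=False))  # penalised: inside the margin
print(contrastive_loss(anchor, far, same_class=False))   # no penalty: beyond the margin
```

Because the loss is defined on pairs, a class with only a handful of images still contributes many training pairs. This is one reason the approach is often suggested for many-class, few-example problems.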

I found that a large FC layer helped on an unbalanced 1500-class task. I would try running experiments on fewer classes (500, 1000, 2000, etc.) and see if performance falls off a cliff somewhere. Another strategy is to run a coarse classifier followed by a fine-grained one, if your class distinctions are broad enough. E.g., imagine separate ‘car’, ‘van’, and ‘motorcycle’ classifiers rather than a single ‘vehicle models’ classifier.
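
The fewer-classes experiment could be set up by filtering the file and label lists before building the DataBunch. This is a minimal sketch, assuming parallel `file_names` and `labels` lists. `subset_by_classes` is a hypothetical helper, and the toy data stands in for the real lists.

```python
import random

def subset_by_classes(file_names, labels, n_classes, seed=0):
    """Keep only the items belonging to a random sample of n_classes classes."""
    classes = sorted(set(labels))
    if n_classes > len(classes):
        raise ValueError(f"asked for {n_classes} classes, only {len(classes)} available")
    keep = set(random.Random(seed).sample(classes, n_classes))
    pairs = [(name, label) for name, label in zip(file_names, labels) if label in keep]
    return [name for name, _ in pairs], [label for _, label in pairs]

# Toy stand-in: 10 classes with 4 images each.
labels = [f"class_{c}" for c in range(10) for _ in range(4)]
file_names = [f"img_{i}.jpg" for i in range(len(labels))]

for k in (2, 5, 10):
    sub_files, sub_labels = subset_by_classes(file_names, labels, k)
    print(k, len(sub_files), len(set(sub_labels)))
```

Training the same model on each subset and plotting error rate against the number of classes would show whether there is a point where performance drops sharply.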

@yeldarb the results before unfreeze were similar to the one I posted above.
2.767628 13.331071 0.999230

@hanman, thank you for your input; let me see if a Siamese network can help.
And yes, this problem does fall into the Case 2 you mentioned.

So the thing I was trying to get at was this: if you accidentally trained with too high a learning rate and your model diverged, it could be in an unrecoverable state, and no matter how much more training you do, it may never come back to reasonable weights.

That’s why I suggested trying to start again from the beginning.
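
The divergence mechanism can be seen on a toy problem. The sketch below is not a neural network. It is plain gradient descent on f(w) = w², where each step multiplies w by (1 - 2·lr), so the weight shrinks for small learning rates and grows without bound once lr exceeds 1. The function name and values are illustrative.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

w_small = gradient_descent(lr=0.1)  # each step scales w by 0.8
w_large = gradient_descent(lr=1.1)  # each step scales w by -1.2

print(f"lr=0.1 -> w = {w_small:.2e}")
print(f"lr=1.1 -> w = {w_large:.2e}")
```

On this toy problem a smaller learning rate would eventually recover. In a deep network, blown-up weights can wipe out the useful pretrained features, which is why restarting from the pretrained weights is usually the simpler fix.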

the results before unfreeze were similar to the one I posted above.
2.767628 13.331071 0.999230

If these numbers are from the end of your training run they’re not helpful in diagnosing the issue. You’d want to see if your error_rate and loss were getting worse and worse from the beginning.

For instance, this is what happens if you initially train lesson1-pets with a learning rate of 1 and then unfreeze, do the learning rate finder, and try to bring things back to sanity:

With 5000 classes I imagine it would be even harder for it to start finding its way back than with only 37.
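
One way to look at the whole trajectory rather than the last row is to summarise the per-epoch values (in fastai v1 these are recorded in `learn.recorder.val_losses` and `learn.recorder.metrics`). The sketch below uses the 20 epochs posted earlier in the thread as stand-in data. `summarize_trend` is a hypothetical helper, and the same check would ideally be applied to the very first frozen run.

```python
def summarize_trend(values):
    """Return the first, last and best (lowest) value and the 1-based epoch of the best."""
    best = min(values)
    return {
        "first": values[0],
        "last": values[-1],
        "best": best,
        "best_epoch": values.index(best) + 1,
    }

# Values copied from the 20-epoch table posted earlier in the thread.
valid_loss = [13.034577, 13.079821, 13.346167, 13.503422, 13.832283,
              13.872521, 13.705756, 13.871474, 13.856788, 13.711730,
              13.350389, 13.591689, 13.637679, 13.508119, 13.523503,
              13.388854, 13.312586, 13.408681, 13.210934, 13.331071]
error_rate = [0.999333, 0.999333, 0.999281, 0.999256, 0.999384,
              0.999230, 0.999102, 0.999307, 0.999281, 0.999050,
              0.999230, 0.999204, 0.999358, 0.999230, 0.999256,
              0.999102, 0.999153, 0.999179, 0.999204, 0.999230]

print("valid_loss:", summarize_trend(valid_loss))
print("error_rate:", summarize_trend(error_rate))
```

In that run the validation loss is lowest in the first epoch and never improves, so the model was already in a bad state before the run started.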

@yeldarb, thank you for your continued input.
Here’s what I did:

    learn.fit_one_cycle(10)
    # train_loss went down each epoch as expected, but valid_loss and
    # error_rate stayed at about 12 and 99.99% respectively
    learn.recorder.plot_losses()
    learn.lr_find()
    learn.recorder.plot()
    learn.fit_one_cycle(10, max_lr=slice(1e-6, 1e-4))

I am running the last line in a loop. As I pointed out before, after each call to fit_one_cycle the training loss consistently goes down, but valid_loss and error_rate remain high.

epoch train_loss valid_loss error_rate
20 2.767628 13.331071 0.999230

After a few cycles, the training loss is at 2.1, but valid_loss and error_rate remain unchanged.
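
One thing to keep in mind when calling fit_one_cycle in a loop: each call runs a complete schedule, warming the learning rate up to max_lr and then annealing it back down, so the loop restarts the warm-up every time. The sketch below is a simplified illustration of such a schedule with cosine interpolation, not fastai’s actual implementation. The `pct_start=0.3` and `div_factor=25` values are assumptions about typical defaults, and a single scalar max_lr is used instead of a slice.

```python
import math

def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0):
    """Learning rate at a given step: warm up to max_lr, then anneal to a tiny value."""
    warm_steps = int(total_steps * pct_start)
    low_lr = max_lr / div_factor
    if step < warm_steps:
        return cos_anneal(low_lr, max_lr, step / warm_steps)
    return cos_anneal(max_lr, low_lr / 1e4,
                      (step - warm_steps) / (total_steps - warm_steps))

total_steps, max_lr = 100, 1e-3
for step in (0, 15, 30, 65, 99):
    print(step, f"{one_cycle_lr(step, total_steps, max_lr):.2e}")
```

A schedule of this shape may explain why train_loss in the 20-epoch table rises for the first few epochs and only then falls: the learning rate is still climbing during that part of each cycle.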

@iyersathya Were you able to solve this? Can you share what you learned here?
