Understanding Help: Layers and classifications

I’m trying to wrap my head around how I would scale out to a much larger number of classifications. Typically for transfer learning we drop the last softmax layer and add a new one sized to our number of classifications. Usually there are fewer classifications (Dogs vs. Cats has 37 breeds), but ImageNet-style data could take this up to 100,000 classes (“Dogs vs. Cats” for all the animal breeds!). It would look like this.

However, that doesn’t seem right. I can’t shake the feeling that there need to be more layers in this case. Just as ImageNet has Animal -> Chordate -> Vertebrate -> Bird -> Parrot, we could have a CNN to detect Animals, make another CNN to detect Chordates, etc.

Or we could build that hierarchy into the model itself. In that case I would think we would add additional layers, like below, so that we could ramp up to the target number of classes.

Is this a reasonable assumption, or am I just being a little crazy here? Are there any good papers on this type of problem?


In terms of the model structure, I think the first approach you suggest is fine. Remember that if the final layer of your ResNet has N activations, you’ll be adding N x 100,000 new weights. You can think of it as fitting a separate logistic regression with N inputs for each of the 100,000 categories.
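A minimal PyTorch sketch of that first approach, assuming a 2048-feature backbone and 100,000 target classes (both sizes are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

# Assumed sizes: a ResNet-style backbone with 2048 output features,
# scaled up to a 100,000-way classification head.
n_features, n_classes = 2048, 100_000

backbone_out = torch.randn(4, n_features)  # stand-in for the backbone's pooled features

# Replacing the old softmax head is just swapping in a wider Linear layer:
# n_features x n_classes new weights, i.e. one logistic regression per class.
head = nn.Linear(n_features, n_classes)

logits = head(backbone_out)
print(logits.shape)  # one logit per class for each example in the batch
```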

The sticking point is actually the computational cost of that enormous matrix multiplication, and of efficiently exponentiating 100,000 logits for each training example.
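A quick back-of-the-envelope calculation makes the scale concrete (the 2048-feature size is an assumption carried over from a typical ResNet):

```python
# Rough cost of a flat 100,000-way head on a 2048-feature backbone.
n_features = 2048
n_classes = 100_000

weights = n_features * n_classes  # 204,800,000 weights in the new layer alone
params = weights + n_classes      # plus one bias per class

bytes_fp32 = params * 4           # 4 bytes per float32 parameter
print(f"{params:,} parameters, ~{bytes_fp32 / 1e9:.2f} GB in fp32")
```

That single layer can dwarf the backbone it sits on, and every forward pass must compute (and every softmax must normalize over) all 100,000 logits.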

Because the output of a language model comes down to predicting the next word from a large vocabulary, this problem comes up constantly in that context. A solution often used there follows an idea along the lines of your Animal -> Chordate -> Vertebrate -> Bird -> Parrot hierarchy: essentially, you partition your vocabulary by word frequency. This approach is called “Adaptive Softmax” - https://arxiv.org/abs/1609.04309.
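PyTorch ships this as `nn.AdaptiveLogSoftmaxWithLoss`. A sketch, assuming the same 2048-feature / 100,000-class setup as above; the cutoff values are illustrative, and the technique requires class indices to be sorted by frequency (most frequent first):

```python
import torch
import torch.nn as nn

n_features, n_classes = 2048, 100_000  # assumed sizes, as before

# Cutoffs partition the label space into a frequent "head" (classes 0-999)
# and progressively rarer tail clusters, which get cheaper, lower-dimensional
# projections. Cutoff choices here are arbitrary assumptions.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=n_features,
    n_classes=n_classes,
    cutoffs=[1_000, 10_000],
)

features = torch.randn(4, n_features)
targets = torch.randint(0, n_classes, (4,))

out = adaptive(features, targets)
print(out.loss)  # scalar training loss; full 100k softmax is never materialized
```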

There’s a nice explanation of this approach here https://towardsdatascience.com/speed-up-your-deep-learning-language-model-up-to-1000-with-the-adaptive-softmax-part-1-e7cc1f89fcc9

You’ll have to think about how applicable this is to your domain; for example, if your classes are uniformly distributed, this approach may not help.


If the last layer of the ResNet outputs 2048 features, then the classifier assigns each class to a possible combination of these 2048 features. There are many more than 100,000 possible combinations, so in theory each of the 100,000 classes can be distinguished from the others.

However, this only works if (some of) the 2048 patterns that are detected by this last ResNet layer are present in all of the 100,000 classes. If there is a class X that matches with none of these patterns, it will never be detected.

…That hierarchical relationship is what Hinton’s Capsule Networks aim to address:


Yep, but unfortunately capsule networks are hard to train on large datasets.