How can the Google Vision API predict millions of labels when state-of-the-art image recognition struggles with 1,000 categories?

I’ve been trying to figure this out for a while but can’t completely understand how Google can predict millions of labels for images. I can’t imagine a CNN with 1 million output labels.

One way would be to use a hierarchy of classifiers, but errors would cascade down the hierarchy. Moreover, the granularity of labels in the Google Vision API would require hundreds or thousands of classification models. Can anyone who has worked in industry long enough answer this?


One option is a CNN that outputs an embedding rather than class scores. You can then query the closest points in that embedding space and predict your classes from the labels attached to those points.

By doing that, you can build a product similar to the Google Vision API. Your accuracy then depends on the quality of your embedding space and the number of images you have in that space.
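
To make the idea concrete, here is a minimal sketch of the lookup step in plain Python. It is not how Google actually does it, just an illustration of nearest-neighbour labelling. The three-dimensional vectors stand in for the output of a trained CNN, and `INDEX`, `predict_labels` and `add_label` are hypothetical names made up for this example. A real system would use embeddings with hundreds of dimensions and an approximate nearest-neighbour index instead of a full sort.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_labels(query, index, k=3):
    """Return the labels of the k nearest embeddings, most frequent first."""
    # Brute-force search: rank every stored embedding by similarity to the query.
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return [label for label, _ in votes.most_common()]


def add_label(index, embedding, label):
    """Adding a new class is just adding a point; the network is not retrained."""
    index.append((embedding, label))


# Toy index of (embedding, label) pairs. The vectors are invented for this
# example; in practice they would come from the CNN.
INDEX = [
    ([0.9, 0.1, 0.0], "golden retriever"),
    ([0.8, 0.2, 0.1], "golden retriever"),
    ([0.1, 0.9, 0.0], "tabby cat"),
    ([0.0, 0.8, 0.2], "tabby cat"),
    ([0.1, 0.0, 0.9], "espresso machine"),
]

if __name__ == "__main__":
    # Embedding of a new, unseen image (again invented for the example).
    print(predict_labels([0.85, 0.15, 0.05], INDEX, k=3))
```

The point of the design is that the number of labels is no longer tied to the size of the network's output layer: the label set grows by calling `add_label` on the index, and the cost moves from training a huge classifier to searching a large set of vectors.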