I’ve been trying to figure this out for a while but can’t completely understand how Google can predict millions of labels for images. I can’t imagine a CNN with 1 million output labels.
One way would actually be to use a hierarchy of classifiers but that would also cascade the errors. Moreover, the granularity of labels in Google vision api would require hundreds or thousands of classification models. Can anyone who has worked long enough in industry answer this ?