Without googling first (uh oh), my gut reaction would be to make a model per class. In particular, I’d take VGG, reset the dense layers, and train with binary labels. All those forward passes might be too expensive in production, though. With some more thought or some googling I think we could figure out a single-model solution.
Edit:
More thoughts:
Hmm. The targets could be…sparse like a one-hot encoding but while allowing more than one class turned on in the vector. I wonder how this would interact with the softmax layer. Maybe we wouldn’t want the outputs to sum to 1 anymore, or maybe that wouldn’t matter. Would we pick a threshold? How would we find it? Hmm.
Edit:
One idea would be to start by training a separate model for each class as a benchmark. Then, you could combine those models into a larger model (say, by using Keras’ merge function), putting some dense layers on it (or maybe even more convolutional layers) and seeing how it does, with respect to the benchmark.
Maybe the larger would perform better because it has more information and capacity for learning, but maybe it would perform worse in a red herring way. And this might depend on the class. Maybe combining two specialist networks A and B would perform better than A on A’s class, but perform worse than B on B’s class. In this case you could use the larger network for A’s class and use B for B’s class.
Edit:
Here’s a paper (2014): CNN: Single-label to Multi-label
I couldn’t figure out what they’re doing from the abstract, but mentioning semantic segmentation gave me an expensive idea for multi-label classification. You could segment the images into regions and label those regions (e.g. people are here, crop X is there, crop Z is there) for your data. Then, you could train a semantic segmentation network on this data. Then, after doing a forward pass, you could consider an entity to be present if the number (or percentage) of pixels corresponding to it is above a threshold. I think the threshold would have to be different for each class. And spatial resolutions would affect it and so would how much the object fills the image…this idea is dying fast.
Okay. I hope these thoughts helped. To go further it would help to know more about your problem; for example, knowing the kinds of things you want to detect and the nature of the images (e.g. will they be taken from a satellite, a drone, or a person?).