Seeking input on a multi-view classifier architecture for a fine-grained task

I’ve got an application where I need to accurately classify a set of multiple items in a container, and I’m looking for advice on a model architecture to solve it. Since the application is somewhat confidential, I’ll propose an analogous problem that shares its notable features: sorting collector coins by which design they’re minted with.

Let’s imagine I have a collectible coin business and I need to fulfill collectors’ orders for specific state/year US quarters. When I fill the orders, I want to run the quarters through an automated check to confirm I’ve got the right coins.

Features of the problem:
⁃ I’ve already got the quarters in cups, and all the quarters in a given cup should be identical
⁃ I’ve got a fixed webcam that I can put the cups under that’s taking the photos
⁃ Each cup will have multiple quarters, not necessarily facing the same direction (face up or down)
⁃ The check will either output a confirmation that these are the right coins, or prompt a human to verify
⁃ I want to automate checks for as many of the quarters as possible, but since some of these coins are quite valuable, I’d rather optimize for precision at the cost of recall.
⁃ There are hundreds of types of quarter, so only being able to precisely classify a subset of them is okay
⁃ If the model lends itself to active learning as I check the quarters or capture more photos of rarer coins, that would be a big plus

The previous way I’ve approached this problem is as follows:
⁃ Train an object detection model that’s capable of drawing a good bounding box around each quarter
⁃ Train a resnet classifier to classify the individual coins into classes
⁃ Run inference:
⁃ Use the object detector to draw bounding boxes on the image with multiple quarters
⁃ Crop out each quarter
⁃ Run that cropped image through the classifier
⁃ Combine together the results of each resnet classification

Full image -> detection -> crops -> classifier -> vote aggregation

I’ve tried aggregating the resnet results both by using each result’s single label as a vote, and by summing the outputs of the final softmax and picking the highest value. Right now this technique performs pretty well on classes that are not very similar to one another (i.e. different sizes or colors of coin); however, on coins that are very similar save for some fine detail, the model is sometimes a bit of a coin toss between them. It also seems to suffer a bit as the number of classes increases, although the added classes also have fewer training images, which is a confounding factor.
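To make the aggregation step concrete, here’s a minimal sketch of both strategies I’ve tried, assuming `probs` holds the per-crop softmax outputs from the classifier (function names and the confidence threshold are illustrative, not my actual code):

```python
import numpy as np

def aggregate_by_vote(probs: np.ndarray) -> int:
    """Each crop casts one vote for its argmax label; probs is (n_coins, n_classes)."""
    votes = np.bincount(probs.argmax(axis=1), minlength=probs.shape[1])
    return int(votes.argmax())

def aggregate_by_softmax_sum(probs: np.ndarray) -> int:
    """Sum the softmax outputs across crops and pick the class with the highest total."""
    return int(probs.sum(axis=0).argmax())

def confirm_or_escalate(probs: np.ndarray, threshold: float = 0.9):
    """Precision-over-recall gate: auto-confirm only when the pooled probability is high,
    otherwise return None so the cup gets kicked back to a human."""
    mean_probs = probs.mean(axis=0)
    label = int(mean_probs.argmax())
    return label if mean_probs[label] >= threshold else None
```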

One weakness of this approach is that it forces each neural-net classification to be done in a vacuum, whereas a human doing the comparable task is more likely to look around at a couple of coins and make an evaluation based on the aggregate. To improve performance, I’ve been thinking it would help to use a model architecture that lets the model learn from the whole ensemble of coins in the image. Since they’re all the same in a coin cup, it seems like using the group should provide more information to the model than each tiny classification task alone. Some architecture ideas I’ve thought of are below:
⁃ Super basic version: make a resnet-based classifier that runs on the full images, not the crops
⁃ Use a deep metric learning approach: train a backbone with a contrastive, triplet, or other pair-wise loss function, then use Euclidean distance or cosine similarity to compare the mean of a group of cropped-coin embeddings against known class embeddings. I know embedding-similarity approaches like this can perform well on fine-grained classification tasks such as similar-looking coins.
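To flesh out that second idea a little, here is a minimal sketch of the comparison step, assuming I already have an embedding model and a tensor of per-class prototype embeddings (all names and the similarity threshold are illustrative):

```python
import torch
import torch.nn.functional as F

def classify_cup_by_embedding(crop_embeddings: torch.Tensor,
                              class_prototypes: torch.Tensor,
                              min_similarity: float = 0.8):
    """
    crop_embeddings:  (n_coins, d) embeddings of each cropped coin in the cup.
    class_prototypes: (n_classes, d) mean embedding of each known coin class.
    Returns the best class index, or None if similarity is too low to auto-confirm.
    """
    # Average the coin embeddings so the whole cup is compared at once.
    cup_embedding = F.normalize(crop_embeddings.mean(dim=0), dim=0)
    prototypes = F.normalize(class_prototypes, dim=1)

    similarities = prototypes @ cup_embedding   # cosine similarity per class
    best = int(similarities.argmax())
    return best if similarities[best] >= min_similarity else None
```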

The model idea I’m most interested in fleshing out, however, is as follows:
⁃ Make a model that takes in multiple crops from the full-cup image, runs each of them through the backbone layers of a resnet to obtain an activation per crop, then combines those into a single fixed-length activation (using max pooling?) which passes through two fully-connected layers at the end for classification.



This model idea is based on the default cnn_learner in fastai, with a resnet backbone feeding into two fully connected layers that perform the final classification. The goal is to allow the FC layers at the end to look at some representation of “all of the coins in an image” instead of only individual crops. Hopefully this will let the training process internalize some idea of which kinds of cropped images are more informative than others, so that the pooled activations are more representative of each class.
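A rough sketch of what I’m imagining in PyTorch, assuming a recent torchvision and that the crops for one cup arrive as a single tensor; this mirrors the resnet-backbone-plus-head pattern but the wiring is my own guess, not an existing fastai module:

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiCropCoinClassifier(nn.Module):
    """Runs a shared resnet backbone over every crop from one cup,
    max-pools the per-crop features, and classifies the pooled vector."""

    def __init__(self, n_classes: int, hidden: int = 512):
        super().__init__()
        backbone = models.resnet34(weights=models.ResNet34_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        feat_dim = backbone.fc.in_features  # 512 for resnet34
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Dropout(0.25),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: (n_coins, 3, H, W) -- the number of coins can vary from cup to cup.
        feats = self.backbone(crops).flatten(1)   # (n_coins, feat_dim)
        pooled, _ = feats.max(dim=0)              # max-pool across coins -> (feat_dim,)
        return self.head(pooled.unsqueeze(0))     # (1, n_classes) logits for the whole cup

# Usage: one "sample" is a whole cup, so the simplest setup is batch size 1 per cup.
# model = MultiCropCoinClassifier(n_classes=200)
# logits = model(torch.randn(7, 3, 224, 224))  # 7 coins in this cup
```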

I have a few questions about this model:
⁃ Is it possible to have a module that runs the resnet backbone over a variable number of cropped images, given that the number of crops will differ from image to image?
⁃ Does this plan pass the smell test? Will it likely achieve what I’m describing?
⁃ In making the classifications, since the fronts of coins will look relatively similar to one another and the backs are much more distinctive, is it likely better to separate the fronts and backs of coins into separate classes?

In addition, I would appreciate any advice on other architectures/pipelines that might be better suited to this task. In keeping with the coin identification analogy, the ability to achieve pretty high precision on fine-grained classification is key, especially across large numbers of classes (relatively low recall is okay here, since uncertain cases can be kicked back to a human for verification).

Thank you!

Try including a triplet loss function along with the classification step. Triplet loss will help your classifier distinguish two nearly identical coins that belong to different classes. Refer to this paper: https://arxiv.org/abs/1903.10663. This can give an additional boost to the model.
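As a rough illustration (my own sketch, not from the linked paper), PyTorch ships a built-in `TripletMarginLoss` that you can compute on the backbone embeddings alongside the usual classification loss; the `backbone` here is a toy stand-in for your resnet:

```python
import torch
import torch.nn as nn

triplet_loss = nn.TripletMarginLoss(margin=1.0)

# Each triplet pairs an anchor crop with a positive (same coin class) and a
# negative (a visually similar but different class, e.g. two near-identical quarters).
backbone = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 128))  # toy stand-in

anchor   = backbone(torch.randn(8, 3, 224, 224))   # (8, 128) embeddings
positive = backbone(torch.randn(8, 3, 224, 224))
negative = backbone(torch.randn(8, 3, 224, 224))

loss = triplet_loss(anchor, positive, negative)  # small when anchor is closer to positive
loss.backward()                                  # than to negative by at least the margin
```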

I don’t understand the object detection step. If you know that all the quarters in the image are the same, you could just crop the image into different regions.

About the classifier, you have different options:

  1. Train a regular classifier including triplet loss, as SamJoel said. Also, for the regular classifier loss (I suppose you are using CrossEntropy), I would try to focus on these confused examples by using focal loss, hard example mining, loss weighting, and/or upsampling techniques (see the focal loss sketch after this list).
  2. If that doesn’t work, train a regular classifier, and for the classes the model confuses most, train another model to deal with those cases:
    • A regular classifier with only those classes plus triplet loss.
    • A siamese network trained with triplet loss.
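For reference, focal loss down-weights easy examples so the gradient is dominated by the hard, confusable coins. A minimal multi-class version (my own sketch of the standard formulation) looks like this:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Multi-class focal loss: (1 - p_t)^gamma * cross_entropy, averaged over the batch.
    logits: (batch, n_classes) raw scores; targets: (batch,) class indices."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log(p_t) per example
    p_t = torch.exp(-ce)                                      # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# Easy, confidently-correct coins contribute almost nothing to the loss;
# near-identical quarters the model gets wrong keep their full weight.
```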

Finally, my two cents on the classifier head. I would use the first approach, where each classifier makes a prediction and then votes, with two modifications:

  1. Train the classifiers as a multi-label problem. That way you can know when they are uncertain about some predictions (it will help for the voting). If you approach the problem as single-label classification, you are forcing the model to choose only one label even when it may be unsure between two labels.
  2. Use a model (like LightGBM) to make the vote instead of directly averaging the results. Why? You can treat each coin prediction as an independent inference, just as you would with one image and an ensemble of models, so you can exploit that to increase the global model accuracy.
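One way to set up that learned vote (a hedged sketch; the feature choices are my own invention, assuming you log the per-coin probability vectors from the base classifier):

```python
import numpy as np
from lightgbm import LGBMClassifier

def cup_features(per_coin_probs: np.ndarray) -> np.ndarray:
    """Summarize the (n_coins, n_classes) softmax outputs of one cup into a fixed-length
    feature vector: per-class mean, per-class max, and the number of coins."""
    return np.concatenate([
        per_coin_probs.mean(axis=0),
        per_coin_probs.max(axis=0),
        [per_coin_probs.shape[0]],
    ])

def train_vote_model(cups, labels):
    """cups: list of (n_coins_i, n_classes) arrays, one per cup; labels: true class per cup."""
    X = np.stack([cup_features(p) for p in cups])
    y = np.asarray(labels)
    meta = LGBMClassifier(n_estimators=200)
    return meta.fit(X, y)
```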

I hope you find this useful.

Thanks, this paper is a great resource for creating generalized descriptors from a CNN backbone. I also think that the ensemble effect might boost performance a fair bit on the hard classification steps.

For combining loss functions, is it sufficient to sum the two losses together and let the optimizer descend the combined curve?

Yeah, mostly combining losses just means adding them up. But there’s also another method where you weight the losses, e.g. total_loss = a1*loss1 + a2*loss2, where a1 and a2 are hardcoded at the beginning of training. Those weights a1 and a2 control how much importance the model gives to each loss… intuitively, at least.
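In a PyTorch training step that might look like the following (a sketch; the weights and names are placeholders you would tune for your own setup):

```python
import torch.nn as nn

ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

A1, A2 = 1.0, 0.5  # hardcoded weights balancing the two objectives

def training_step(logits, targets, anchor, positive, negative, optimizer):
    """One optimizer step on the weighted sum of classification and triplet losses."""
    loss1 = ce_loss(logits, targets)
    loss2 = triplet_loss(anchor, positive, negative)
    total_loss = A1 * loss1 + A2 * loss2   # the optimizer descends the combined surface
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```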