Multi-class classification by combining multiple image segmentations

I am trying to classify a set of images where each image contains n instances of objects of the same class, and n can vary widely between images. (For example, suppose I have a number of trays, each holding an undetermined number of coins, and each tray is guaranteed to contain exactly one type of coin. The coins may be face up or face down. I take several images of each tray under different orientations and lighting conditions, and I want to classify the coins by tray.) See the Multiple Instance Learning wiki page.
My approach is to take each photo, draw bounding boxes around the individual coins, and crop each bounding box into its own segmentation image. I then trained a ResNet-34 to classify the single coin in each cropped image. I have a few questions about next steps:

  • What are some ways I can improve precision on the cropped images?
  • What are some ways I can aggregate the predictions for individual coin segmentations into one prediction for the entire tray?
  • How can I measure confidence in a tray-level prediction?

I am considering:

  • simple polling (for each tray, pick the class predicted for the largest number of individual coins)
  • weighted polling (sum the softmax output vectors over all coins in a tray and pick the class with the largest summed probability)
  • Monte Carlo Dropout at inference time for the individual coin segmentations, then computing the predictive entropy as a measure of confidence and using it as a threshold to filter segmentations before aggregating their predictions
  • plotting multi-class precision-recall curves
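
To make the first three options concrete, here is a minimal NumPy sketch of how they might look. It assumes the per-coin softmax outputs for one tray are already available as an array of shape (n_coins, n_classes), and that the Monte Carlo Dropout passes have been run elsewhere and stacked into an array of shape (T, n_coins, n_classes). The function names and the max_entropy threshold are hypothetical and only for illustration; the threshold would need tuning on validation data.

```python
import numpy as np

def majority_vote(probs):
    """Simple polling: each coin votes for its argmax class.
    probs: (n_coins, n_classes) softmax outputs for one tray."""
    votes = np.bincount(probs.argmax(axis=1), minlength=probs.shape[1])
    return int(votes.argmax()), votes

def soft_vote(probs):
    """Weighted polling: sum softmax vectors over coins, take the argmax.
    Also returns the normalized sums as a rough tray-level score."""
    total = probs.sum(axis=0)
    return int(total.argmax()), total / total.sum()

def predictive_entropy(mc_probs):
    """Entropy of the mean softmax over T stochastic (dropout) passes.
    mc_probs: (T, n_coins, n_classes). Returns one value per coin."""
    mean = mc_probs.mean(axis=0)
    return -(mean * np.log(mean + 1e-12)).sum(axis=1)

def filtered_soft_vote(mc_probs, max_entropy):
    """Drop coins whose predictive entropy exceeds max_entropy (an assumed,
    hand-picked threshold), then soft-vote on the remaining coins."""
    mean = mc_probs.mean(axis=0)
    keep = predictive_entropy(mc_probs) <= max_entropy
    if not keep.any():
        keep[:] = True  # fall back to using every coin
    return soft_vote(mean[keep])

# Toy tray with three coins and three classes (made-up numbers).
probs = np.array([[0.90, 0.10, 0.0],
                  [0.40, 0.60, 0.0],
                  [0.45, 0.55, 0.0]])
print(majority_vote(probs)[0])  # class chosen by simple polling
print(soft_vote(probs)[0])      # class chosen by weighted polling
```

On this toy tray the two schemes disagree: two of the three coins narrowly favor class 1, but the single confident coin pulls the summed probabilities toward class 0. That is the practical difference between simple and weighted polling.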