I recently came to understand what the Kappa score actually measures. Now I would like to know whether there are metrics that can help in this scenario: I have two models, A and B, and a single image to classify. The Kappa score tells me how much A and B agree over a whole collection of images, but is there a metric that tells me which model is more trustworthy for a particular prediction?
I’ll use the example of the teddy bear classifier. Suppose model A is best at recognizing teddy bears and model B is best at black bears, but I do not know this in advance. Is there a metric that lets me choose which prediction to trust? I understand that I could use a validation set and manually look at class-wise accuracy, but is that the only way?
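To make the validation-set idea concrete, here is a rough sketch of what I have in mind. It is only an illustration under several assumptions: the labels and predictions are made up, the third class `grizzly` is hypothetical, and the helper names `per_class_precision` and `pick_prediction` are my own. It also uses per-class precision (how often a model is right when it predicts a given class) rather than class-wise accuracy, since that seems closer to asking whether a specific prediction can be trusted.

```python
from collections import defaultdict


def per_class_precision(y_true, y_pred):
    """For each predicted class, the fraction of those predictions that were correct."""
    predicted = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        if truth == pred:
            correct[pred] += 1
    return {label: correct[label] / predicted[label] for label in predicted}


def pick_prediction(pred_a, pred_b, prec_a, prec_b):
    """Trust the model whose predicted class had the higher validation precision.

    Ties go to model A; a class never predicted on validation scores 0.0.
    """
    score_a = prec_a.get(pred_a, 0.0)
    score_b = prec_b.get(pred_b, 0.0)
    return ("A", pred_a) if score_a >= score_b else ("B", pred_b)


# Made-up validation data (assumption, not real results).
y_true = ["teddy", "teddy", "black", "black", "grizzly", "grizzly"]
val_a = ["teddy", "teddy", "grizzly", "black", "black", "grizzly"]
val_b = ["teddy", "grizzly", "black", "black", "grizzly", "teddy"]

prec_a = per_class_precision(y_true, val_a)
prec_b = per_class_precision(y_true, val_b)

# New image: A says "grizzly", B says "black".
choice = pick_prediction("grizzly", "black", prec_a, prec_b)
print(choice)
```

With this toy data, model A is only reliable when it says `teddy` and model B only when it says `black`, so the rule picks whichever model is predicting the class it has historically been right about. It still depends entirely on a labelled validation set, which is the part I am hoping a metric could replace.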