Accuracy metric for multi-label classification

Hi,

Chapter 6 of fastbook (fastbook/06_multicat.ipynb at master · fastai/fastbook · GitHub) suggests accuracy_multi as a metric for a multi-label classification model, i.e. one capable of assigning zero, one, or multiple labels to the same input image.

You can see the accuracy reaching 0.950877 at the end of fine_tune() for thresh = 0.2, and it goes higher, into the 0.96+ range, if thresh = 0.5 is selected.

I am struggling to interpret the significance of these relatively high accuracy values. Is 0.96 supposed to mean “96% accurate”?

Looking at the implementation of accuracy_multi:

def accuracy_multi(inp, targ, thresh=0.5, sigmoid=True):
    "Compute accuracy when `inp` and `targ` are the same size."
    inp,targ = flatten_check(inp,targ)
    if sigmoid: inp = inp.sigmoid()
    return ((inp>thresh)==targ.bool()).float().mean()

… the result of the function increases in two cases:

  1. if targ[i].bool() == True, indicating that the corresponding label is assigned in the dataset, AND inp[i] > thresh, indicating that the model prediction also assigned that label;

  2. if targ[i].bool() == False, indicating that the corresponding label is NOT assigned in the dataset, AND inp[i] <= thresh, indicating that the model prediction also did NOT assign that label (see the small example after this list).
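
To make this concrete, here is a minimal sketch (the tensor values are made up for illustration) showing how both cases contribute equally to the score:

import torch
from fastai.metrics import accuracy_multi

# one image, four possible labels; raw activations (sigmoid is applied inside)
inp  = torch.tensor([[ 2.0, -3.0, -1.0, -2.0]])  # sigmoid ≈ [0.88, 0.05, 0.27, 0.12]
targ = torch.tensor([[ 1.,   0.,   1.,   0. ]])  # labels 0 and 2 are assigned

accuracy_multi(inp, targ, thresh=0.5)
# TensorBase(0.7500) -- 3 of 4 cells are "correct": 1 true positive plus
# 2 true negatives; the missed label at index 2 is the only error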

There are 20 labels in this dataset:

len(dls.vocab)
---
20

Most images in the PASCAL_2007 dataset have just a single label assigned; some have a few; none have anywhere near 20. This means that the accuracy will be numerically quite high (close to 1) as long as the model doesn’t assign too many labels to each image, because every label the model correctly leaves unassigned counts as a hit.
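
A quick back-of-the-envelope check supports this (a sketch; targs is the multi-hot target matrix that learn.get_preds() returns, as used further down):

preds, targs = learn.get_preds()
targs.sum(dim=1).float().mean()   # average number of labels per image

Judging from the all-zeros accuracy of 0.9224 below, this works out to about 20 * (1 - 0.9224) ≈ 1.55 labels per image, so a model that predicts nothing is still “correct” on roughly (20 - 1.55) / 20 ≈ 92% of the label slots.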

Here is, for example, the impact on accuracy of setting all predictions to 0, which would be equivalent to not assigning any labels to any images:

# First, calculate accuracy of the model predictions after fine_tune
preds, targs, decoded = learn.get_preds(with_decoded=True)
accuracy_multi(preds, targs, thresh=0.5, sigmoid=False)
---
TensorBase(0.9643)

# Then, zero out all predictions, i.e. assign no labels to any image
zero_preds = torch.zeros(preds.shape)
accuracy_multi(zero_preds, targs, thresh=0.5, sigmoid=False)
---
TensorBase(0.9224)

The accuracy did go down, from 0.9643 to 0.9224. But intuitively 0.9224 still looks relatively high, suggesting a decent model, which an all-zeros predictor clearly is not!

Am I wrong, then, to interpret accuracy_multi approaching 1 as an indication that the model is converging?

And how do I deal with a multi-label classification model that assigns no labels to my images while its accuracy is close to 1, as illustrated above?

PS: I am contemplating this because my own multi-label classification model (for a different dataset) is producing seemingly great results, quickly converging to an accuracy of 0.97-0.98. But the classification results shown by learn.show_results() are awful, with most images not getting any labels assigned.
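
One quick way to confirm this failure mode is to count how many labels the model actually assigns per image (a sketch, reusing preds from learn.get_preds() as above; fastai returns post-sigmoid probabilities):

n_assigned = (preds > 0.5).sum(dim=1)   # labels assigned per image at thresh=0.5
(n_assigned == 0).float().mean()        # fraction of images getting no label at all

If that fraction is large while accuracy_multi stays high, the model has collapsed into the predict-nothing behavior described above.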

So, it turns out that accuracy_multi is in fact not the best metric for evaluating a multi-label classifier.

Here is a blog post on accuracy vs. precision vs. recall and the various forms of the F1 score:

Apparently, the samples-averaged F1 score is the most appropriate for a multi-label classifier. But here are all of them side by side:

th = 0.5

f1_macro = F1ScoreMulti(thresh=th, average='macro')
f1_macro.name = 'F1(macro)'
f1_samples = F1ScoreMulti(thresh=th, average='samples')
f1_samples.name = 'F1(samples)'
f1_micro = F1ScoreMulti(thresh=th, average='micro')
f1_micro.name = 'F1(micro)'
f1_weighted = F1ScoreMulti(thresh=th, average='weighted')
f1_weighted.name = 'F1(weighted)'
precision_samples = PrecisionMulti(thresh=th, average='samples')
precision_samples.name = 'Precision(samples)'
recall_samples = RecallMulti(thresh=th, average='samples')
recall_samples.name = 'Recall(samples)'

metrics=[partial(accuracy_multi, thresh=th), f1_macro, f1_micro, f1_samples, f1_weighted, precision_samples, recall_samples]

learn = vision_learner(dls, resnet50, metrics=metrics)
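
The exact training call isn’t shown here, but judging from the one frozen epoch followed by ten unfrozen epochs in the output below, it was presumably something like:

learn.fine_tune(10)   # fine_tune defaults to 1 frozen epoch before the unfrozen ones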

---

(frozen epoch)

epoch	train_loss	valid_loss	accuracy_multi	F1(macro)	F1(micro)	F1(samples)	F1(weighted)	Precision(samples)	Recall(samples)	time
0	0.889532	0.625631	0.687012	0.273983	0.294667	0.300496	0.422663	0.190353	0.866886	00:05

(unfrozen epochs)

epoch	train_loss	valid_loss	accuracy_multi	F1(macro)	F1(micro)	F1(samples)	F1(weighted)	Precision(samples)	Recall(samples)	time
0	0.691380	0.534147	0.766175	0.334117	0.361371	0.379858	0.479771	0.259652	0.875013	00:06
1	0.616614	0.422290	0.866952	0.444276	0.487374	0.516674	0.564330	0.412549	0.842112	00:06
2	0.488003	0.237596	0.952590	0.667484	0.704274	0.715103	0.710030	0.728679	0.768831	00:06
3	0.340728	0.151208	0.960896	0.674105	0.723793	0.695055	0.717896	0.745299	0.697371	00:06
4	0.236615	0.119779	0.963107	0.674595	0.726924	0.694284	0.717772	0.762616	0.676826	00:06
5	0.172831	0.110419	0.964143	0.674246	0.733412	0.699008	0.723016	0.770086	0.679137	00:06
6	0.133399	0.105740	0.964741	0.705270	0.743180	0.713393	0.737091	0.775930	0.699037	00:06
7	0.107876	0.103306	0.965319	0.712304	0.750823	0.725271	0.745438	0.784097	0.715405	00:06
8	0.094106	0.102901	0.965418	0.717272	0.752988	0.731740	0.747411	0.789509	0.723141	00:06
9	0.087290	0.102577	0.965259	0.717973	0.751567	0.728454	0.745915	0.785989	0.720684	00:06
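
As a final check, the samples-averaged F1 immediately exposes the predict-nothing failure mode that accuracy_multi hides. Here is a sketch using scikit-learn’s f1_score directly (which is what fastai’s F1ScoreMulti wraps); zero_division=0 avoids warnings for images with an empty true or predicted label set:

from sklearn.metrics import f1_score

# predictions and multi-hot targets from the newly trained learner
preds, targs = learn.get_preds()

f1_score(targs.int(), (preds > 0.5).int(), average='samples', zero_division=0)
# ≈ 0.73 -- in line with the F1(samples) column of the last epoch above

zero_preds = torch.zeros(preds.shape)
f1_score(targs.int(), (zero_preds > 0.5).int(), average='samples', zero_division=0)
# 0.0 -- the all-zeros "model" that still scored 0.9224 on accuracy_multi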