Is there a method for calculating the best accuracy threshold for multi-label classification? (Lesson 3)

In the multi-label classification example (Amazon Planet), Jeremy uses a threshold of 0.2 for the accuracy metric. Is there a method for finding this value mathematically for different models, or is it just a matter of trying different values and picking the best one?

Hi @babak.jfard,

You can try out different values and pick the best one.

Even better, you can create a graph to visualise it (this snippet is adapted from the course-v4 notebook), so it’s easy to pick the threshold which gives the highest accuracy.

Just stick the following code into a new cell:

# get predicted probabilities and targets for the validation set (fastai v1 API;
# get_preds already applies the sigmoid, hence sigmoid=False below)
preds, targs = learn.get_preds(ds_type=DatasetType.Valid)

# sweep 29 thresholds between 0.05 and 0.95 and compute the accuracy at each
xs = torch.linspace(0.05, 0.95, 29)
accs = [accuracy_thresh(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs, accs);
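
If you’d rather get the number directly than read it off the chart, you can take the argmax of the sweep. A small sketch reusing xs and accs from above (accuracy_thresh returns scalar tensors, so we stack them first):

# pick the threshold with the highest validation accuracy
best = xs[torch.stack(accs).argmax()]
print(f"best threshold: {float(best):.2f}")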

Hope this helps.

Butch

Hey @babak.jfard,

It depends on what you mean by “best threshold”. Do you want fewer false negatives or fewer false positives? Actually, for multi-label classification it is not so clear what accuracy, precision, recall, or false negatives/positives mean, because predictions can be partially correct, e.g. when only one of two labels is predicted correctly.

While I was trying to figure out how to calculate the accuracy correctly and which threshold to use, I got the impression that this accuracy_thresh implementation is not really correct for multi-label problems. In fact, the documentation says it is for one-hot encoded targets, but multi-label targets are not one-hot encoded.

Basically, there are different ways to calculate accuracy for multi-label problems, e.g. the Exact Match Ratio (subset accuracy), which requires all labels of a sample to be correct, or considering each predicted label individually, as sketched below. But neither seems to be implemented in fastai v1.
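
To make the difference concrete, here is a rough sketch of both variants in plain PyTorch (assuming preds are sigmoid probabilities and targs is a 0/1 indicator tensor, as returned by get_preds):

import torch

def exact_match_ratio(preds, targs, thresh=0.5):
    # a sample only counts as correct if ALL of its labels are predicted correctly
    return ((preds > thresh).float() == targs).all(dim=1).float().mean()

def per_label_accuracy(preds, targs, thresh=0.5):
    # every single label prediction counts, so partially correct samples
    # still contribute to the score
    return ((preds > thresh).float() == targs).float().mean()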

There was actually a suggestion on how to implement a better accuracy calculation for multi-label in the thread “A different variant of accuracy_thresh”, but it seems a pull request was never made.

I also found some other threads with similar questions, but none of them was satisfying to me.

Finally, it also seems that the notebook has an error. According to the post “No longer able to reproduce fastprogress’s accuracy_thresh in v1.0.24”, accuracy_thresh should be called with the parameter sigmoid=False, but by default it is set to True.

Thanks a lot @olaf.goerlitz for your detailed answer. It seems to me that a practical approach would be to test each label class with its own metric, rather than using one threshold for a mixture of several label classes. I was just curious whether there is an analytical background explaining a method for that single threshold.
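
A per-label threshold search along those lines could look something like this (a rough sketch, not a fastai API; it assumes preds and targs from get_preds as above):

# sweep thresholds separately for each label and keep the best one per label
xs = torch.linspace(0.05, 0.95, 19)
best_per_label = []
for c in range(targs.shape[1]):
    accs = [((preds[:, c] > t).float() == targs[:, c]).float().mean() for t in xs]
    best_per_label.append(float(xs[torch.stack(accs).argmax()]))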

@babak.jfard the book goes into this, check out this chapter:

Thanks @muellerzr. I could not find anything more than what is already available in v1. The book says:
“We can find the best threshold by trying a few levels and seeing what works best.”

Keep reading, it’s in there :wink: it’s quite literally four lines of code
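
For anyone reading along later, the four lines in question look roughly like this in the book (fastai v2 API; accuracy_multi is v2’s counterpart of v1’s accuracy_thresh):

preds, targs = learn.get_preds()
xs = torch.linspace(0.05, 0.95, 29)
# get_preds already applies the sigmoid, hence sigmoid=False
accs = [accuracy_multi(preds, targs, thresh=i, sigmoid=False) for i in xs]
plt.plot(xs, accs);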

Hi @babak.jfard,

I’m not sure there is a good analytical background for choosing a threshold. Personally, I prefer to leave the threshold at 0.5 for one simple reason: in classification you are trying to train a model which can clearly separate the classes/labels, i.e. in the binary case predict 0 and 1 as well as possible.

So if your model is equally good at predicting 0’s as it is at predicting 1’s, you will have many predicted probabilities close to 0 and many close to 1, and thus a broad margin between the two classes in the middle. A threshold of 0.5 will work nicely in this case.

However, if the model is not good at predicting one or both classes, the probabilities will be spread out more between 0.0 and 1.0. Then you can play with the threshold, or you can try to improve your model so that it separates the classes better. The same applies to multi-label classification, since it also uses binary cross-entropy and each label is predicted with a probability between 0 and 1.
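
One quick way to check this separation is a histogram of the predicted probabilities (a sketch, assuming preds from learn.get_preds()); a well-separated model shows two peaks near 0 and 1 with a wide gap in between:

import matplotlib.pyplot as plt

plt.hist(preds.flatten().numpy(), bins=50)
plt.xlabel("predicted probability")
plt.ylabel("count");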

Now, @muellerzr refers to the last block before the section about regression, where a chart shows the resulting accuracy for different thresholds. But as you can see, the best accuracy is achieved for a threshold around 0.5. :slight_smile:
This matches what I saw in my own experiments, so I don’t see a reason to choose a different threshold. I personally think that playing with the threshold is an optimisation at the wrong end.

However, one last important point: accuracy_multi in fastai is wrong. I proposed a correct implementation in this bug report, but it was rejected by Jeremy for reasons that were unclear to me.

One more side note: if you have imbalanced data (which you typically always have in the real world), accuracy is not a suitable metric. It is better to use precision and recall (or other techniques that balance the influence of individual classes in the calculation of accuracy and loss).
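
scikit-learn’s metrics work directly on multi-label 0/1 indicator arrays, so a minimal sketch could look like this (again assuming preds and targs from get_preds):

from sklearn.metrics import precision_score, recall_score

pred_labels = (preds > 0.5).numpy()   # binarise the probabilities
true_labels = targs.numpy()

# 'macro' averages the per-label scores, so rare labels weigh as much as common ones
print(precision_score(true_labels, pred_labels, average='macro', zero_division=0))
print(recall_score(true_labels, pred_labels, average='macro', zero_division=0))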

Just want to add on here that I found training a model with a certain threshold can create a bias towards that threshold (i.e. training with a 0.2 threshold had an optimum around 0.22). So I start by training with a threshold of 0.5 and then search for the threshold that maximises accuracy afterwards. I haven’t tried much else since I am still moving forward with the book, but this strategy worked pretty well. Combined with the learning rate finder, discriminative learning rates and to_fp16(), my bear classifier reached 94% accuracy (using this calculation method, since fastai’s default was giving me unrealistically high accuracies initially). Anyways, best of luck to ya.
