Multi-Label Predictions - Threshold Guidance - Planet Notebook Example

chrisoos · October 16, 2018, 2:57pm

This question references the planet_cv.ipynb notebook. Specifically:

labels = test > 0.2

Ignoring the fact that test is only defined later on in the notebook, the way I read this is that every label with a prediction greater than 0.2 is marked as True and eventually included in the list of predicted labels.
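To make that reading concrete, here is a minimal sketch of the thresholding step. The class names and probabilities are made up for illustration and are not taken from the notebook; `probs` stands in for the notebook's `test` array.

```python
import numpy as np

# Hypothetical class names and predicted probabilities (illustration only).
classes = ["clear", "primary", "water", "road"]
probs = np.array([
    [0.91, 0.85, 0.25, 0.05],
    [0.10, 0.60, 0.19, 0.45],
])

threshold = 0.2
labels = probs > threshold  # boolean mask with the same shape as probs

# Turn each row of the mask into the list of label names it selects.
predicted = [[c for c, keep in zip(classes, row) if keep] for row in labels]
print(predicted)
```

Each row can end up with any number of labels, including none, which is what makes this a multi-label rather than a multi-class prediction.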

My question is: How is the 0.2 threshold determined?

To me this seems lower than what I would have intuitively guessed (i.e. 0.5). Also, changing this value appears to have a significant impact on the labels predicted and on the f2 scores, which can undo a lot of the work put in to fine-tune the model. So it seems important to know how to set this value.
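One likely reason a threshold below 0.5 does well here: F2 is the F-beta score with beta = 2, F = (1 + beta^2) * P * R / (beta^2 * P + R), which weights recall more heavily than precision, so missing a true label costs more than adding a spurious one. The sketch below compares 0.5 and 0.2 on toy data. The `f2_score` helper is a hand-written, sample-averaged F-beta written for this example, not the function from planet.py, and the arrays are invented.

```python
import numpy as np

def f2_score(y_true, y_pred, beta=2.0, eps=1e-9):
    """Sample-averaged F-beta for boolean arrays of shape (n_samples, n_labels)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=1)            # true positives per sample
    precision = tp / (y_pred.sum(axis=1) + eps)   # eps avoids division by zero
    recall = tp / (y_true.sum(axis=1) + eps)
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall + eps)
    return float(f.mean())

# Toy ground truth and predicted probabilities (illustration only).
y_true = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])
probs = np.array([[0.90, 0.30, 0.10, 0.05],
                  [0.10, 0.80, 0.40, 0.25],
                  [0.70, 0.10, 0.15, 0.35]])

score_at_05 = f2_score(y_true, probs > 0.5)
score_at_02 = f2_score(y_true, probs > 0.2)
print(f"F2 at 0.5: {score_at_05:.3f}")
print(f"F2 at 0.2: {score_at_02:.3f}")
```

On this toy data the 0.5 threshold drops one true label per sample, while 0.2 recovers them at the cost of a single false positive, so the lower threshold scores higher.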

While we are on the subject, the f2 function in planet.py also seems to default to thresholds around 0.2, starting at 0.17 and ending at 0.24. I believe it must be related…

So, in writing up this question I think I might have found the answer, but I decided to post anyway. The very next function in the planet.py file is: opt_th

This is never run in the notebook, so I missed it. I believe running opt_th returns the threshold that gives the highest f2 score, and that threshold should then be used when shortlisting the predicted labels.

I suspect this was run early on, outside of the notebook, to land on the threshold of 0.2. It should not be skipped on your own datasets, or even after retraining, as the impact is significant!

archer · December 8, 2018, 7:44am

I have the same question.

chrisoos · December 10, 2018, 6:40pm

From other posts in the forum I discovered that this number is specific to the dataset, and that 0.2 was proposed by other Kaggle entrants.
To find the optimum threshold, you can iterate through candidate values from 0 to 100 percent (or some more sensible window) and pick the one that gives the highest F-score on the validation set.
I’ve also found that this value shifts as the model is trained, so I ended up refining it before preparing a test entry for Kaggle. Also, just out of interest: the threshold has no impact on training, as it is only used to calculate the metrics.
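The search described above can be sketched roughly as follows. This is not the opt_th implementation from planet.py: `f2_score` and `find_best_threshold` are hypothetical helpers written for this example, the candidate window is an arbitrary choice, and the validation data is synthetic.

```python
import numpy as np

def f2_score(y_true, y_pred, beta=2.0, eps=1e-9):
    """Sample-averaged F-beta for boolean arrays of shape (n_samples, n_labels)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = (y_true & y_pred).sum(axis=1)
    precision = tp / (y_pred.sum(axis=1) + eps)   # eps avoids division by zero
    recall = tp / (y_true.sum(axis=1) + eps)
    b2 = beta ** 2
    f = (1 + b2) * precision * recall / (b2 * precision + recall + eps)
    return float(f.mean())

def find_best_threshold(y_true, probs, candidates=None):
    """Return (threshold, score) with the highest F2 among the candidates."""
    if candidates is None:
        candidates = np.linspace(0.05, 0.95, 91)  # steps of 0.01 (assumed window)
    probs = np.asarray(probs)
    scores = [f2_score(y_true, probs > t) for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])

# Synthetic validation set (illustration only): 200 samples, 5 labels.
rng = np.random.default_rng(0)
y_val = rng.random((200, 5)) < 0.3
val_probs = np.clip(0.35 * y_val + 0.6 * rng.random(y_val.shape), 0.0, 1.0)

best_t, best_score = find_best_threshold(y_val, val_probs)
print(f"best threshold: {best_t:.2f}, F2: {best_score:.3f}")
```

Because the best value moves as the model trains, it makes sense to rerun the search on validation predictions after each round of retraining, and then apply the chosen threshold to the test predictions.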