Custom loss function that penalizes false positives for neural networks

I would like to have a loss function that reduces false positives. (I have a very imbalanced data set.)
(The business case is that I’d rather something misclassified as urgent than urgent being mis-classified as not urgent.)
Ideally this loss function would work in keras for neural networks and also sklearn.
Ideally we could also adjust the sensitivity to false positives (i.e. make false positive errors 2x as bad as false negatives, etc…)
Anyone have any experience / ideas on how to do this?
(Or what hyperparams I can tweak on classical ML models to achieve the same effect?)
Thanks in advance!
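To make the question concrete, here's a rough sketch of the kind of thing I mean, in plain NumPy rather than Keras (the function name and weights are just made up for illustration): a binary cross-entropy where the false-positive term gets its own multiplier.

```python
import numpy as np

def weighted_bce(y_true, y_pred, fp_weight=2.0, fn_weight=1.0, eps=1e-7):
    """Binary cross-entropy where false-positive-style errors
    (y_true == 0 but y_pred high) are scaled by fp_weight, and
    false-negative-style errors (y_true == 1 but y_pred low) by fn_weight."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # -[ fn_w * y*log(p)  +  fp_w * (1-y)*log(1-p) ]
    loss = -(fn_weight * y_true * np.log(y_pred)
             + fp_weight * (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()

y_true = np.array([0.0, 0.0, 1.0, 1.0])
y_pred = np.array([0.9, 0.1, 0.9, 0.1])

# The first sample is a confident false positive; doubling fp_weight
# makes it cost twice as much as in regular BCE.
print(weighted_bce(y_true, y_pred, fp_weight=2.0))
print(weighted_bce(y_true, y_pred, fp_weight=1.0))  # regular BCE
```

The same formula could presumably be ported to a Keras custom loss using backend ops, since it's differentiable everywhere the usual cross entropy is.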


I’d like to understand this question better too. This discussion about the planet competition might be interesting to read; they talk about directly optimizing the F2 score rather than the cross entropy.

That said, I’m confused about why the approach in the link is necessary. I would have thought that, assuming you’ve trained with regular cross entropy, you then just need to separately pick a threshold for pulling the trigger and marking something as positive/negative. So, conceptually, during training you’re teaching the network to model the conditional distribution p(c|x), for class c given input x, and then the threshold is a separate decision criterion; the conditional distribution tells you everything you know about what class the input could be, but actually deciding to pull the trigger and mark it as positive or negative depends on how you feel about false positives, etc. I would think that given a good conditional distribution p(c|x), the optimal threshold would be completely determined by your aversion to false positives.
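To spell out that last point: if you assign a cost to each error type, the standard decision-theoretic rule is to predict positive whenever the expected cost of saying "positive" is lower than the expected cost of saying "negative", i.e. when (1 - p) * cost_FP < p * cost_FN, which rearranges to a threshold of cost_FP / (cost_FP + cost_FN). A tiny sketch (assuming a well-calibrated p(c|x)):

```python
def optimal_threshold(cost_fp, cost_fn):
    """Bayes-optimal decision threshold for a calibrated classifier:
    predict positive iff p(positive|x) > cost_fp / (cost_fp + cost_fn).
    Derivation: positive is the cheaper call when
    (1 - p) * cost_fp < p * cost_fn."""
    return cost_fp / (cost_fp + cost_fn)

# If a false positive is twice as costly as a false negative,
# you should demand more confidence before predicting positive:
print(optimal_threshold(2.0, 1.0))  # 2/3 ≈ 0.667
print(optimal_threshold(1.0, 1.0))  # symmetric costs -> 0.5
```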

It seems like this paper derives the optimal threshold for the F1 score. I haven’t read it very carefully yet, but thought I’d go ahead and post my response since I’m curious too :)

I think the class weights in Keras training would do this?
Basically you give it a weight for each class, and the loss for that class is multiplied by the weight, penalizing errors on that class more.
I used it to handle class imbalance, to stop the model from fixating on the majority class, but I assume it would also reduce false positives for a class as you raise its weight (though it will also produce far fewer predictions for that class, especially if the model isn’t very confident).
In my experience it works with a softmax + categorical_crossentropy output; I’m not sure about a binary/sigmoid output.
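For reference, a common way to pick starting weights is the inverse-frequency formula that sklearn uses for class_weight='balanced', namely n_samples / (n_classes * n_class_i). A sketch (the helper name is mine, and the commented Keras line at the end is untested):

```python
from collections import Counter

def balanced_class_weights(labels):
    """Per-class weights a la sklearn's class_weight='balanced':
    weight_i = n_samples / (n_classes * count_i)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = [0] * 90 + [1] * 10          # heavy class imbalance
w = balanced_class_weights(labels)
print(w)                               # -> {0: 0.5555..., 1: 5.0}

# In Keras you would then pass: model.fit(X, y, class_weight=w)
# and could scale w[1] up further to penalize errors on class 1 harder.
```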


Good points. I actually tried implementing a threshold and tuning it as a hyperparameter, but it seems that since I don’t have a lot of data and it’s a binary classifier, the pred_proba (i.e. the softmax activation) is nearly identical for most or all samples: if I set the threshold to 0.7502 I get 20% in class 1 (the majority class), and if I set it to 0.75021 I get 95% in class 1.
So I guess I need a weighted cost function that penalizes a class 2 example (I only have two classes, 1 and 2) being classified as class 1 by a factor/weight of 1.2, etc…
The weight can be tweaked, I think, and might give me finer control than the all-or-nothing effect I see when tweaking the threshold.
Let me know if you have additional thoughts/comments on what to try that might lead to better results.

Do you happen to know how the class_weight value is actually used when calculating the loss? I could not manage to find it. Thx!

I’m not certain, but my assumption is it just multiplies the loss term for each class by that class’s weight. So for categorical cross entropy the loss would be:

L = -sum_i w_i * y_i * log(y_hat_i)

for each class i, where w_i is the class weight, y_i the one-hot target, and y_hat_i the predicted probability. So if all weights are 1 it’s the regular cross entropy loss.
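That formula is easy to sanity-check in NumPy (this is my own reimplementation of the assumed behavior, not the actual Keras source):

```python
import numpy as np

def weighted_categorical_ce(y_true, y_pred, weights, eps=1e-7):
    """L = -sum_i w_i * y_i * log(y_hat_i), averaged over samples.
    y_true: one-hot targets, y_pred: softmax outputs, weights: per-class w_i."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(weights * y_true * np.log(y_pred), axis=-1))

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])
y_pred = np.array([[0.8, 0.2], [0.3, 0.7]])

# With all weights 1 this reduces to regular categorical cross entropy
print(weighted_categorical_ce(y_true, y_pred, np.array([1.0, 1.0])))
# Up-weighting class 1 makes the second sample's error count double
print(weighted_categorical_ce(y_true, y_pred, np.array([1.0, 2.0])))
```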


This is a little hacky, but it’s so simple it was kind of a face-palm moment when I realized it. So I thought I’d share.

I tried various types of models (SVM, LR, etc.) and tweaked class weightings, without any noticeable progress. Modifying loss functions got tricky because you have to worry about differentiability and such.

Finally I realized that if I make duplicate entries for the negative examples, it forces the model to prefer false negatives over false positives (since each FP now counts twice). It’s kludgy as hell, but it has the added benefit that it works for absolutely any kind of model, and for NLP it weights the vocabulary of the critical class more heavily.

You can add n duplicates to tune how paranoid your model is about FPs.
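The duplication trick is just oversampling one class; a tiny helper might look like this (names and the shuffle step are my own choices):

```python
import random

def oversample_negatives(X, y, n_dups=2, negative_label=0):
    """Make each negative example appear n_dups times, so a false
    positive contributes to the loss n_dups times as much as before."""
    X_out, y_out = [], []
    for x, label in zip(X, y):
        reps = n_dups if label == negative_label else 1
        X_out.extend([x] * reps)
        y_out.extend([label] * reps)
    combined = list(zip(X_out, y_out))
    random.shuffle(combined)  # avoid long runs of identical samples
    return [x for x, _ in combined], [lab for _, lab in combined]

X = ["doc_a", "doc_b", "doc_c"]
y = [0, 1, 0]
X2, y2 = oversample_negatives(X, y, n_dups=2)
print(len(X2))  # 5: each of the two negatives appears twice, the positive once
```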