Metrics on one-hot values

EDIT: I can't delete the post, but while editing it I actually ended up writing the solution to my own problem. There is actually nothing wrong with anything below (I think).

Hello everyone,
So I need your help with something that most of you will probably find really simple.
Basically, I'm working on a multi-label classification project and I'm struggling to choose the right metric; any pointers would be much appreciated.

The project is basically predicting the labels of image patches, pretty much like the Amazon forest competition, where a given patch can have multiple labels, and I'm trying to find the best metric to evaluate the model's performance. So far I opted for the F2 score, like in the forest competition, because I deeply care about false negatives. But actually, the F2 score is not really the right metric for the job; read more on that below.
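
For context, the F-beta score is (1 + beta^2) * precision * recall / (beta^2 * precision + recall), so with beta=2 recall counts roughly four times as much as precision, which is why it punishes false negatives harder. Here is a minimal sketch of that formula next to sklearn's fbeta_score, on made-up toy arrays (micro-averaged just to keep the comparison simple):

import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

def fbeta_from_pr(precision, recall, beta=2):
    # F-beta weights recall beta^2 times more than precision
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# Toy multi-label example (hypothetical values): 3 samples x 4 labels
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],   # misses label 2 -> one false negative
                   [0, 1, 0, 1],   # extra label 3  -> one false positive
                   [1, 1, 0, 1]])

p = precision_score(y_true, y_pred, average='micro')
r = recall_score(y_true, y_pred, average='micro')
print(fbeta_from_pr(p, r, beta=2))                           # manual F2
print(fbeta_score(y_true, y_pred, beta=2, average='micro'))  # sklearn F2, same value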

Below is my naive search for the right threshold:

import copy
import numpy as np
from sklearn.metrics import fbeta_score

def get_thresholded_predictions(y_oh, threshold=0.5):
    # Binarize predicted probabilities (work on a copy so the raw probabilities stay intact)
    y = copy.copy(y_oh)
    y[y >= threshold] = 1
    y[y != 1] = 0
    return y

def find_naive_threshold_on_fbeta_score(y_true_oh, y_pred_oh, beta=2):
    # Grid-search 100 evenly spaced thresholds between the lowest and highest
    # predicted probability and keep the one that maximizes the sample-averaged F-beta.
    fb_list = []
    threshold_range = np.linspace(y_pred_oh.min(), y_pred_oh.max(), 100).tolist()
    for thres in threshold_range:
        y_pred = get_thresholded_predictions(y_pred_oh, threshold=thres)
        fb_list.append(fbeta_score(y_true_oh, y_pred, beta=beta, average='samples'))

    return threshold_range[int(np.argmax(fb_list))]

So far so good (kinda), but now the issue is that I have a ton of entries with zero classes, like a lot of them (removing them is not an option, as I can't do that at test time when I have no labels)… And inevitably that makes my F2 score very high when the threshold is very high.

For instance, with a threshold of 0.9 almost everything gets predicted as 0, but since 80% of my y_true_oh rows are all zeros as well… in the end that matches 80% of my predictions, making my F2 score very high.
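
Here is a tiny reproduction of that setup on toy data, reusing the functions above (the shapes and proportions are made up, just to show the mechanics):

# Toy data where 80% of the rows have no label at all (hypothetical numbers)
rng = np.random.default_rng(42)
y_true_oh = (rng.random((1000, 10)) > 0.9).astype(int)
y_true_oh[:800] = 0                                   # 80% of samples have zero classes
y_pred_oh = np.clip(0.7 * y_true_oh + 0.4 * rng.random((1000, 10)), 0, 1)

# Sweep a few thresholds and watch what happens to the sample-averaged F2
for thres in (0.3, 0.5, 0.9):
    y_pred = get_thresholded_predictions(y_pred_oh, threshold=thres)
    print(thres, fbeta_score(y_true_oh, y_pred, beta=2, average='samples'))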

Do you have any idea how to overcome this problem? Should I be using another metric, or is there something I'm missing? Thanks a lot for your help!