Fmeasure in Keras

Has anyone used fmeasure instead of accuracy for multiclass classification? It seems to work for binary classification, but the values it gives for multiclass do not look accurate.

model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['fmeasure'])

It’s worked for me. As long as you have two or more outputs, it’s the harmonic mean of precision and recall (weighted by beta in the general fbeta case). (For an actual binary output on a single label, the only thing that works is binary_crossentropy, IIRC.)

If you want to see what it’s doing, check out the source code (fmeasure is fbeta_score with beta = 1):

# modified from Keras source code
from keras import backend as K

def fbeta_score(y_true, y_pred, beta=1):
    # counts are summed over every sample and every class in the batch
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))

    p = true_positives / (predicted_positives + K.epsilon())  # precision
    r = true_positives / (possible_positives + K.epsilon())  # recall

    bb = beta ** 2  # beta = 1 gives fmeasure
    return (1 + bb) * (p * r) / (bb * p + r + K.epsilon())
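
To see what this computation does with more than two classes, here is a rough NumPy re-creation of the same arithmetic on a toy 3-class batch. The function name `pooled_fbeta`, the `eps` value and the example data are made up for illustration and are not part of Keras. The point it tries to show is that the sums run over every class at once, so a single pooled score comes out rather than one score per class.

```python
import numpy as np

def pooled_fbeta(y_true, y_pred, beta=1.0, eps=1e-7):
    """NumPy mirror of the snippet above (hypothetical helper, not a Keras API)."""
    # Every sum runs over all samples AND all classes, so counts are pooled.
    true_positives = np.sum(np.round(np.clip(y_true * y_pred, 0, 1)))
    predicted_positives = np.sum(np.round(np.clip(y_pred, 0, 1)))
    possible_positives = np.sum(np.round(np.clip(y_true, 0, 1)))

    p = true_positives / (predicted_positives + eps)  # pooled precision
    r = true_positives / (possible_positives + eps)   # pooled recall
    bb = beta ** 2
    return (1 + bb) * (p * r) / (bb * p + r + eps)

# Toy data: 6 samples, 3 classes, labels skewed towards class 0.
labels = np.array([0, 0, 0, 0, 1, 2])
y_true = np.eye(3)[labels]                      # one-hot targets
# A degenerate "model" that always predicts class 0 with probability 0.9.
y_pred = np.tile([0.9, 0.05, 0.05], (len(labels), 1))

pooled = pooled_fbeta(y_true, y_pred)
print("pooled fmeasure:", round(float(pooled), 3))
```

On this toy batch the pooled score is roughly 0.67 even though classes 1 and 2 are never predicted correctly, which is the kind of gap between the headline number and per-class behaviour that can show up on imbalanced multiclass data.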

Thanks, David, for your response. I had already looked at the source code and at first glance believed it was accurate. In my case fmeasure reports 0.72 for 30 classes, but when I analyzed the confusion matrix, only 4 or 5 classes were doing well; the other 25 were only 30-40% accurate. That is when I realized something is fishy in the multiclass scenario.

@jeremy @rachel Any thoughts on this would be appreciated, as this behaviour with multiclass seems weird.
What would be the best evaluation metric for multiclass NLP classification tasks?
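
One possible way to check whether a single score is hiding weak classes is to compute F1 per class from the confusion matrix and then take the unweighted (macro) average. The sketch below uses plain NumPy; the helper name `per_class_f1` and the toy labels are hypothetical, and it assumes integer class labels and hard (argmax) predictions. It is meant as an illustration, not as a recommendation of one metric over another.

```python
import numpy as np

def per_class_f1(labels, preds, n_classes):
    """Per-class F1 from a confusion matrix (illustrative helper)."""
    # cm[i, j] = number of samples with true class i predicted as class j
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (labels, preds), 1)

    tp = np.diag(cm).astype(float)
    predicted = cm.sum(axis=0)   # column sums: times each class was predicted
    actual = cm.sum(axis=1)      # row sums: times each class actually occurred

    # Classes with no predictions or no samples get a score of 0.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    return np.divide(2 * precision * recall, denom,
                     out=np.zeros_like(tp), where=denom > 0)

# Toy data: a classifier that always answers class 0 on skewed labels.
labels = np.array([0, 0, 0, 0, 1, 2])
preds = np.zeros_like(labels)

f1 = per_class_f1(labels, preds, n_classes=3)
macro_f1 = f1.mean()                   # every class counts equally
accuracy = np.mean(labels == preds)    # dominated by the frequent class

print("per-class F1:", np.round(f1, 3))
print("macro F1:", round(float(macro_f1), 3), "accuracy:", round(float(accuracy), 3))
```

Here accuracy is about 0.67 while macro F1 is about 0.27, because the two classes that are never predicted each contribute a zero to the average. Looking at the per-class vector directly shows which classes are responsible.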