Fmeasure in Keras

Has anyone used fmeasure instead of accuracy for multiclass classification? It seems to work for binary classification, but not accurately for multiclass.

model.compile(loss='categorical_crossentropy', optimizer=Adam(), metrics=['fmeasure'])

It’s worked for me. As long as you have two or more outputs, it’s the weighted harmonic mean of precision and recall. (For actual binary output on a single label, the only thing that works is binary_crossentropy, IIRC.)

If you want to see what it’s doing, check out the source code (fmeasure is fbeta_score with beta = 1):


# modified from Keras source code (fmeasure is fbeta_score with beta = 1)
from keras import backend as K

def fmeasure(y_true, y_pred):
    # counts are pooled over every class and every sample in the batch,
    # so this is one global (micro-averaged) score, not a per-class one
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))

    # precision and recall; K.epsilon() avoids division by zero
    p = true_positives / (predicted_positives + K.epsilon())
    r = true_positives / (possible_positives + K.epsilon())

    beta = 1  # beta = 1 turns fbeta_score into the plain F1 / fmeasure
    bb = beta ** 2

    return (1 + bb) * (p * r) / (bb * p + r + K.epsilon())
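
For a quick sanity check of what that pooling does, here’s a tiny made-up batch (the labels and probabilities are invented for illustration, and it assumes the snippet above is wrapped in a function fmeasure(y_true, y_pred) as shown). Note that K.round thresholds each probability at 0.5 rather than taking an argmax, so the middle sample below counts as a false positive for class 0 and a false negative for class 1:

import numpy as np
from keras import backend as K

y_true = K.variable(np.array([[1., 0., 0.],
                              [0., 1., 0.],
                              [0., 0., 1.]]))
y_pred = K.variable(np.array([[0.9, 0.05, 0.05],
                              [0.6, 0.3, 0.1],   # rounds to class 0, true class is 1
                              [0.1, 0.2, 0.7]]))

# p = r = 2/3 on this batch, so the pooled score is ~0.667
print(K.eval(fmeasure(y_true, y_pred)))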

Thanks, David, for your response. I had already looked at the source code and at first glance believed it was accurate. In my case fmeasure gives 0.72 over 30 classes, but when I analyzed the confusion matrix, apart from 4 or 5 classes the other 25 were only 30-40% accurate. That’s when I realized something is fishy in the multiclass scenario.
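
For illustration, here is roughly the effect I mean, with made-up labels and scikit-learn standing in for the per-class view: a globally pooled (micro) score is dominated by the big classes, while a per-class (macro) average exposes the ones that fail:

import numpy as np
from sklearn.metrics import f1_score, classification_report

y_true = np.array([0, 1, 2, 0, 0, 0])  # class 0 dominates the data
y_pred = np.array([0, 0, 2, 0, 0, 0])  # class 1 is never predicted

print(f1_score(y_true, y_pred, average='micro'))  # ~0.83, looks fine
print(f1_score(y_true, y_pred, average='macro'))  # ~0.63, class 1 scores 0
print(classification_report(y_true, y_pred))      # per-class breakdown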

@jeremy @rachel Any thoughts on this are appreciated, as this behaviour with multiclass seems weird.
What would be the best evaluation metric for multiclass NLP classification tasks?