Learner recorder metric differs from sklearn metric

I’m training an image regression model and evaluating it with an r2 score. I noticed that the r2 score on the validation set was much higher than the r2 score on the test set, so I wanted to see if I could reproduce the validation-set metric myself. However, when I make predictions on the validation set using learn.get_preds and compute the r2 score with sklearn, the result is wildly different from the one reported by the recorder.

Functions:

def get_measures(f):
    # Look up the (log-transformed) target measures for the image whose id is in the file name
    image_id = int(f.name.split('.')[0])
    values = train_df_subset.loc[train_df.id == image_id, measures_cols].apply(np.log).values.flatten()
    return np.nan_to_num(values)

def get_image_files_subset(path):
    # Only use the images whose ids appear in train_df_subset
    ids = train_df_subset.id
    return L([(path/f'{image_id}.jpeg') for image_id in ids])

def get_dls(image_size, bs=64):
    # 6-value regression DataBlock; the validation split is defined by val_ids
    dls = DataBlock(
        blocks=(ImageBlock, RegressionBlock(n_out=6)),
        get_items=get_image_files_subset,
        splitter=FuncSplitter(lambda o: int(o.name.split(".")[0]) in val_ids),
        item_tfms=CropPad(image_size),
        get_y=get_measures).dataloaders(image_path, bs=bs)
    return dls


The validation set is a single batch of 64 images. As you can see, the recorder’s output is 0.841, but the sklearn r2_score output is 0.042. Since I’m only using a single batch, why aren’t they the same?
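For reference, this is roughly how I’m computing the sklearn score (a sketch; learn is the trained learner and dls comes from get_dls above):

from sklearn import metrics

# Predictions and targets for the validation set (one batch of 64 images, 6 values each)
preds, valid_vals = learn.get_preds(dl=dls.valid)

# sklearn's score on the same data: ~0.042, vs ~0.841 from the recorder
print(metrics.r2_score(y_true=valid_vals.numpy(), y_pred=preds.numpy()))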

Are you able to share your code in a Kaggle/Colab notebook?

It looks like sklearn’s r2_score and fastai’s R2Score() give the same result if they are given the same predictions and targets (note that I have to pass preds[0] and targs[0] to r2_score, otherwise sklearn treats each (1, 10) tensor as a single sample with 10 outputs and the score isn’t well defined):



Here’s the code for copy/paste if you are interested:

from fastai.vision.all import *
import numpy as np
from sklearn import metrics

for _ in range(10):
  # Random integer predictions and targets with shape (1, 10)
  preds = torch.randint(low=0, high=10, size=(1, 10))
  targs = torch.randint(low=0, high=10, size=(1, 10))
  # sklearn gets the 1-d tensors, otherwise it sees a single sample with 10 outputs
  m1 = metrics.r2_score(y_true=targs[0], y_pred=preds[0])
  # fastai's R2Score can take the (1, 10) tensors directly
  m2 = R2Score()(targs=targs, preds=preds)
  print(m1 == m2)

Looking at the fastai docs, it seems that R2Score() is derived from sklearn’s r2_score.
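If I’m reading the source right, R2Score() returns a fastai AccumMetric that wraps sklearn’s r2_score, which you can check directly (a quick sketch):

from fastai.vision.all import *

m = R2Score()
print(type(m).__name__)  # AccumMetric
print(m.func)            # <function r2_score at ...>, i.e. sklearn's implementation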

Thanks for the thorough answer! Yeah that’s what makes things really confusing. If the scores work the same then I’m definitely misunderstanding how the metric is calculated. I know it’s averaged over batches, which is why I only used one batch, but the scores are still very different. All the images are center-cropped during validation and inference, so I don’t think it’s a transformation issue either. It’s been a real head-scratcher for me.
Here’s the notebook:


Okay, I think I figured out at least how to make them equal. If you follow the source code for R2Score, you’ll eventually come across the line where fastai flattens the targets and the predictions.

If you flatten valid_vals and preds (using .view(-1)) and then pass them to metrics.r2_score, you get the same value as R2Score:
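(a sketch, reusing preds and valid_vals from learn.get_preds on the validation set)

# Flatten the (64, 6) predictions and targets into single 384-element vectors before scoring
flat_score = metrics.r2_score(y_true=valid_vals.view(-1), y_pred=preds.view(-1))

print(flat_score)                                # now matches the recorder's value
print(R2Score()(preds=preds, targs=valid_vals))  # fastai's metric gives the same number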

In NumPy, this is equivalent to:
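(the same sketch with NumPy arrays; .view(-1) on a tensor corresponds to .reshape(-1) here)

# Flatten to 1-d arrays before scoring, mirroring fastai's flattening
metrics.r2_score(y_true=valid_vals.numpy().reshape(-1), y_pred=preds.numpy().reshape(-1))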

So I think one way to summarize this is that fastai’s implementation of R2Score computes the metric over the whole batch at once (lumping all 6 predicted variables together into one long vector), whereas sklearn’s implementation computes R^2 for each of the 6 variables separately and then takes the mean of those 6 values (its default multioutput='uniform_average' behaviour).
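A small self-contained sketch of that difference (random data standing in for the real predictions, 64 rows and 6 target columns):

import numpy as np
from sklearn import metrics

preds = np.random.rand(64, 6)
targs = np.random.rand(64, 6)

# sklearn default: an R^2 per column, then a uniform average over the 6 outputs
per_output_mean = metrics.r2_score(targs, preds, multioutput='uniform_average')

# fastai-style: flatten everything into one long vector and compute a single R^2
single_flat = metrics.r2_score(targs.reshape(-1), preds.reshape(-1))

print(per_output_mean, single_flat)  # the two values generally differ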


Wow, that makes complete sense. Thank you so much!!
