I’m training an image regression model and evaluating it with an R² score. I noticed that the R² score on the validation set was much higher than on the test set, so I wanted to see if I could reproduce the validation metric myself. However, when I make predictions on the validation set using learn.get_preds, the R² score is wildly different from the one reported by the recorder.
Functions:

```python
def get_measures(f):
    image_id = int(f.name.split('.')[0])
    values = train_df_subset.loc[train_df.id == image_id, measures_cols].apply(np.log).values.flatten()
    return np.nan_to_num(values)

def get_image_files_subset(path):
    ids = train_df_subset.id
    return L([(path/f'{image_id}.jpeg') for image_id in ids])

def get_dls(image_size, bs=64):
    dls = DataBlock(
        blocks=(ImageBlock, RegressionBlock(n_out=6)),
        get_items=get_image_files_subset,
        splitter=FuncSplitter(lambda o: int(o.name.split(".")[0]) in val_ids),
        item_tfms=CropPad(image_size),
        get_y=get_measures).dataloaders(image_path, bs=bs)
    return dls
```
The validation set is a single batch of 64 images. As you can see, the recorder outputs 0.841, but the sklearn r2_score output is 0.042. Since I’m only using a single batch, why aren’t they the same?
Are you able to share your code in a Kaggle/Colab notebook?
It looks like sklearn.r2_score and fastai’s R2Score() give the same result if they are given the same predictions and targets (note that I have to pass preds[0] and targs[0] to r2_score, otherwise it doesn’t compute correctly):
Here’s the code for copy/paste if you are interested:
```python
from fastai.vision.all import *
import numpy as np
from sklearn import metrics

for _ in range(10):
    preds = torch.randint(low=0, high=10, size=(1, 10))
    targs = torch.randint(low=0, high=10, size=(1, 10))
    m1 = metrics.r2_score(y_true=targs[0], y_pred=preds[0])
    m2 = R2Score()(targs=targs, preds=preds)
    print(m1 == m2)
```
Looking at the fastai docs, it seems like R2Score() is derived from sklearn.r2_score.
Thanks for the thorough answer! Yeah that’s what makes things really confusing. If the scores work the same then I’m definitely misunderstanding how the metric is calculated. I know it’s averaged over batches, which is why I only used one batch, but the scores are still very different. All the images are center-cropped during validation and inference, so I don’t think it’s a transformation issue either. It’s been a real head-scratcher for me.
Here’s the notebook:
Okay, I think I figured out at least how to make them equal: if you follow the source code for R2Score, you’ll eventually come across the line where fastai flattens the targets and the predictions. If you flatten valid_vals and preds (using .view(-1)) and then pass them to metrics.r2_score, you get the same value as R2Score.
In NumPy, this flattening is equivalent to calling np.ravel on the targets and predictions before computing the score.
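To see what the flattening actually changes, here’s a hand-rolled R² in plain NumPy (a sketch with synthetic data; r2 is a hypothetical helper, not a library function) comparing one score over the raveled arrays against the per-column average:

```python
import numpy as np

def r2(y_true, y_pred):
    """Plain R^2 = 1 - SS_res / SS_tot over the flattened arrays."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
targs = rng.normal(size=(64, 6)) + 10 * np.arange(6)  # 6 variables, different scales
preds = targs + 0.1 * rng.normal(size=(64, 6))

# fastai-style: one R^2 over all 384 flattened values
flat_r2 = r2(targs, preds)

# sklearn-style default: R^2 per column, then the uniform mean
per_col_r2 = np.mean([r2(targs[:, j], preds[:, j]) for j in range(6)])

print(flat_r2, per_col_r2)
```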
So I think one way to summarize this is that fastai’s implementation of R2Score computes the metric across the entire batch (lumping together all 6 predicted variables), whereas sklearn’s implementation computes the metric for each of the 6 variables in the batch and then, by default, takes the mean across those 6 R² values.
Wow that makes complete sense thank you so much!!