fastai metrics vs manual metrics

I am training a segmentation model with fastai. During training, fastai reports a Dice score of around 0.92.

I am using medical images, so I want to split the validation data by patient. I do this with FuncSplitter, checking whether the patient number (which is in the filename) is in a list.

I then manually assessed the Dice score of the model on the same validation data. I copied the images into a separate directory, created a test dataloader for each image, one at a time, and ran inference with the model. I then computed the Dice score with the following code:

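import numpy as np

def dice(gt, pred):
    # Both masks empty: count as a perfect match instead of dividing by zero.
    if (np.sum(gt) == 0) and (np.sum(pred) == 0):
        return 1
    intersection = np.sum(gt * pred)  # pixels that are foreground in both masks
    return (2 * intersection) / (np.sum(gt) + np.sum(pred))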

The mean Dice score was around 0.84, noticeably lower than what fastai prints.

I am trying to work out why there is a difference. Is my Dice code incorrect? Is fastai using the train or valid set for metrics? Could fastai be reporting the median (or some other statistic) rather than the mean? Is it because I run inference one image at a time (batch size 1) when the model was trained with a batch size of 16? (I don’t think this should be an issue.) I do it one by one to keep track of filenames, because I save the masks and then build an ensemble of segmentation models by averaging the masks. Or is there a problem with how I run inference and save the images, as shown below?

import re
import numpy as np
from PIL import Image
from fastai.vision.all import *

fnames = sorted(get_image_files('./valid'))

regex = r'[ \w-]+?(?=\.)'  # filename without its extension
models = [unet, deeplabv3, hrnet]  # Learner objects
names = ['unet', 'deeplabv3', 'hrnet']

for fname in fnames:
    dl = dls.test_dl([fname])
    filename = re.search(regex, str(fname))[0]
    for model, name in zip(models, names):
        preds = model.get_preds(dl=dl)
        # Per-pixel class index (0 = background, 1 = prostate)
        pred_arx = preds[0][0].argmax(dim=0).numpy()
        # Save the class indices directly as an 8-bit mask
        im = Image.fromarray(pred_arx.astype(np.uint8))
        im.save(f'preds_{name}/{filename}_pred_{name}.png')
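
The ensembling step mentioned above is not shown in the post. As a rough sketch only, assuming each saved mask is a 0/1 array of the same shape and that "averaging the masks" means a per-pixel mean across models followed by a 0.5 threshold (both are assumptions; `ensemble_masks` and the threshold value are hypothetical, not from the post), it could look like this:

```python
import numpy as np

def ensemble_masks(masks, threshold=0.5):
    """Average binary masks per pixel and threshold the mean (hypothetical helper)."""
    stacked = np.stack([np.asarray(m, dtype=np.float32) for m in masks])
    mean_mask = stacked.mean(axis=0)  # fraction of models voting foreground
    return (mean_mask >= threshold).astype(np.uint8)

# Toy 2x2 masks standing in for the three models' saved predictions
unet_mask = np.array([[1, 1], [0, 0]])
deeplab_mask = np.array([[1, 0], [0, 0]])
hrnet_mask = np.array([[1, 1], [1, 0]])

ensembled = ensemble_masks([unet_mask, deeplab_mask, hrnet_mask])
print(ensembled)
```

With three models and a 0.5 threshold this amounts to a per-pixel majority vote.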

Thank you.

How did you set up your DataLoader / data block? I suspect there is some Transform that you’re not applying.

def func(o):
    # True (validation set) if the 4-digit patient number in the path is in cross_val
    matches = re.search(r'\d{4}', str(o))
    return int(matches.group(0)) in cross_val
    
codes = np.array(['background', 'prostate'])

def label_func(x): return path/'Cuocolo_masks'/f'{x.stem}_mask.png'

def get_dls(bs, size):
    # size == 0 keeps the original image size (no Resize)
    db = DataBlock(blocks = (ImageBlock(), MaskBlock(codes)),
                   splitter = FuncSplitter(func),
                   get_items = get_image_files,
                   get_y = label_func,
                   item_tfms = None if size == 0 else Resize(size),
                   batch_tfms = [*aug_transforms(), Normalize.from_stats(*imagenet_stats)])
    return db.dataloaders('./Cuocolo_2-5D', bs=bs)

dls = get_dls(16, 0)

cross_val is a list of randomly chosen integers corresponding to the patient numbers in the validation set.
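
To see how a splitter function like this decides membership, here is a minimal standalone sketch. The `cross_val` values and filenames are made up for illustration, and `is_valid` is a hypothetical stand-in for `func` that searches only the filename rather than the full path (an assumption, so that digits in a directory name cannot match).

```python
import re
from pathlib import Path

cross_val = [12, 345]  # hypothetical validation patient numbers

def is_valid(o):
    """Return True if the 4-digit patient number in the filename is in cross_val."""
    match = re.search(r'\d{4}', Path(o).name)
    return match is not None and int(match.group(0)) in cross_val

# Made-up filenames following a "<patient>_<slice>.png" pattern (an assumption)
files = [
    'Cuocolo_2-5D/0012_03.png',
    'Cuocolo_2-5D/0345_10.png',
    'Cuocolo_2-5D/0007_01.png',
]
split = {f: is_valid(f) for f in files}
print(split)
```

Because `int('0012')` drops the leading zeros, `cross_val` has to hold integers rather than zero-padded strings for the membership test to work.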

Edit: But shouldn’t the transforms be applied automatically? The test dataloader is derived from the original dataloaders (dl = dls.test_dl([i])):

Create a test dataloader from test_items using validation transforms of dls

as quoted from the fastai docs (Data core | fastai).

If I were you I would come up with a bunch of tensors of the right shape, and put them through your dice function and the fastai dice function, and make sure you get the same answer. Once that’s done you’ve eliminated one possibility of difference.
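
The part of this check that concerns the manual function can be done with plain NumPy on masks small enough to verify by hand. The sketch below uses made-up 2x2 masks and the `dice` function from the question; running the same inputs through fastai's own metric is not shown here.

```python
import numpy as np

def dice(gt, pred):
    if (np.sum(gt) == 0) and (np.sum(pred) == 0):
        return 1
    intersection = np.sum(gt * pred)
    return (2 * intersection) / (np.sum(gt) + np.sum(pred))

gt = np.array([[1, 1], [0, 0]])
partial = np.array([[1, 0], [0, 0]])  # overlaps gt in one pixel
miss = np.array([[0, 0], [1, 1]])     # no overlap with gt
empty = np.zeros((2, 2), dtype=int)

checks = {
    'identical': dice(gt, gt),          # 2*2 / (2+2)
    'partial': dice(gt, partial),       # 2*1 / (2+1)
    'disjoint': dice(gt, miss),         # 2*0 / (2+2)
    'both_empty': dice(empty, empty),   # special-cased in the function
}
print(checks)
```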

Just checked, they both return the same scores.

I removed aug_transforms and Normalize.from_stats from the batch transforms in the DataBlock. This slightly reduced the fastai Dice score (0.918 to 0.917, basically the same) and slightly increased my own Dice score (0.832 to 0.847).

Fixed this.
fastai aggregates metrics over batches and then reports a final value, rather than averaging per-image scores.
For my metric, I should have combined all images for one patient into one array and then calculated the Dice score. Once I did this, the two values were almost equal: 0.91 (mine) vs 0.92 (fastai).
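
The effect can be reproduced with toy data. In the sketch below, small or empty slices pull the per-image mean down, while pooling all of a patient's slices into one array weights every pixel equally. The patient ID, mask sizes, and helper names (`mean_per_image_dice`, `per_patient_dice`) are made up for illustration; this shows the arithmetic only and is not fastai's internal implementation.

```python
from collections import defaultdict
import numpy as np

def dice(gt, pred):
    if (np.sum(gt) == 0) and (np.sum(pred) == 0):
        return 1
    intersection = np.sum(gt * pred)
    return (2 * intersection) / (np.sum(gt) + np.sum(pred))

def mean_per_image_dice(records):
    """Mean of the Dice scores computed on each slice separately."""
    return float(np.mean([dice(gt, pred) for _, gt, pred in records]))

def per_patient_dice(records):
    """Dice per patient after stacking all of that patient's slices into one array."""
    grouped = defaultdict(list)
    for patient_id, gt, pred in records:
        grouped[patient_id].append((gt, pred))
    scores = {}
    for patient_id, pairs in grouped.items():
        gts = np.stack([g for g, _ in pairs])
        preds = np.stack([p for _, p in pairs])
        scores[patient_id] = float(dice(gts, preds))
    return scores

def blank():
    return np.zeros((10, 10), dtype=np.uint8)

# Slice 1: large structure, segmented well (36 px truth, 30 px predicted)
gt1, pred1 = blank(), blank()
gt1[2:8, 2:8] = 1
pred1[2:8, 2:7] = 1
# Slice 2: small structure, half of it missed (4 px truth, 2 px predicted)
gt2, pred2 = blank(), blank()
gt2[4:6, 4:6] = 1
pred2[4:5, 4:6] = 1
# Slice 3: empty ground truth with a single false-positive pixel
gt3, pred3 = blank(), blank()
pred3[0, 0] = 1

records = [(1, gt1, pred1), (1, gt2, pred2), (1, gt3, pred3)]
per_image = mean_per_image_dice(records)  # (60/66 + 4/6 + 0) / 3, about 0.525
per_patient = per_patient_dice(records)   # {1: 64/73}, about 0.877
print(f'per-image mean: {per_image:.3f}')
print(f'per-patient:    {per_patient[1]:.3f}')
```

The same predictions score about 0.53 when averaged per slice and about 0.88 when pooled per patient, which is the direction of the gap seen in the thread.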