Cross validation code help

nadl · February 8, 2024, 11:02pm

Hi,
I am trying cross validation on my dataset, following this example: "Am I doing k-fold cross validation right?". But I get this warning:

UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.

So I think that not all the labels are shown to the model. I am not sure whether my cross validation code is correct. Could anyone help me correct it?
Thank you in advance.

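from fastai.vision.all import *
from sklearn.model_selection import StratifiedKFold, train_test_split

train_df, test_df = train_test_split(df, train_size=0.8, test_size=0.2)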

X = train_df['filename'].to_numpy()
y = train_df['label'].to_numpy()

def get_x(train_df): return train_df['filename']
def get_y(train_df): return train_df['label']

def get_test_x(test_df): return test_df['filename']
def get_test_y(test_df): return test_df['label']

folds = 5
skf = StratifiedKFold(n_splits=folds, shuffle=True)

batch_size = 16

# Per-fold results from learn.validate(), collected in the loop below
val_pct, test_pct = [], []

for train_index, val_index in skf.split(train_df.index, y):
    
    train_block = DataBlock(
            blocks=(ImageBlock, CategoryBlock),
            get_x=get_x,
            get_y=get_y,
            splitter=IndexSplitter(val_index), # added val_index,
            item_tfms=[Resize((600,1000), method = ResizeMethod.Pad, pad_mode='border')],
            batch_tfms=[Normalize.from_stats(*imagenet_stats)]
        )
    test_block = DataBlock(
            blocks=(ImageBlock, CategoryBlock),
            get_x=get_test_x,
            get_y=get_test_y,
            item_tfms=[Resize((600,1000), method = ResizeMethod.Pad, pad_mode='border')],
            batch_tfms=[Normalize.from_stats(*imagenet_stats)]
        )
    
    train_dl = train_block.dataloaders(train_df, bs=batch_size)
    test_dl = test_block.dataloaders(test_df, bs=batch_size)
    learn = Learner(train_dl, xresnet34(n_out=24), model_dir=model_dir, metrics=[accuracy, Precision(average='weighted'), Recall(average='weighted')])
    learn.model.cuda()
    learn.fit_one_cycle(5)
    
    val = learn.validate()
    learn.dls.valid = test_dl.valid
    test = learn.validate()

    print('done, appending results.. \n')
    val_pct.append(val)
    test_pct.append(test)

Conwyn · February 10, 2024, 9:12am

Hi Nadl

This is my GUESS. Suppose you have multiple labels L1 to Ln. Cross validation takes the whole dataset and creates 5 train/test splits, so potentially one of the 5 test sets may not contain any examples of some label x. It might be worth counting the frequency of each label in the whole dataset.
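Something like the following could do that counting, per validation fold rather than only for the whole dataset. It is only a rough sketch: `label_counts_per_fold` and `toy_labels` are names I made up for illustration, and the toy list stands in for your `train_df['label']`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def label_counts_per_fold(labels, n_splits=5, seed=0):
    """Return a table with one row per label and one column per validation fold."""
    labels = np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    columns = {}
    # Only the labels matter for stratification, so a dummy X is enough here.
    for fold, (_, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
        columns[f"val_fold_{fold}"] = pd.Series(labels[val_idx]).value_counts()
    # A label missing from a fold becomes NaN when the columns are aligned; show it as 0.
    return pd.DataFrame(columns).fillna(0).astype(int).sort_index()


# Hypothetical toy labels; replace with train_df['label'] on the real data.
toy_labels = ["a"] * 10 + ["b"] * 5 + ["c"] * 2
counts = label_counts_per_fold(toy_labels)
print(counts)
```

Any label with fewer examples than the number of folds (like "c" here) has to be missing from some validation folds, and scikit-learn prints its own warning about that when splitting. Zeros in the table would point to the folds where that happens.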

Please note this is my GUESS. I might be completely wrong.
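One way to check instead of guessing is to reproduce the warning on a tiny example. As I read the message, "no predicted samples" refers to labels the model never predicted, which can happen even when the label is present in the validation data. The arrays below are made up for illustration, and I am assuming the fastai `Precision` metric passes its predictions on to scikit-learn's `precision_score`.

```python
import warnings

import numpy as np
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

# Made-up example: label 2 exists in the targets but is never predicted.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    p = precision_score(y_true, y_pred, average="weighted")

# True if the same warning as in the original post was raised.
warned = any(issubclass(w.category, UndefinedMetricWarning) for w in caught)

# Labels that appear in the targets but never in the predictions.
never_predicted = sorted(set(y_true.tolist()) - set(y_pred.tolist()))

# zero_division=0 keeps the same score but silences the warning.
p_silent = precision_score(y_true, y_pred, average="weighted", zero_division=0)

print(warned, never_predicted, p, p_silent)
```

If `never_predicted` is non-empty on the real validation predictions, the model is simply not predicting those labels yet (for example after only a few epochs, or with very rare classes), and that would be separate from whether the folds are built correctly.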

Regards Conwyn