Why is my train/val split 25% and not the default 20%?

krullmizter · April 24, 2023, 9:12am

Hi,

I’m using the DataLoaders class to load and transform my data before training. However, when I log some stats on the different datasets, I find that the validation dataset comes out at 25% of the training dataset and not the default 20%. I’ve tried passing valid_pct=0.2, but I get the same percentage. If I pass valid_pct=0.1, the percentage is 11%, not 10%. Any ideas?

dls = ImageDataLoaders.from_csv(
    path=dataset_dir,
    folder='train',
    test='test',
    suff='.jpg',
    size=sz,
    bs=bs,
    item_tfms=item_tfms, 
    batch_tfms=batch_tfms
)

train_len = len(dls.train_ds)
val_len   = len(dls.valid_ds)
test_len  = len(os.listdir(test_dir))

val_pct = round((val_len/train_len * 100))

print(f'Amount of images in each dataset\nTotal: { (train_len + val_len) + test_len }.\n')
print(f'Train: {train_len}\nValidation: {val_len} ({val_pct}% of train) \nTest: {test_len}')

lucasvw · April 24, 2023, 10:40am

Hi @krullmizter,

You should probably do “val_len / (val_len + train_len)” instead of “val_len / train_len”. The former is the validation percentage with respect to all your data; the latter is the validation percentage with respect to your training set only.

Let’s say you have 5 items, 4 in training and 1 in validation: the former will give you 1 / 5 = 0.2, the latter will give you 1 / 4 = 0.25.
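
To make the difference concrete, here is a minimal sketch that computes both ratios side by side. The helper name `split_percentages` and the dataset sizes are made up for illustration (they assume 1000 items split with valid_pct=0.2 and valid_pct=0.1); nothing here comes from fastai itself.

```python
def split_percentages(train_len: int, val_len: int) -> tuple[int, int]:
    """Return (validation % of all data, validation % of the training set), rounded."""
    total = train_len + val_len
    pct_of_total = round(val_len / total * 100)      # what valid_pct refers to
    pct_of_train = round(val_len / train_len * 100)  # what the original code computed
    return pct_of_total, pct_of_train


# Hypothetical sizes: 1000 items split 800/200 and 900/100.
print(split_percentages(800, 200))  # (20, 25)
print(split_percentages(900, 100))  # (10, 11)
```

This would explain both numbers in the question: dividing by the training set size alone turns a 20% split into 25% and a 10% split into 11%.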

krullmizter · April 24, 2023, 10:53am

Ahh great, I must have stared at my code for too long, thx for the fast and easy answer!