Efficiently passing data to test_dl

abhivij · August 14, 2024, 10:13am

Hi all,

As part of the Kaggle ISIC 2024 Skin Cancer Detection competition, I’m trying to create a test dataloader from a trained learner, ‘learn_imp’.
The input images are stored in an HDF5 file.
To pass the test set images as a list, I use the following:

test_dls = learn_imp.dls.test_dl([np.array(Image.open(BytesIO(test_data[isic_id][()]))) for isic_id in test_meta_data.isic_id])

and then

preds, _ = learn_imp.get_preds(dl = test_dls)

This works, but it uses too much memory, because every decoded image is held in the list at once.
(I’m also not sure whether these images get resized to 256, as I specified in the dataloader definition given at the end of this post.)

So instead, I’m trying to pass just the indices and load each image inside a function/transform.
This is what I currently have:

def get_image_by_id(data, isic_id):
    try:
        image = np.array(Image.open(BytesIO(data[isic_id][()])))
        return image
    except Exception as e:
        print(f"Error loading image with ID {isic_id}: {e}")
        return None

def get_test_item(i):
    return get_image_by_id(test_data, test_meta_data.loc[i, 'isic_id'])

test_dls = learn_imp.dls.test_dl(test_meta_data.index.tolist(), create_item = get_test_item, after_item = Resize(256, method='squish'))

However, I get the following error:

RuntimeError: Error when trying to collate the data into batches with fa_collate, at least two tensors in the batch are not the same size.

Mismatch found on axis 0 of the batch and is of type `ndarray`:
	Item at index 0 has shape: (141, 3)
	Item at index 1 has shape: (125, 3)

Please include a transform in `after_item` that ensures all data of type ndarray is the same size

Could someone please guide me on the correct parameters to pass to test_dl()?
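One possible reading of the error, offered as a guess rather than a confirmed diagnosis: overriding create_item makes each item a raw ndarray, and as far as I can tell fastai's Resize only acts on PIL images (plus bounding boxes and points), so the arrays reach the collate step at their original sizes. Passing after_item also appears to replace the default item pipeline rather than extend it. The sketch below reproduces only the collation problem with plain NumPy. The image sizes (141 and 125 pixels square) are an assumption inferred from the shapes in the error, and squish_resize is a hypothetical nearest-neighbour stand-in, not fastai's Resize.

```python
import numpy as np

def squish_resize(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbour resize to (size, size), ignoring aspect ratio.

    Hypothetical stand-in for a real resize transform, for illustration only.
    """
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size   # source row for each output row
    cols = np.arange(size) * w // size   # source column for each output column
    return img[rows][:, cols]

# Assumed sizes: two square RGB images of different side length.
a = np.zeros((141, 141, 3), dtype=np.uint8)
b = np.zeros((125, 125, 3), dtype=np.uint8)

# Batching needs identical shapes, so stacking the raw arrays fails.
try:
    np.stack([a, b])
    stacked_raw = True
except ValueError as err:
    stacked_raw = False
    print(f"cannot stack raw arrays: {err}")

# Once every item has the same size, the batch can be built.
batch = np.stack([squish_resize(a, 256), squish_resize(b, 256)])
print(batch.shape)
```

If this reading is right, the items need to reach the resize step as PIL images rather than ndarrays, but I have not verified that against this learner.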

In case it’s helpful, I used the code below to create the training dataloaders:


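from fastai.vision.all import *
import h5py
import numpy as np
from io import BytesIO
from PIL import Image

train_data = h5py.File("/kaggle/input/isic-2024-challenge/train-image.hdf5", "r")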

def get_image_by_id(data, isic_id):
    try:
        image = np.array(Image.open(BytesIO(data[isic_id][()])))
        return image
    except Exception as e:
        print(f"Error loading image with ID {isic_id}: {e}")
        return None
    
def get_items(meta_data):
    return meta_data.index.tolist()

def get_x_train(i):
    return get_image_by_id(train_data, train_meta_data.loc[i, 'isic_id'])

def get_y_train(i):
    return train_meta_data.loc[i, 'target']

dls = DataBlock(
    blocks=(ImageBlock,CategoryBlock),
    get_items=get_items,
    get_x=get_x_train,
    get_y=get_y_train,
    splitter=RandomSplitter(valid_pct=0.2,seed=0),
    item_tfms=[Resize(256, method='squish')],
    batch_tfms=aug_transforms()
).dataloaders(train_meta_data,bs=256)

abhivij · August 14, 2024, 6:11pm

I found a workaround that doesn’t need any extra fastai arguments to test_dl: I run predictions on one subset of the test data at a time. That works for now!

test_dls = learn_imp.dls.test_dl([np.array(Image.open(BytesIO(test_data[isic_id][()]))) for isic_id in test_meta_data.isic_id[start:end]])
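The subset approach above can be sketched as a loop over (start, end) pairs, so that only one chunk of decoded images is in memory at a time. The helper name chunk_bounds and the chunk size are hypothetical. The commented-out part reuses the names from the post and assumes the per-chunk predictions can be concatenated with torch.cat; it is not run here.

```python
from typing import Iterator, Tuple

def chunk_bounds(n_items: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)

# Small demonstration with made-up numbers.
bounds = list(chunk_bounds(10, 4))
print(bounds)

# Hypothetical use with the objects from the post (chunk size is arbitrary):
# all_preds = []
# for start, end in chunk_bounds(len(test_meta_data), 10_000):
#     ids = test_meta_data.isic_id[start:end]
#     images = [np.array(Image.open(BytesIO(test_data[i][()]))) for i in ids]
#     preds, _ = learn_imp.get_preds(dl=learn_imp.dls.test_dl(images))
#     all_preds.append(preds)
# preds = torch.cat(all_preds)
```

A smaller chunk size lowers peak memory at the cost of more calls to get_preds.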