Problem loading my partitioned dataset into a datablock

Hello everyone! I would appreciate some help with a simple problem I am stuck on. I am working with the MedMNISTv2 dataset (URL), which is downloaded in .npz format. The people who created the dataset also provide a PyTorch library to interact with it.

I have downloaded the data with the following lines:

DataClass = getattr(mydataset, info['python_class'])

train_dataset = DataClass(split='train', transform=data_transform, download=download)
test_dataset = DataClass(split='test', transform=data_transform, download=download)

So now I would like to load my data into a dataloader to train a model. However, unlike in previous fastai versions, I cannot find a way to specify that I already have a training partition and a validation partition. I don't need a splitter.

For datasets that are not partitioned, where the images are located in their respective folders, I normally use:

ultrasound = DataBlock(blocks=(ImageBlock, CategoryBlock),
                       get_items=get_image_files,
                       splitter=RandomSplitter(valid_pct=0.2, seed=42),
                       get_y=parent_label,
                       batch_tfms=aug_transforms(min_scale=0.75, do_flip=True, flip_vert=False, max_warp=0.1, max_rotate=5))

TL;DR: I cannot find a way to specify my partitions directly in the DataBlock API. Any suggestions? Thank you in advance.

Would ImageDataLoaders.from_path work for your situation?

Another thought is creating two individual DataLoader objects from your Datasets, one for the training set (dl) and one for the validation set (valid_dl), and then combining them:

dls = DataLoaders(dl, valid_dl)

As they've done in the MNIST example notebook.
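Something along these lines should work (a minimal sketch, not the notebook's exact code; it assumes the train_dataset / test_dataset objects from your first post, and the batch size is just an example):

from torch.utils.data import DataLoader
from fastai.data.core import DataLoaders

bs = 64  # example batch size, pick whatever fits your memory

# plain PyTorch DataLoaders over the already-split datasets
dl = DataLoader(train_dataset, batch_size=bs, shuffle=True)
valid_dl = DataLoader(test_dataset, batch_size=bs, shuffle=False)

# wrap both in fastai's DataLoaders container so a Learner can use them
dls = DataLoaders(dl, valid_dl)

Since the split is already baked into the two datasets, no splitter is needed at this point.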


Thank you for your kind response! I will try both of them and report back. It is quite interesting that the post below mine suggests a similar approach, using native PyTorch dataloaders and fusing them with fastai (URL). That solution was proposed for a different problem. Anyway, I will try!

Update: The objective was to create a DataLoaders object from my dataset, and your second suggestion did it. It was so simple, and the answer lies in the book. Thank you, my friend!

So in theory, I can now proceed to train a model. However, I am now facing a new problem: I would like to perform batch data augmentation, but since I am not using a DataBlock, I am trying to find out how to do it. I will open a new post if I don't succeed.
