Custom dataset and dataloader for Learner

Hi folks,

Recently I have been trying to use fit_one_cycle to train a model with a custom PyTorch Dataset and DataLoader. The custom Dataset loads samples of shape (256, 256, 9) from HDF5 files.

Code:

import time, torch, os, h5py
from torch.utils.data import Dataset, DataLoader

class hdf5_dataset(Dataset):

    def __init__(self, path, data_type='train', transform=None):
        self.file_path = path
        self.data = None
        self.label = None
        self.data_type = data_type
        self.c = 17  # number of classes

        # Open the file briefly just to count the samples; the handles are
        # opened lazily in __getitem__ so they work with num_workers > 0.
        with h5py.File(self.file_path + data_type + '_data.h5', 'r') as file:
            self.len = len(file)

        self.transform = transform

    def __len__(self):
        return self.len

    def __getitem__(self, idx):

        # Lazily open the files (once per worker process) and cache the keys.
        if self.data is None:
            self.data = h5py.File(self.file_path + self.data_type + '_data.h5', 'r')
            self.data_list = list(self.data.keys())
        if self.label is None:
            self.label = h5py.File(self.file_path + self.data_type + '_label.h5', 'r')
            self.label_list = list(self.label.keys())

        # Dataset.value was removed in h5py 3.0; index with [()] instead.
        image = self.data[self.data_list[idx]][()]
        label = self.label[self.label_list[idx]][()]
        if self.transform:
            image = self.transform(image)
        return image, label

path = "path_to_hdf5_files"

batch_size = 64
num_workers = 2

trainset = hdf5_dataset(path, 'train')
validset = hdf5_dataset(path, 'valid')

print(len(trainset), len(validset))

train_dl = DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True)
valid_dl = DataLoader(validset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)  # no need to shuffle validation data

from fastai.vision.data import DataLoaders
data = DataLoaders(train_dl, valid_dl)

from fastai.vision.learner import cnn_learner
from fastai.metrics import error_rate
from torchvision import models

learner_original = cnn_learner(data, models.resnet34, metrics=error_rate, pretrained=True)

torch.cuda.set_device(0)
learner_original.model.cuda()

learner_original.freeze()
learner_original.fit_one_cycle(5)

learner_original.unfreeze()
learner_original.fit_one_cycle(5)

Running this fails with:

AssertionError: n_out is not defined, and could not be inferred from data, set dls.c or pass n_out

Question:
I assumed fastai.vision.data.DataLoaders could wrap two torch.utils.data.DataLoader objects and be used to build a Learner, but obviously I was wrong. So if I want to build a custom dataloader that loads data from HDF5 or NumPy files (not image files) for fastai's Learner, how should I do it?

Many thanks!


Hi Ethan, and welcome!

The way I dealt with this issue was to first import the fastai DataLoader (same name as PyTorch's) and use it to construct the two DataLoaders. fastai's DataLoader seems to work the same as the PyTorch one, and it eliminates the subsequent training error.

You may also need DataLoaders(train_dl, valid_dl).cuda().
🙂
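A minimal sketch of that suggestion, reusing trainset, validset, batch_size, and num_workers from the original post (and assuming fastai's DataLoader accepts the custom Dataset directly):

# fastai's DataLoader is (nearly) a drop-in replacement for
# torch.utils.data.DataLoader, but produces batches the Learner expects.
from fastai.data.load import DataLoader   # fastai's DataLoader, not PyTorch's
from fastai.data.core import DataLoaders

train_dl = DataLoader(trainset, bs=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True)
valid_dl = DataLoader(validset, bs=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)

data = DataLoaders(train_dl, valid_dl).cuda()  # .cuda() so batches land on the GPU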


All you have to do here is also assign your number of classes. So just do:

data.c = 3  # for three classes

cnn_learner relies on this; it has some helper functions and attributes that make it different from fastai's base Learner class.
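For the dataset in the original post, which has 17 classes (self.c = 17), a minimal sketch would be:

data.c = 17  # number of classes, so cnn_learner can infer n_out
learner_original = cnn_learner(data, models.resnet34, metrics=error_rate, pretrained=True)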

Thank you so much, the fastai DataLoader worked well!

Hey guys, I have a very similar problem. I am trying to load a Hugging Face image dataset into a fastai vision_learner with the following code:

import torch
from datasets import load_from_disk
from fastai.vision.all import *

class CustomImageDataset(torch.utils.data.Dataset):
    def __init__(self, images, labels):
        self.images = images
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = self.images[idx]
        label = self.labels[idx]
        # resample=0 is nearest-neighbour resampling in PIL
        return np.array(image.resize((224, 224), resample=0)), label


def get_dataset_huggingface():
    ds = load_from_disk("../output.hf")
    ds = ds.class_encode_column("L3")
    #num_classes = ds.features["L3"].num_classes
    ds = ds.train_test_split(test_size=0.2, seed=42)
    train_data = CustomImageDataset(ds["train"]["image"], ds["train"]["L3"])
    test_data = CustomImageDataset(ds["test"]["image"], ds["test"]["L3"])
    train_dataloader = DataLoader(train_data, batch_size=32)
    test_dataloader = DataLoader(test_data, batch_size=32)
    dls = DataLoaders(train_dataloader, test_dataloader)

    train_features, train_labels = next(iter(train_dataloader))
    print(f"Feature batch shape: {train_features.size()}")
    print(f"Labels batch shape: {train_labels.size()}")
    img = train_features[0].squeeze()
    label = train_labels[0]
    plt.imshow(img)
    plt.savefig("test.png")
    print(f"Label: {label}")

    learn = vision_learner(
            dls,
            resnet34,
            metrics=[error_rate, accuracy],
            concat_pool=True,
            splitter=default_split,
        ).to_fp16()
    return dls, learn

However, I keep getting the following error:
AssertionError: "n_out" is not defined, and could not be inferred from data, set "dls.c" or pass "n_out"

I also tried dls = DataLoaders(train_dataloader, test_dataloader).cuda(), but I get the following error:
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
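For reference, the plt.imshow(img) call in my snippet above is the kind of place where a CUDA tensor would hit NumPy. A hedged guess, based only on the traceback's own suggestion, is that the tensor needs copying back to host memory first:

img = train_features[0].squeeze().cpu()  # .cpu() before handing the tensor to matplotlib
plt.imshow(img)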

Any idea what could be wrong?