How to load a huge image dataset with the fastai library?

It seems the fastai library takes the whole dataset into memory, which is not possible for huge datasets. Is there a way to process such datasets with the library, or do you have to implement it yourself?

Aren't images always loaded into memory in batches?

I see in from_csv (for example):

datasets = cls.get_ds(f, (trn_fnames,trn_y), (val_fnames,val_y), tfms, path=path, test=test_fnames)

So there is no data loading here; we only have filenames and labels. Only when we do the forward/backward pass do we request a specific batch i, and that is when the actual loading happens:

class FilesDataset(BaseDataset):
    def __init__(self, fnames, transform, path):
        # only the path and the list of file names are kept in memory
        self.path,self.fnames = path,fnames
        super().__init__(transform)
    def get_x(self, i):
        # the image itself is read from disk lazily, one index at a time
        flags = cv2.IMREAD_UNCHANGED+cv2.IMREAD_ANYDEPTH+cv2.IMREAD_ANYCOLOR
        fn = os.path.join(self.path, self.fnames[i])
        return cv2.cvtColor(cv2.imread(fn, flags), cv2.COLOR_BGR2RGB).astype(np.float32)/255
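
To make the pattern concrete, here is a minimal generic sketch of the same idea in plain PyTorch (not fastai's code): only file names and labels are held in memory, and each image is read from disk when the DataLoader asks for its index, i.e. batch by batch.

import os
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class LazyImageDataset(Dataset):
    def __init__(self, path, fnames, labels):
        # keep only lightweight metadata (paths, names, labels) in memory
        self.path, self.fnames, self.labels = path, fnames, labels
    def __len__(self):
        return len(self.fnames)
    def __getitem__(self, i):
        # the image is read from disk only when this index is requested
        fn = os.path.join(self.path, self.fnames[i])
        im = cv2.cvtColor(cv2.imread(fn), cv2.COLOR_BGR2RGB).astype(np.float32) / 255
        return torch.from_numpy(im).permute(2, 0, 1), self.labels[i]

# usage sketch ('train', fnames and labels are placeholders for your own data):
# ds = LazyImageDataset('train', fnames, labels)
# dl = DataLoader(ds, batch_size=64, shuffle=True, num_workers=4)

(Images would still need resizing/transforms before batching, which is what the transform argument in fastai's FilesDataset takes care of.)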

You are right, I was wrong: it is not taking everything into memory. Rather, when you have a huge dataset with a CSV file, the CSV parsing fails and the kernel dies. I think it is because of the huge dictionary it is creating.
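
For what it's worth, here is a rough sketch of a more memory-frugal way to build n-hot labels from a big CSV (generic Python, not the fastai implementation; the file name labels.csv and its id/tags columns are made up for illustration). A sparse matrix avoids keeping a big per-row dictionary of label lists:

import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix

df = pd.read_csv('labels.csv')                 # hypothetical: columns 'id', 'tags'
tag_lists = df['tags'].str.split(' ')          # space-separated multi-label tags
classes = sorted({t for tags in tag_lists for t in tags})
c2i = {c: i for i, c in enumerate(classes)}

# build the n-hot matrix sparsely instead of as nested Python dicts/lists
y = lil_matrix((len(df), len(classes)), dtype=np.float32)
for row, tags in enumerate(tag_lists):
    for t in tags:
        y[row, c2i[t]] = 1.0
y = y.tocsr()                                  # compact sparse n-hot labels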


Do you have the error message? Let's read it together :sunglasses:

There is no explicit error, actually; the kernel just dies and restarts. I used print statements to trace that it is dying inside the nhot_labels function while creating the all_idx dictionary.
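
As a side note, a quick generic way to confirm that it really is a memory blow-up (standard-library tracemalloc, nothing fastai-specific) would be to wrap the suspect step, roughly:

import tracemalloc

tracemalloc.start()
# ... run the suspect CSV-parsing / label-encoding step here ...
current, peak = tracemalloc.get_traced_memory()
print(f'current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB')
for stat in tracemalloc.take_snapshot().statistics('lineno')[:5]:
    print(stat)   # top allocation sites by source line
tracemalloc.stop()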

Oh, I see. How big is your dataset (images, classes)? Do you see an actual memory overload? Or maybe there is some other error with the class encoding, like NaN values?