It seems the fastai library takes the whole dataset into memory, which is not possible for huge datasets. Is there a way to process such datasets with the library, or do you have to implement it yourself?
Are images always loaded into memory in batches?
I see in from_csv (for example):

    datasets = cls.get_ds(f, (trn_fnames,trn_y), (val_fnames,val_y), tfms, path=path, test=test_fnames)
So there is no data loading here; we only have filenames and labels. Only when we do forward/back-propagation do we request a specific batch i, and that is when the loading happens:
    class FilesDataset(BaseDataset):
        def __init__(self, fnames, transform, path):
            self.path, self.fnames = path, fnames
            super().__init__(transform)

        def get_x(self, i):
            flags = cv2.IMREAD_UNCHANGED + cv2.IMREAD_ANYDEPTH + cv2.IMREAD_ANYCOLOR
            fn = os.path.join(self.path, self.fnames[i])
            return cv2.cvtColor(cv2.imread(fn, flags), cv2.COLOR_BGR2RGB).astype(np.float32) / 255
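The same lazy-loading idea can be sketched without fastai or OpenCV: keep only the filenames in memory and read each file when index i is requested. This is a minimal, hypothetical sketch (the class name and plain-bytes loading are mine, not fastai's API):

```python
import os
import tempfile

class LazyFilesDataset:
    """Hypothetical sketch of the lazy-loading pattern: only the
    filenames are held in memory; file contents are read on demand."""
    def __init__(self, fnames, path):
        self.path, self.fnames = path, fnames  # no file contents loaded here

    def __len__(self):
        return len(self.fnames)

    def get_x(self, i):
        # Item i is read from disk only when it is actually requested,
        # e.g. while assembling a batch during training.
        fn = os.path.join(self.path, self.fnames[i])
        with open(fn, 'rb') as f:
            return f.read()

# Demo: write two tiny files, then load them lazily one at a time.
tmp = tempfile.mkdtemp()
for name, data in [('a.bin', b'\x00\x01'), ('b.bin', b'\x02\x03')]:
    with open(os.path.join(tmp, name), 'wb') as f:
        f.write(data)

ds = LazyFilesDataset(['a.bin', 'b.bin'], tmp)
print(len(ds))      # 2
print(ds.get_x(1))  # b'\x02\x03'
```

Because only one item is decoded per request, peak memory is bounded by the batch size rather than the dataset size.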
You are right, I was wrong: it is not taking everything into memory. Rather, when you have a huge dataset with a CSV file, the CSV parsing fails and the kernel dies. I think it is because of the huge dictionary it creates.
Do you have the error message? Let's read it together.
There is no explicit error, actually; the kernel just dies and restarts. I used print messages to trace that it dies inside the nhot_labels function while creating the all_idx dictionary.
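For context on where such a dictionary comes from: an n-hot encoder for multi-label CSV data typically builds a class-to-column index and then a row per sample, so the encoded output grows with samples × classes. The sketch below is my own guess at the shape of that computation, not fastai's actual nhot_labels implementation:

```python
def nhot_labels_sketch(label_lists, classes):
    """Hypothetical n-hot encoder: each output row has a 1 in the
    column of every class present in that sample's label list."""
    # Class-name -> column index; with many samples and classes the
    # intermediate structures built around this can get large.
    idx = {c: i for i, c in enumerate(classes)}
    out = []
    for labels in label_lists:
        row = [0] * len(classes)
        for lbl in labels:
            row[idx[lbl]] = 1
        out.append(row)
    return out

labels = [['cat'], ['cat', 'dog'], ['bird']]
classes = ['bird', 'cat', 'dog']
print(nhot_labels_sketch(labels, classes))
# [[0, 1, 0], [0, 1, 1], [1, 0, 0]]
```

If the dense samples × classes matrix (or a dict over all sample/label pairs) is materialized at once, a huge CSV could plausibly exhaust memory before any image is ever loaded.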
Oh, I see. How big is your dataset (how many images and classes)? Do you see an actual memory overload? Or maybe it is some other error in the class encoding, like a NaN?