Custom DataBunch for Saved Tensors

I’m trying to create a custom dataset for saved torch tensors. The tensors are saved as .pt files and labeled via a pandas dataframe. These are the two classes I’ve created so far:

    class TensorItem(ItemBase):
        def __init__(self, data):
            # .data is what gets collated into batches, .obj is what gets displayed
            self.data = self.obj = data

        def apply_tfms(self, tfms, **kwargs):
            # no transforms are applied to the raw tensors
            return self

        def __repr__(self):
            return f'{self.__class__.__name__} {tuple(self.data.shape)}'

    class TensorList(ItemList):
        _bunch = DataBunch

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        def open(self, fn):
            # load a saved .pt file and wrap it as an item
            return TensorItem(torch.load(self.path/fn))

        def get(self, i):
            fn = super().get(i)   # filename taken from the dataframe column
            res = self.open(fn)
            return res
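
For reference, this is roughly the setup these classes assume; the directory layout, filenames, tensor shapes and labels below are just illustrative:

    import torch
    import pandas as pd
    from pathlib import Path

    path = Path('data')
    (path/'tensors').mkdir(parents=True, exist_ok=True)

    # in my real pipeline the tensors come from a model running on the GPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    rows = []
    for i in range(100):
        t = torch.randn(3, 64, device=device)
        fn = f'tensors/{i}.pt'
        torch.save(t, path/fn)
        rows.append({'data_col': fn, 'label_col': i % 2})

    # 'data_col' holds paths relative to `path`, 'label_col' holds the labels
    df = pd.DataFrame(rows)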

From these, I can create a set of LabelLists:

    data = TensorList.from_df(df, path, 'data_col').split_by_rand_pct(0.1).label_from_df('label_col')

At this point everything seems correct: indexing into the dataset with data.train[0][0].data returns a loaded tensor, and everything is labeled correctly. The issues start when I try to create a databunch.

I can run the databunch creation:

    data = data.databunch(bs=32, num_workers=0)

but trying to grab a batch throws an error.

If I use multiple workers instead:

    data = data.databunch(bs=32, num_workers=8)

trying to grab a batch hangs indefinitely, and interrupting it with the keyboard gives another stack trace.

In both cases the issue seems to be with the proc_batch function, which lives in the DeviceDataLoader class rather than the DataBunch class. I think the solution is to create a custom DataLoader, but I'm not sure what needs to change. Or is this something a custom collate function would solve? Ideas welcome.
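
For what it's worth, this is the kind of collate function I had in mind, assuming all the tensors share the same shape, though I'm not sure it's the right direction:

    def tensor_collate(batch):
        # batch is a list of (TensorItem, Category) pairs from the LabelList
        xs = torch.stack([x.data for x, y in batch])
        ys = torch.tensor([y.data for x, y in batch])
        return xs, ys

    # replacing the earlier databunch call on the LabelLists
    data = data.databunch(bs=32, num_workers=0, collate_fn=tensor_collate)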

I think your tensors have been saved on the GPU, which causes an error when you load them back with a DeviceDataLoader. I'd suggest adding a .cpu() in your init.
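
Something along these lines in TensorItem (just a sketch of that suggestion):

    class TensorItem(ItemBase):
        def __init__(self, data):
            # move the tensor back to the CPU before the DataLoader sees it
            self.data = self.obj = data.cpu()

        def apply_tfms(self, tfms, **kwargs):
            return self

        def __repr__(self):
            return f'{self.__class__.__name__} {tuple(self.data.shape)}'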


Yes, as sgugger points out, torch is loading the saved tensors onto their original device, which looks like it was a GPU, but PyTorch's DataLoader expects CPU tensors. It therefore tries to pin the tensors, which is only appropriate for CPU tensors and causes that error when applied to a GPU tensor. There is a pin_memory argument on the DataLoader constructor you can set to False (pass it to the databunch() function and it is forwarded when the DataLoader is created). However, you then need to take extra care if using multiple workers to avoid memory leaks or other errors (and support depends on platform and multiprocessing method); see https://pytorch.org/docs/stable/notes/multiprocessing.html for details. Multiple workers may not provide much advantage if you are just loading saved tensors anyway.
It may be easiest to load the tensors to the CPU as suggested, though as an alternative to calling .cpu() you can use the map_location argument of torch.load and avoid the overhead of moving them across to the GPU and back.
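
Concretely, the pin_memory route would look something like this (a sketch; the extra keyword argument is forwarded from databunch() down to the DataLoader as described above):

    # don't pin memory, so the DataLoader won't try to pin the GPU tensors
    data = data.databunch(bs=32, num_workers=0, pin_memory=False)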


You’re both right. Tensors were saved on the GPU, which was causing issues. Specifying the CPU as the map location in the load step solves the problem.

    class TensorList(ItemList):
        _bunch = DataBunch

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)

        def open(self, fn):
            # map_location='cpu' keeps the loaded tensors on the CPU
            return TensorItem(torch.load(self.path/fn, map_location='cpu'))

        def get(self, i):
            fn = super().get(i)   # filename taken from the dataframe column
            res = self.open(fn)
            return res
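
For completeness, grabbing a batch now works as expected; roughly (same df, path and column names as earlier):

    data = (TensorList.from_df(df, path, 'data_col')
            .split_by_rand_pct(0.1)
            .label_from_df('label_col')
            .databunch(bs=32, num_workers=0))

    x, y = data.one_batch()   # no more pinning error or hang
    print(x.shape, y.shape)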