Combining Dataloaders

I have a weird use case that I can't seem to get working. My data consists of “Images” that are grouped into “Studies”. Each Study consists of either a “Frontal” x-ray image, or a “Frontal” and a “Lateral” x-ray image. In the case where two images are available, I basically want the forward pass to return torch.max(pred_frontal, pred_lateral). Where only one image is available, the prediction of the frontal image alone should be used. For example:

def forward(self, x1, x2=None):
    ftrs_frontal = self.encoder(x1)
    pred_frontal = self.head(ftrs_frontal)
    if x2 is not None:
        ftrs_lateral = self.encoder(x2)
        pred_lateral = self.head(ftrs_lateral)
        return torch.max(pred_frontal, pred_lateral)
    return pred_frontal

Now the problem is that the forward function gets batches as parameters, and I don't want these two cases batched together. What I need are two separate batches for the inputs, where they don't mix. What I've been doing is first training on one Study group and then on the other, which isn't ideal.
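For illustration, the kind of call pattern I'm after would look roughly like this (just a plain-PyTorch sketch, not my actual code; frontal_only_dl and frontal_lateral_dl stand for the two hypothetical loaders):

def train_one_epoch(model, frontal_only_dl, frontal_lateral_dl, loss_func, opt):
    # studies with only a frontal image: batches of (x1, y)
    for x1, y in frontal_only_dl:
        loss = loss_func(model(x1), y)       # forward takes the x2=None branch
        loss.backward(); opt.step(); opt.zero_grad()
    # studies with a frontal and a lateral image: batches of (x1, x2, y)
    for x1, x2, y in frontal_lateral_dl:
        loss = loss_func(model(x1, x2), y)   # forward takes the max of both predictions
        loss.backward(); opt.step(); opt.zero_grad()

Below is what I currently use to build the two loaders separately.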

import pandas as pd
from fastai.vision.all import *

# StudyTransform, target_label and path are defined elsewhere

def get_only_lateral_studies_data_loader(df_path):
    df = pd.read_csv(df_path)
    train_df = df.loc[(df['valid'] == False) & (df['Lateral'] != 'black.jpg')]
    valid_df = df.loc[(df['valid'] == True) & (df['Lateral'] != 'black.jpg')]
    train_df.reset_index(inplace=True)
    valid_df.reset_index(inplace=True)
    train_tl = TfmdLists(range(len(train_df)), StudyTransform(train_df))
    valid_tl = TfmdLists(range(len(valid_df)), StudyTransform(valid_df))
    dls = DataLoaders.from_dsets(train_tl, valid_tl,
                                 after_item=[ToTensor],
                                 after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats), *aug_transforms()])
    return dls.cuda()

def get_only_frontal_studies_data_loader(df_path):
    df = pd.read_csv(df_path)
    df = df.loc[df['Lateral'] == 'black.jpg']
    df[target_label[0]] = df[target_label[0]].astype(bool)
    return ImageDataLoaders.from_df(df=df, path=path, fn_col='Frontal', valid_col='valid',
                                    label_col=target_label, batch_tfms=aug_transforms())

Is there a way of combining these dataloaders and getting one batch from each dataloader that I can feed into the network, or am I thinking about this wrong?

See this technique; it does exactly what you want: Combining Tabular + Images in fastai2 (and it should work with almost any other type).


Thanks for the fast reply, will definitely try this. Cheers!


This approach mixes the datasets into the same batch, right?

I want the DataLoader to provide the datasets in separate batches.

They’ll be in the same batch, yes. But it’ll be in a format like:

x1, x2, y

Where x1 is the batch from the first DataLoader and x2 is the batch from the second. The datasets are kept separate, and separate augmentations are applied based on the DataLoader.

The datasets don't share the same output in my case. It's not a multimodal problem. I just need the dataloader to randomly return a batch from one of the two datasets/loaders.

If you just want it to randomly return a batch, change the __iter__ here:

batch.extend(dl.dls[0].after_batch(b[0])[:2]) # tabular cat and cont
batch.append(dl.dls[1].after_batch(b[1][0])) # Image

That's what controls how the batch is made, so you could use something like random.random() where, say, if n > y you return both, etc.
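For example, something along these lines would also work as a stand-alone wrapper (just a sketch of the idea, not the actual MixedDL internals; RandomMixedDL is a made-up name, and it assumes each wrapped DataLoader yields complete batches and reports its length in batches):

import random

class RandomMixedDL:
    "Sketch: yield whole batches from two DataLoaders in random order, never mixing them"
    def __init__(self, dl_a, dl_b):
        self.dls = [dl_a, dl_b]

    def __len__(self):
        return len(self.dls[0]) + len(self.dls[1])

    def __iter__(self):
        iters = [iter(dl) for dl in self.dls]
        remaining = [len(dl) for dl in self.dls]
        while sum(remaining) > 0:
            # pick a loader at random, weighted by how many batches it has left
            i = random.choices([0, 1], weights=remaining)[0]
            remaining[i] -= 1
            yield next(iters[i])

Every yielded batch comes entirely from one of the two loaders, so they never get mixed.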

This seems to be along the lines of the question I asked here: Batch grouping - fastai users - Deep Learning Course Forums. Basically, my training data contains some (slightly) different training sets, and only items of the same kind should be combined in a batch (in my case because the dimensions of the data vary slightly). However, I can't quite seem to connect the dots and see whether I can make the approach you (@muellerzr) outline work.

The datasets are disjoint. They even have different lengths. For model training I want a way to first go through one of the dataloaders completely and afterwards the other, so one epoch would consist of going through both datasets completely.

I tried something like this, but it stops the epoch too early (after len(self.dls[0])*2 batches, basically):

    def __len__(self): return len(self.dls[0]) + len(self.dls[1])

    def __iter__(self):
        for dl in self.dls:
            z = _loaders[dl.fake_l.num_workers==0](dl.fake_l)
            for b in z:
                inps = []
                outs = []
                if self.device is not None:
                    b = to_device(b, self.device)
                # apply the correct after_batch transforms for this DataLoader
                batch = dl.after_batch(b)
                inps += batch[:dl.n_inp]
                outs += batch[dl.n_inp:]
                #inps = L(inps)[self.x_idxs]
                #outs = L(outs)[self.y_idxs]
                yield (inps, outs)

I did a little research and you can chain generators by doing the following:

class MixedDL():
...
        
    def __iter__(self):
        yield from self.dls[0]
        yield from self.dls[1]

    def __len__(self): return len(self.dls[0]) + len(self.dls[1])

...

The expected behaviour would be that the items of the first generator are yielded until there are none left, and after that the items of the second generator. As mentioned before, the dataloaders have different lengths:

len(dls_frontal[0]) -> 688
len(dls_lateral[0]) -> 207
mixed_train = MixedDL(dls_frontal[0], dls_lateral[0])
mixed_valid = MixedDL(dls_frontal[1], dls_lateral[1])
dls = DataLoaders(mixed_train, mixed_valid).cuda()

When I run the following code I get an IndexError in DataLoader:

count = 0
test = iter(dls.train)
while True:
    count += 1
    next(test)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-217-2314ae7032b5> in <module>
      3 while True:
      4     count +=1
----> 5     next(test)

<ipython-input-212-02a3bea807b0> in __iter__(self)
     20     def __iter__(self):
     21         yield from self.dls[0]
---> 22         yield from self.dls[1]
     23 
     24     def one_batch(self):

/opt/conda/lib/python3.7/site-packages/fastai/data/load.py in __iter__(self)
    100         self.before_iter()
    101         self.__idxs=self.get_idxs() # called in context of main process (not workers/subprocesses)
--> 102         for b in _loaders[self.fake_l.num_workers==0](self.fake_l):
    103             if self.device is not None: b = to_device(b, self.device)
    104             yield self.after_batch(b)

/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    361 
    362     def __next__(self):
--> 363         data = self._next_data()
    364         self._num_yielded += 1
    365         if self._dataset_kind == _DatasetKind.Iterable and \

/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    987             else:
    988                 del self._task_info[idx]
--> 989                 return self._process_data(data)
    990 
    991     def _try_put_index(self):

/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _process_data(self, data)
   1012         self._try_put_index()
   1013         if isinstance(data, ExceptionWrapper):
-> 1014             data.reraise()
   1015         return data
   1016 

/opt/conda/lib/python3.7/site-packages/torch/_utils.py in reraise(self)
    393             # (https://bugs.python.org/issue2651), so we work around it.
    394             msg = KeyErrorMessage(msg)
--> 395         raise self.exc_type(msg)

IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
    data = fetcher.fetch(index)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 34, in fetch
    data = next(self.dataset_iter)
  File "/opt/conda/lib/python3.7/site-packages/fastai/data/load.py", line 111, in create_batches
    yield from map(self.do_batch, self.chunkify(res))
  File "/opt/conda/lib/python3.7/site-packages/fastcore/utils.py", line 381, in chunked
    res = list(itertools.islice(it, chunk_sz))
  File "/opt/conda/lib/python3.7/site-packages/fastai/data/load.py", line 124, in do_item
    try: return self.after_item(self.create_item(s))
  File "/opt/conda/lib/python3.7/site-packages/fastai/data/load.py", line 130, in create_item
    def create_item(self, s):  return next(self.it) if s is None else self.dataset[s]
  File "/opt/conda/lib/python3.7/site-packages/fastai/data/core.py", line 278, in __getitem__
    res = super().__getitem__(idx)
  File "/opt/conda/lib/python3.7/site-packages/fastcore/foundation.py", line 219, in __getitem__
    def __getitem__(self, idx): return self._get(idx) if is_indexer(idx) else L(self._get(idx), use_list=None)
  File "/opt/conda/lib/python3.7/site-packages/fastcore/foundation.py", line 223, in _get
    if is_indexer(i) or isinstance(i,slice): return getattr(self.items,'iloc',self.items)[i]
IndexError: list index out of range
count -> 689

It throws this error exactly at the end of the first generator and the beginning of the second one.
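To narrow it down, a quick check is to iterate each underlying DataLoader on its own (a small sketch using the loaders from above) and see whether the failure only appears at the hand-over between the two generators:

# count batches from each underlying DataLoader separately
for name, dl in [("frontal", dls_frontal[0]), ("lateral", dls_lateral[0])]:
    n = sum(1 for _ in dl)
    print(name, "yielded", n, "batches")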

Have you managed to move ahead with your problem? I have a similar use case.
