This is great Zach!
I got stuck on the same issue as @bwarner, but couldn’t seem to fix it. How would you go about creating mixedDL1 and mixedDL2?
I tried:
You cannot use cnn_learner with this DataLoader, as those models won’t work with it. You need to build a custom architecture. So the two are incompatible, and there is no fix.
I have a generic version incoming that will be put in as a PR soon. All that’s needed is feeding in your DataLoaders and it will grab the appropriate x’s and y’s (without repeats).
Here’s the big behemoth in all of its glory:
class MixedDL():
    def __init__(self, *dls, device='cuda:0'):
        "Accepts any number of `DataLoaders` and a device"
        self.device = device
        for dl in dls: dl.shuffle_fn = self.shuffle_fn
        self.dls = dls
        self.count = 0
        self.fake_l = _FakeLoader(self, False, 0, 0)
        self._get_idxs()

    def __len__(self): return len(self.dls[0])

    def _get_vals(self, x):
        "Checks for duplicates in batches"
        idxs, new_x = [], []
        for i, o in enumerate(x): x[i] = o.cpu().numpy().flatten()
        for idx, o in enumerate(x):
            # Keep only the first occurrence of each distinct array
            if not _arrayisin(o, new_x):
                idxs.append(idx)
                new_x.append(o)
        return idxs

    def _get_idxs(self):
        "Get `x` and `y` indices for batches of data"
        dl_dict = dict(zip(range(0,len(self.dls)), [dl.n_inp for dl in self.dls]))
        inps = L([])
        outs = L([])
        for key, n_inp in dl_dict.items():
            b = next(iter(self.dls[key]))
            inps += L(b[:n_inp])
            outs += L(b[n_inp:])
        self.x_idxs = self._get_vals(inps)
        self.y_idxs = self._get_vals(outs)

    def __iter__(self):
        z = zip(*[_loaders[i.fake_l.num_workers==0](i.fake_l) for i in self.dls])
        for b in z:
            inps = []
            outs = []
            if self.device is not None:
                b = to_device(b, self.device)
            for batch, dl in zip(b, self.dls):
                # Each DataLoader applies its own batch transforms
                batch = dl.after_batch(batch)
                inps += batch[:dl.n_inp]
                outs += batch[dl.n_inp:]
            inps = L(inps)[self.x_idxs]
            outs = L(outs)[self.y_idxs]
            yield (inps, outs)

    def one_batch(self):
        "Grab one batch of data"
        with self.fake_l.no_multiproc(): res = first(self)
        if hasattr(self, 'it'): delattr(self, 'it')
        return res

    def shuffle_fn(self, idxs):
        "Shuffle the internal `DataLoaders`"
        if self.count == 0:
            # First loader draws the permutation
            self.rng = self.dls[0].rng.sample(idxs, len(idxs))
            self.count += 1
            return self.rng
        if self.count == 1:
            # Second loader reuses it, then the counter resets
            self.count = 0
            return self.rng

    def show_batch(self):
        "Show a batch of data"
        for dl in self.dls:
            dl.show_batch()

    def to(self, device): self.device = device
And its helper:
import numpy as np

def _arrayisin(arr, arr_list):
    "Checks if `arr` is in `arr_list`"
    for a in arr_list:
        if np.array_equal(arr, a):
            return True
    return False
So what was needed?
I had to figure out a way to first check how many outputs we had (normally), and then check that all of our y’s are unique, in case we merged two DataLoaders that both had similar get_y's (or repeated x’s). This is done in the _get_idxs and _get_vals functions.
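To make the duplicate check concrete, here is a minimal sketch of the same idea using only plain Python lists in place of tensors. The names `array_in` and `unique_idxs` and the toy targets are hypothetical, chosen for illustration; they are not part of MixedDL.

```python
def array_in(arr, arr_list):
    "Return True if `arr` equals any entry of `arr_list` (lists stand in for tensors)"
    return any(arr == a for a in arr_list)

def unique_idxs(items):
    "Indices of the first occurrence of each distinct item, in order"
    idxs, seen = [], []
    for idx, item in enumerate(items):
        if not array_in(item, seen):
            idxs.append(idx)
            seen.append(item)
    return idxs

# Hypothetical targets from two loaders that share the same `get_y`
tab_y = [0, 1, 1, 0]
img_y = [0, 1, 1, 0]
outs = [tab_y, img_y]

y_idxs = unique_idxs(outs)
print(y_idxs)  # [0]
```

Only index 0 survives, so the merged batch would carry a single copy of the target rather than one per loader.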
Awesome job @muellerzr. I have a really dumb question: can this combination of text and images be done at the Datasets level instead of the DataLoaders level you have shown? My idea is to put everything into the Datasets class and then use the generic fastai DataLoader. It’s just a thought I had.
No. We do it at the DataLoader level to avoid the headaches of dealing with transforms. It’s a DataLoader of DataLoaders, never interrupting each one’s pipeline. (You’d need to do this at that level because of the augmentation pipelines; even tabular has GPU transforms that operate on the batches.)
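As a rough illustration of the "DataLoader of DataLoaders" idea, here is a toy sketch in plain Python. `ToyLoader` and `ToyMixedLoader` are hypothetical stand-ins, not fastai classes; the point is only that each child keeps its own batch transform and the wrapper just zips the results.

```python
class ToyLoader:
    "Hypothetical stand-in for a DataLoader: batches items, then applies its own transform"
    def __init__(self, items, bs, after_batch):
        self.items, self.bs, self.after_batch = items, bs, after_batch

    def __iter__(self):
        for i in range(0, len(self.items), self.bs):
            yield self.after_batch(self.items[i:i + self.bs])

class ToyMixedLoader:
    "Zips several loaders; each child's transform touches only its own batch"
    def __init__(self, *loaders):
        self.loaders = loaders

    def __iter__(self):
        for parts in zip(*self.loaders):
            yield tuple(parts)

nums = ToyLoader([1, 2, 3, 4], bs=2, after_batch=lambda b: [x * 10 for x in b])
words = ToyLoader(["a", "b", "c", "d"], bs=2, after_batch=lambda b: [s.upper() for s in b])

batches = list(ToyMixedLoader(nums, words))
print(batches)  # [([10, 20], ['A', 'B']), ([30, 40], ['C', 'D'])]
```

The wrapper never needs to know what kind of data each child holds, which is why the transform pipelines stay untouched.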
I understand what you mean. The reason I had this thought is that if I were to do this in normal PyTorch code, I’d do it at the dataset level. I still feel the transforms could be applied with typedispatch to the different tensor types defined in a mixed Datasets and still be able to account for GPU transforms. I also agree that it’ll be a bit of a headache to implement. I’ll still look into it and update if I have any good results.
I think for this problem (ISIC competition) the code should be something like this:
dblock = DataBlock(
    blocks=(ImageBlock(cls=PILDicom),  # image_name
            CategoryBlock,             # sex
            CategoryBlock,             # anatom_site_general_challenge
            RegressionBlock,           # age_approx
            CategoryBlock),            # target
    getters=[ColReader('image_name', pref=train_path, suff='.dcm'),
             ColReader('sex'),
             ColReader('anatom_site_general_challenge'),
             ColReader('age_approx'),
             ColReader('target')],
    n_inp=4,  # Set the number of inputs
    item_tfms=Resize(128),
    splitter=...,
    batch_tfms=...
)
Or use get_x + get_y instead of getters + n_inp.
What do you think about it? How should the augmentation transforms be handled when you have multiple data types? I’m new to the library; I hope you find this approach interesting.
No. Bengali is fine: they’re all image inputs, and the DataBlock API is expected to work as such here. I even made a tutorial notebook myself explaining this. In a multi-modal scenario (what this is designed for), we have multiple different input types, such as tabular + text, images + tabular + text, and so on. There is not an easy way to bring this into the library, as it involves headaches with the transform pipelines. How do you deal with it when you just want to augment your images? How do you make sure your batches all come from the same place? This is what the MixedDL attempts to solve. What you describe is just a simple scenario where it works. (While yes, technically that is multimodal, it’s a multimodal where the inputs are all the exact same type, which is not what this is designed for.) Does this help some?
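On making sure the batches all come from the same place: the `shuffle_fn` in the class above does this by drawing one permutation and handing the same one to each internal loader. Here is a minimal standalone sketch of that idea. `SharedShuffle` is a hypothetical name, and the modulo counter generalizes the two-loader counter to `n` loaders, which is my assumption rather than what the class does.

```python
import random

class SharedShuffle:
    "Draw one permutation per epoch and hand the same one to each of `n` loaders"
    def __init__(self, n, seed=0):
        self.n, self.count, self.perm = n, 0, None
        self.rng = random.Random(seed)

    def __call__(self, idxs):
        if self.count == 0:
            # First caller of the epoch draws the permutation
            self.perm = self.rng.sample(idxs, len(idxs))
        # Later callers reuse it; the counter wraps after all `n` have asked
        self.count = (self.count + 1) % self.n
        return self.perm

shuffle = SharedShuffle(n=2)
idxs = list(range(6))

image_order = shuffle(idxs)  # first loader asks
table_order = shuffle(idxs)  # second loader gets the same order
next_epoch = shuffle(idxs)   # a fresh permutation is drawn here

print(image_order == table_order)
```

Because both loaders index their data in the same order, row `i` of the image batch and row `i` of the tabular batch describe the same sample.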
Thanks. Yes, headaches with the transform pipelines seem to be the main issue with a single DataBlock. Maybe a list of tfms (one for each block) could be a solution for future versions of the library. Here is some pseudocode:
See this thread for why that can be problematic. There are a lot of workarounds needed here, as the text transforms/API is not the same as the vision one, and tabular is a whole other ballgame:
(Notice that instead of dealing with the DataBlock API we deal with TabularPandas, as tabular operates with this.) This method avoids that headache and, thanks to the generic method, requires almost no overhead from the user.
If you can find a more successful route, please let me know, but I’ve been trying to solve this problem for a few months now and this is the best solution I’ve discovered. (And Sylvain agrees too.)
Are those not all y’s you have there, though, besides the image input? If they weren’t, your cont vars would need to be normalized too, and your cat vars would need to be converted to integers with potential FillMissing, aka the entirety of TabularPandas. They have their own separate preprocessing you need to take into account.
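To show what that separate preprocessing amounts to, here is a rough standard-library sketch of the three steps: filling missing values, normalizing a continuous column, and turning categories into integer codes. The function names and the toy `age`/`sex` columns are hypothetical, and this only loosely mirrors what TabularPandas does; it is not its implementation.

```python
from statistics import mean, median, pstdev

def fill_missing(values):
    "Replace None with the median of the observed values"
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def normalize(values):
    "Shift to zero mean and scale to unit (population) standard deviation"
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def categorify(values):
    "Map each category to an integer code, reserving 0 for missing"
    cats = sorted({v for v in values if v is not None})
    vocab = {v: i + 1 for i, v in enumerate(cats)}
    return [vocab.get(v, 0) for v in values], vocab

# Hypothetical columns with one missing entry each
age = [40, None, 60, 50]
sex = ["male", "female", None, "female"]

age_n = normalize(fill_missing(age))
sex_codes, vocab = categorify(sex)
print(sex_codes)  # [2, 1, 0, 1]
```

None of this applies to the image column, which is why a single shared pipeline for all blocks gets awkward.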
And preprocessing everything entirely beforehand isn’t very efficient, and gets rid of batch and item transforms
(Though we may discover a few ways of doing this task, so keep at it.)
The line b = next(iter(self.dls[key])) returns a TfmdDL object for me, which is not subscriptable and thus raises an error, although I pass DataLoader objects into the function. I’m a little confused.
Edit: ah, I did not mention this in the thread so far! Apologies! (BTW, I will be moving this over to walkwithfastai.com this week, so it’ll be a more fleshed-out tutorial.) I’ll likely make a helper function to do this as well.