No. Bengali is fine: the inputs are all images, so the DataBlock API is expected to work as-is there. I even made a tutorial notebook myself explaining this. In a multi-modal scenario (what this is designed for), we have several different input types, such as tabular + text or images + tabular + text. There is no easy way to bring this into the library, because it involves headaches with the transform pipelines: how do you handle the case where you only want to augment your images? How do you make sure your batches all come from the same place? This is what MixedDL attempts to solve. What you describe is just a simple scenario where DataBlock works (while yes, technically that is multimodal, it’s a multimodal case where the inputs are all of exactly the same type, which is not what this is designed for). Does this help some?
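To make the "batches all come from the same place" problem concrete, here is a minimal sketch in plain Python (no fastai). It is not the MixedDL implementation; the helper names `batches` and `mixed_batches` and the toy data are made up for illustration. The only idea it shows is that every source reuses one shared shuffle, so position i in each batch still refers to the same row.

```python
import random

def batches(items, idxs, bs):
    "Yield lists of `bs` items from `items`, visited in the order given by `idxs`."
    for start in range(0, len(idxs), bs):
        yield [items[i] for i in idxs[start:start + bs]]

def mixed_batches(sources, bs, seed=0):
    "Zip batches from several row-aligned sources using ONE shared shuffle."
    n = len(sources[0])
    assert all(len(s) == n for s in sources), "sources must be row-aligned"
    idxs = list(range(n))
    random.Random(seed).shuffle(idxs)  # one permutation, reused by every source
    return zip(*(batches(s, idxs, bs) for s in sources))

# Hypothetical toy data: row i of each source describes the same sample.
images = [f"img_{i}" for i in range(6)]
tabular = [{"row": i} for i in range(6)]

for img_b, tab_b in mixed_batches([images, tabular], bs=2):
    # Each position in the image batch matches the same position in the tabular batch.
    assert [int(name.split("_")[1]) for name in img_b] == [r["row"] for r in tab_b]
```

If each source shuffled independently, the assertion inside the loop would fail on most runs, which is the misalignment a mixed loader has to prevent.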
Thanks. Yes, headaches with the transform pipelines seem to be the main issue with a single DataBlock. Maybe a list of tfms (one per block) could be a solution for future versions of the library. Here is some pseudocode:
DataBlock(
    blocks=[input_dataBlock_1,
            input_dataBlock_2,
            output_dataBlock_1,
            output_dataBlock_2],
    getters=[getter_input_data_1,
             getter_input_data_2,
             getter_output_data_1,
             getter_output_data_2],
    item_tfms=[tfms_for_input_data_1,
               tfms_for_input_data_2,
               tfms_for_output_data_1,
               tfms_for_output_data_2],
    n_inp=2)
See this thread for why that can be problematic. There are a lot of workarounds needed here, as the text transforms/API are not the same as the vision ones, and tabular is a ballgame of its own:
(Notice that instead of dealing with the DataBlock API we deal with TabularPandas, as that is what tabular operates with.) This method avoids that headache and, thanks to the generic approach, requires almost no overhead from the user.
If you can find a more successful route please let me know, but I’ve been trying to solve this problem for a few months now, and this is the best solution I’ve found. (And Sylvain agrees too.)
I found another solution. As the docs suggest, Datasets could be the solution for multi-modal problems:
Because fastai2 is very modular, you can build very practical datasets with clear code. Check my toy samoyed dataset:
Create dataset from pandas dataframe
dsets = Datasets(df, [
        #### Image
        [ColReader('Image', pref=img_path, suff='.jpg'),
         PILImage.create,
         Resize(128),
         ToTensor(),
         IntToFloatTensor()],
        #### NumVar
        [ColReader('NumVar'), RegressionSetup()],
        #### CatVar
        [ColReader('CatVar'), Categorize(), OneHotEncode()],
        #### MultiCatVar
        [ColReader('MultiCatVar', label_delim=' '), MultiCategorize(), OneHotEncode()]
    ],
    splits=None  # You can specify a train/val split here
)
Access 1st item
dsets[0]
(TensorImage([[[0.8196, 0.8275, 0.8118, ..., 0.5333, 0.5569, 0.5373]]]),
tensor(1.1000),
TensorMultiCategory([1., 0., 0.]),
TensorMultiCategory([0., 1., 0., 1.]))
Create dataloader & show batch
dls = dsets.dataloaders(bs=4)
dls.show_batch()
Are those not all y’s you have there, though, besides the image input? If they weren’t, your cont vars would need to be normalized too, and your cat vars would need to be converted to integers, potentially with FillMissing, i.e. the entirety of TabularPandas. They have their own separate preprocessing that you need to take into account.
And preprocessing everything entirely beforehand isn’t very efficient, and gets rid of batch and item transforms.
(Though we may discover a few ways of doing this task, so keep at it.)
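For readers unsure what "their own separate preprocessing" means for tabular inputs, here is a rough plain-Python sketch of the three steps mentioned above: filling missing values, mapping categories to integers, and normalizing continuous values. It only loosely mirrors the idea behind the TabularPandas procs and is not their implementation. The function names, the choice of median fill, the use of population standard deviation, and reserving code 0 for missing/unknown categories are all assumptions made for this example.

```python
from statistics import mean, median, pstdev

def fit_tabular_stats(rows, cat_name, cont_name):
    "Learn category codes and continuous statistics from the TRAINING rows only."
    cats = sorted({r[cat_name] for r in rows if r[cat_name] is not None})
    conts = [r[cont_name] for r in rows if r[cont_name] is not None]
    fill = median(conts)  # assumed fill strategy for missing values
    filled = [fill if r[cont_name] is None else r[cont_name] for r in rows]
    return {
        "codes": {c: i + 1 for i, c in enumerate(cats)},  # 0 kept for missing/unknown
        "fill": fill,
        "mean": mean(filled),
        "std": pstdev(filled) or 1.0,  # fall back to 1.0 to avoid dividing by zero
    }

def encode_row(row, stats, cat_name, cont_name):
    "Return (integer category code, normalized continuous value) for one row."
    code = stats["codes"].get(row[cat_name], 0)
    value = stats["fill"] if row[cont_name] is None else row[cont_name]
    return code, (value - stats["mean"]) / stats["std"]

# Hypothetical rows reusing the column names from the toy dataset above.
train = [{"CatVar": "a", "NumVar": 1.0},
         {"CatVar": "b", "NumVar": None},
         {"CatVar": "a", "NumVar": 3.0}]
stats = fit_tabular_stats(train, "CatVar", "NumVar")
encoded = [encode_row(r, stats, "CatVar", "NumVar") for r in train]
```

The point is that the statistics are fitted once on the training rows and then reused for validation and test rows, which is state that a plain list of item transforms does not carry by itself.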
@muellerzr Thanks for finally doing this… I’ve asked several times if this was possible but got no response or comment from anyone.
I also sent you DMs and got no replies…
This is exciting, I’ll be looking forward to the end-to-end example.
The line b = next(iter(self.dls[key])) returns a TfmdDL object for me, which is not subscriptable and thus raises an error, although I pass DataLoaders objects into the function. I’m a little confused.
Need to know a bit more about what you’re doing to help. Are you passing in one DL, or multiple DataLoaders, to MixedDL?
Sorry, so this is what I’m passing in:
def get_only_lateral_studies_data_loader(df_path):
    df = pd.read_csv(df_path)
    train_df = df.loc[(df['valid'] == False) & (df['Lateral'] != 'black.jpg')]
    valid_df = df.loc[(df['valid'] == True) & (df['Lateral'] != 'black.jpg')]
    train_df.reset_index(inplace=True)
    valid_df.reset_index(inplace=True)
    train_tl = TfmdLists(range(len(train_df)), StudyTransform(train_df))
    valid_tl = TfmdLists(range(len(valid_df)), StudyTransform(valid_df))
    dls = DataLoaders.from_dsets(train_tl, valid_tl, shuffle=True,
                                 after_item=[ToTensor],
                                 after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats), *aug_transforms()])
    dls = dls.cuda()
    return dls

def get_only_frontal_studies_data_loader(df_path):
    df = pd.read_csv(df_path)
    df = df.loc[df['Lateral'] == 'black.jpg']
    df[target_label[0]] = df[target_label[0]].astype(bool)
    return ImageDataLoaders.from_df(df=df, path=path, fn_col='Frontal', shuffle_train=True,
                                    valid_col='valid', label_col=target_label, batch_tfms=aug_transforms())

dls_lateral = get_only_lateral_studies_data_loader(df_path)
dls_frontal = get_only_frontal_studies_data_loader(df_path)
dls_mixed = MixedDL(dls_lateral, dls_frontal)
And this is the Stack Trace:
<ipython-input-102-185f93ba75ea> in __init__(self, device, *dls)
14 self.count = 0
15 self.fake_l = _FakeLoader(self, False, 0, 0)
---> 16 self._get_idxs()
17
18 def __len__(self): return len(self.dls[0])
<ipython-input-102-185f93ba75ea> in _get_idxs(self)
36 for key, n_inp in dl_dict.items():
37 b = next(iter(self.dls[key]))
---> 38 inps += L(b[:n_inp])
39 outs += L(b[n_inp:])
40 self.x_idxs = self._get_vals(inps)
TypeError: 'TfmdDL' object is not subscriptable
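A likely reading of this trace, sketched with stand-ins rather than fastai itself: if the object passed in is a container that holds one loader per split and supports indexing, then iterating the container yields the loaders themselves, not batches. `FakeDataLoaders` below is a hypothetical stand-in written for this example, and the assumption that fastai’s DataLoaders behaves the same way is inferred from the error message, not checked against the library source.

```python
class FakeDataLoaders:
    "Stand-in for a container that holds one loader per split and supports indexing."
    def __init__(self, *loaders): self.loaders = loaders
    def __getitem__(self, i): return self.loaders[i]

# Each "loader" here is just a list of (input, target) batches.
train_dl = [("x0", "y0"), ("x1", "y1")]
valid_dl = [("x2", "y2")]
dls = FakeDataLoaders(train_dl, valid_dl)

# Iterating the container falls back to __getitem__, so it yields loaders.
first_from_container = next(iter(dls))
assert first_from_container is train_dl

# Iterating a single split yields a batch, which is what b[:n_inp] expects.
first_batch = next(iter(dls[0]))
assert first_batch[:1] == ("x0",)
```

Under that reading, `b` in `_get_idxs` ends up being the training loader, and slicing it with `b[:n_inp]` fails.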
You need to pass in the individual train/valid DataLoaders separately, i.e.:
mixed_train = MixedDL(lateral[0], frontal[0])
mixed_valid = MixedDL(lateral[1], frontal[1])
And then:
dls = DataLoaders(mixed_train, mixed_valid)
Let me know if that solves your issue @NimaC
Edit: ah, I did not mention this in the thread so far! Apologies! (BTW, I will be moving this over to walkwithfastai.com this week, so it’ll be a more fleshed-out tutorial.) I’ll likely make a helper function to do this as well.
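Until such a helper exists, something along these lines might do. This is only a sketch: `mixed_dataloaders` is a hypothetical name, not a fastai or walkwithfastai function, and it takes the two classes as arguments so it makes no assumptions about their signatures beyond what is shown above (MixedDL takes one loader per modality, DataLoaders takes one loader per split).

```python
def mixed_dataloaders(mixed_cls, container_cls, *dls_groups, n_splits=2):
    """Build one `mixed_cls` per split and wrap them in `container_cls`.

    `dls_groups` are indexable by split (0 = train, 1 = valid), one per modality.
    """
    per_split = [mixed_cls(*(group[i] for group in dls_groups))
                 for i in range(n_splits)]
    return container_cls(*per_split)

# Intended usage with the objects from this thread (not executed here):
# dls = mixed_dataloaders(MixedDL, DataLoaders, dls_lateral, dls_frontal)
```

This is equivalent to writing out mixed_train and mixed_valid by hand; it just keeps the split indexing in one place.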
Hey Zach, did you end up moving this over? Could you point me to where in the repo/website?
Thanks!
Hi @muellerzr, thank you for your advice here. I am a new user of the fastai library. I built a model that combines image and tabular data, and it is already trained. Now I want to predict a single record from a test data frame. I used this method to integrate the input image and tabular data:
integratedata, _ = get_imagetabdatasets(test_image, tab_data)
The format of
integratedata[0]
is
((Image (3, 128, 128), TabularLine [tensor([2]), tensor([-0.6136])]),
EmptyLabel 0)
When I called
learn.predict(integratedata)
the error was: ‘ImageTabDataset’ object has no attribute ‘set_item’. What should I do to run inference on a single input, or a single record from a data frame? I hope my question is clear.
I used this notebook as a reference https://github.com/naity/image_tabular/blob/master/siim_isic_integrated_model.ipynb
Hi all!
I am using the MixedDL to combine Tabular and NLP.
mixedDL1 = MixedDL(self.tab_dl[0], self.nlp_dl[0])
mixedDL2 = MixedDL(self.tab_dl[1], self.nlp_dl[1])
self.dls = DataLoaders(mixedDL1, mixedDL2)
I am using the MixedDL class with the one_batch function that @muellerzr defined:
def one_batch(self):
    "Grab one batch of data"
    with self.fake_l.no_multiproc():
        res = first(self)
    if hasattr(self, 'it'):
        delattr(self, 'it')
    return res
But when I run this function I get the following error:
File "/home/admin/PycharmProjects/tabular-nlp/tabular_nlp/concat_model/concat_pipeline.py", line 318, in create_databunch
batch = mixedDL1.one_batch()
File "/home/admin/PycharmProjects/tabular-nlp/tabular_nlp/concat_model/concat_pipeline.py", line 93, in one_batch
res = first(self)
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/fastcore/basics.py", line 547, in first
return next(x, None)
File "/home/admin/PycharmProjects/tabular-nlp/tabular_nlp/concat_model/concat_pipeline.py", line 77, in __iter__
z = zip(*[_loaders[i.fake_l.num_workers == 0](i.fake_l) for i in self.dls])
File "/home/admin/PycharmProjects/tabular-nlp/tabular_nlp/concat_model/concat_pipeline.py", line 77, in <listcomp>
z = zip(*[_loaders[i.fake_l.num_workers == 0](i.fake_l) for i in self.dls])
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 552, in __init__
self._dataset_fetcher = _DatasetKind.create_fetcher(
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 51, in create_fetcher
return _utils.fetch._IterableDatasetFetcher(dataset, auto_collation, collate_fn, drop_last)
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 21, in __init__
self.dataset_iter = iter(dataset)
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/fastai/data/load.py", line 30, in __iter__
def __iter__(self): return iter(self.d.create_batches(self.d.sample()))
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/fastai/data/load.py", line 103, in sample
return (b for i,b in enumerate(self.__idxs) if i//(self.bs or 1)%self.num_workers==self.offs)
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/fastcore/basics.py", line 388, in __getattr__
if attr is not None: return getattr(attr,k)
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/fastcore/basics.py", line 388, in __getattr__
if attr is not None: return getattr(attr,k)
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/fastcore/transform.py", line 204, in __getattr__
def __getattr__(self,k): return gather_attrs(self, k, 'fs')
File "/home/admin/.virtualenvs/tabular-nlp/lib/python3.8/site-packages/fastcore/transform.py", line 162, in gather_attrs
if k.startswith('_') or k==nm: raise AttributeError(k)
AttributeError: _DataLoader__idxs
Could someone tell me where this error comes from? Or how I can fix it?
Thanks in advance!
Can you share your full MixedDL code with me that you are using?
Yes, this is the full MixedDL code that I am using:
class MixedDL:
    def __init__(self, tab_dl: TabDataLoader, nlp_dl: DataLoaders, device="cpu:0"):
        "Stores away `tab_dl` and `vis_dl`, and overrides `shuffle_fn`"
        self.device = device
        tab_dl.shuffle_fn = self.shuffle_fn
        nlp_dl.shuffle_fn = self.shuffle_fn
        self.dls = [tab_dl, nlp_dl]
        self.count = 0
        self.fake_l = _FakeLoader(self, False, 0, 0, 0)

    def __len__(self):
        return len(self.dls[0])

    def shuffle_fn(self, idxs):
        "Generates a new `rng` based upon which `DataLoader` is called"
        if self.count == 0:
            self.rng = self.dls[0].rng.sample(idxs, len(idxs))
            self.count += 1
            return self.rng
        else:
            self.count = 0
            return self.rng

    def to(self, device):
        self.device = device

    def __iter__(self):
        "Iterate over your `DataLoader`"
        z = zip(*[_loaders[i.fake_l.num_workers == 0](i.fake_l) for i in self.dls])
        for b in z:
            if self.device is not None:
                b = to_device(b, self.device)
            batch = []
            batch.extend(self.dls[0].after_batch(b[0])[:2])
            batch.append(self.dls[1].after_batch(b[1][0]))
            try:
                batch.append(b[1][1])
                yield tuple(batch)
            except:
                yield tuple(batch)

    def one_batch(self):
        "Grab a batch from the `DataLoader`"
        with self.fake_l.no_multiproc():
            res = first(self)
        if hasattr(self, "it"):
            delattr(self, "it")
        return res

    def show_batch(self):
        "Show a batch from multiple `DataLoaders`"
        for dl in self.dls:
            dl.show_batch()
        plt.show()
Using this I mixed tabular and nlp dataloaders:
mixedDL1 = MixedDL(self.tab_dl[0], self.nlp_dl[0])
mixedDL2 = MixedDL(self.tab_dl[1], self.nlp_dl[1])
Where self.tab_dl is a TabularDataLoaders, self.tab_dl[0] is a TabDataLoader, self.nlp_dl is a DataLoaders, and self.nlp_dl[0] is a SortedDL.
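One possible reading of the `AttributeError: _DataLoader__idxs` in the traceback above, offered as a guess and not a verified diagnosis: the name looks like Python name mangling of a private attribute `self.__idxs` inside a class called DataLoader, and the trace shows `sample` reading it. If that attribute is normally set by the inner loader’s own `__iter__`, then iterating `i.fake_l` directly, as this `MixedDL.__iter__` does, would skip that step. The toy class below only demonstrates the Python mechanism; `Loader` is made up and is not fastai code.

```python
class Loader:
    "Toy loader: `__iter__` prepares private state that `sample` later reads."
    def __init__(self, n): self.n = n

    def __iter__(self):
        # Inside the class body, `self.__idxs` is stored as `_Loader__idxs`.
        self.__idxs = list(range(self.n))
        return iter(self.sample())

    def sample(self):
        return [i for i in self.__idxs]

dl = Loader(3)
assert list(dl) == [0, 1, 2]  # normal path: __iter__ sets the attribute first

fresh = Loader(3)
try:
    fresh.sample()  # skipping __iter__ leaves the private attribute unset
except AttributeError as e:
    message = str(e)
```

If this is what is happening, the version of fastai installed may differ from the one the original MixedDL was written against, so comparing `DataLoader.__iter__` in the installed `fastai/data/load.py` with what `MixedDL.__iter__` does would be a reasonable first check.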
Hi @muellerzr Zach, fascinating work! Do you have a Colab notebook or GitHub repo for testing out this hybrid model? It would be easier for me to follow if you had a sample dataset to play with.
Hello @Saioa, glad to see your experiment on the tab+text hybrid! Do you have any updates on it? I have an application where I want to test out this hybrid approach.
Sadly I do not, the data was proprietary :( but we can debug anything you’re working on together.
Hi @wjlgatech!
No, I didn’t make any more progress on the hybrid model. For the problem I was facing, it was enough to add the loss of the NLP model as a column for the tabular model, and that’s how we solved it.
Still, at some point I want to go back to this code, so please keep me up to date on any progress you make.