Combining Tabular + Images in fastai2 (and should work with almost any other type)

So just recently (yesterday) I figured out a way to combine Tabular + Images in the fastai2 framework. This general approach should work with just about any DataLoader, and I’ll try to explain and discuss why here.

Caution: so far this only works with Tab + Vision; I still need to figure out why it won’t work for text. I can verify it works with any DataLoader except LMDataLoaders (as those have their own special bits, etc.)

The Pipeline

Here is an outline of how you go about doing this:

  1. Make your tab and vis DataLoaders
    (vis = Vision, tab = Tabular)
  2. Combine them together into a Hybrid DataLoader
  3. Set up your test_dl however you choose
  4. Train
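
To make the outline concrete, here’s a rough sketch of steps 1 and 2, assuming a hypothetical DataFrame df with an image filename column, categorical/continuous feature columns, and an is_valid split column (all column names, transforms, and the batch size are placeholders, not from the original post; MixedDL is the class we build below):

from fastai2.tabular.all import *
from fastai2.vision.all import *

# 1. Build the tabular and vision DataLoaders over the *same* rows and split,
#    so the two loaders line up index-for-index
splits = ColSplitter('is_valid')(df)
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=['cat_col'], cont_names=['cont_col'],
                   y_names='label', splits=splits)
tab_dl = to.dataloaders(bs=64)

vis_dl = ImageDataLoaders.from_df(df, path='.', fn_col='image_name',
                                  label_col='label', valid_col='is_valid',
                                  item_tfms=Resize(224), bs=64)

# 2. Combine the matching loaders into a MixedDL (one per split)
mixed_dl = MixedDL(tab_dl.train, vis_dl.train)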

The Code:

Now let’s talk about the code. Our “DataLoader” won’t inherit from the DataLoader class (hence the quotes). Instead we’ll give it the minimal DataLoader-like behavior that’s needed and have everything else work internally. Specifically, these pieces:

  • FakeLoader
  • __len__
  • __iter__
  • one_batch
  • show_batch
  • shuffle_fn
  • to

Now to build this I’m going to walk us through it with @patch from the fastcore library. Basically this lets us lazily define the class as we go, so don’t get confused as to why it’s spread across more than one block.
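
If you haven’t seen @patch before, here’s a tiny sketch of what it does (Greeter is just a throwaway example class): it attaches a function to an existing class based on the type annotation of the first argument.

from fastcore.all import patch

class Greeter: pass

@patch
def hello(self:Greeter, name):
    "Added to `Greeter` after the class was defined"
    return f"Hello, {name}!"

Greeter().hello("fastai")  # 'Hello, fastai!'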

__init__ and FakeLoader

The __init__ for our class needs to store the device we’re running on, the two DataLoaders we’re passing in, a count, and a _FakeLoader, and it overrides each inner DataLoader’s shuffle_fn with our new one (for now this is undefined; we’ll discuss it in a moment). The _FakeLoader is used during __iter__; see the regular DataLoader source code to see it there:

from fastai2.data.load import _FakeLoader, _loaders

class MixedDL():
    def __init__(self, tab_dl:TabDataLoader, vis_dl:TfmdDL, device='cuda:0'):
        "Stores away `tab_dl` and `vis_dl`, and overrides `shuffle_fn`"
        self.device = device
        tab_dl.shuffle_fn = self.shuffle_fn  # both inner loaders share our shuffle
        vis_dl.shuffle_fn = self.shuffle_fn
        self.dls = [tab_dl, vis_dl]
        self.count = 0
        self.fake_l = _FakeLoader(self, False, 0, 0)  # used by `__iter__`

shuffle_fn

Now we’ll look at the shuffle_fn. What needs to happen? The shuffle_fn returns a list of indices for us to use; that list is stored inside self.rng, and we want those indices to change only every second time we call shuffle_fn (as we call it once for each of our internal DataLoaders). This ensures that both loaders are mapped to the same indices when preparing our batch. This is what that looks like:

@patch
def shuffle_fn(x:MixedDL, idxs):
    "Generates a new `rng` based upon which `DataLoader` is called"
    if x.count == 0: # if we haven't generated an rng yet
        x.rng = x.dls[0].rng.sample(idxs, len(idxs))
        x.count += 1
        return x.rng
    else:
        x.count = 0
        return x.rng

This is all that’s needed to ensure that all of our batches get shuffled together. If you’re using more than two internal DataLoaders, just reset count once it has been called once per loader (see the sketch below).
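
For example, a minimal sketch of the same idea generalized to any number of internal DataLoaders (this variant is just an illustration, not from the original post):

@patch
def shuffle_fn(x:MixedDL, idxs):
    "Generate one `rng`, reused until every internal `DataLoader` has asked for it once"
    if x.count == 0: # first call in this round: draw a fresh permutation
        x.rng = x.dls[0].rng.sample(idxs, len(idxs))
    x.count = (x.count + 1) % len(x.dls) # reset once every loader has called
    return x.rng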

While we’re at it, we’ll take care of two other functions, __len__ and to. __len__ just needs to grab the length of one of our DataLoaders, and to just stores the device we want to run on:

@patch 
def __len__(x:MixedDL): return len(x.dls[0])

@patch
def to(x:MixedDL, device): x.device = device

__iter__

Now let’s move on to something a bit more complex: the iterator. Our iterator needs to take each batch from our internal loaders, apply the after_batch transforms from their respective DataLoaders, move everything to the device, and finally yield one combined batch. While this may look scary, the _loaders bit is the same as in the DataLoader class; it’s just how we access the underlying loaders:

@patch
def __iter__(dl:MixedDL):
    "Iterate over your `DataLoader`"
    z = zip(*[_loaders[i.fake_l.num_workers==0](i.fake_l) for i in dl.dls])
    for b in z:
        if dl.device is not None: 
            b = to_device(b, dl.device)
        batch = []
        batch.extend(dl.dls[0].after_batch(b[0])[:2]) # tabular cat and cont
        batch.append(dl.dls[1].after_batch(b[1][0])) # Image
        try: # In case the data is unlabelled
            batch.append(b[1][1]) # y
            yield tuple(batch)
        except:
            yield tuple(batch)

Notice that to_device recursively moves everything in the batch to the device before we apply the batch transforms (this is how fastai moves it all to the GPU).

one_batch

Alright, so we can build it and iterate it; now how do we get our good ol’ fashioned one_batch? Quite easily. We call fake_l.no_multiproc() (which temporarily sets the num_workers in our DataLoader to zero) and grab the first batch, while also discarding any iterator the DataLoader may be holding on to (since first calls next(iter(dl))):

@patch
def one_batch(x:MixedDL):
    "Grab a batch from the `DataLoader`"
    with x.fake_l.no_multiproc(): res = first(x)
    if hasattr(x, 'it'): delattr(x, 'it')
    return res

You may or may not see an exception (Sylvain, if you’re reading this, it’s:

Exception ignored in: <generator object MixedDL.__iter__ at 0x7f75b31d0cd0>
RuntimeError: generator ignored GeneratorExit

) However, I’ve found this can be ignored, as all your data is still returned. Your batch now comes back as [cat, cont, im, y].
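
As a quick sanity check that everything lines up, something like this should work (the shape comments are illustrative assumptions, not from the original post):

cat, cont, im, y = mixed_dl.one_batch() # `mixed_dl` being any `MixedDL` built as above
print(cat.shape, cont.shape, im.shape, y.shape)
# e.g. [bs, n_cat], [bs, n_cont], [bs, 3, 224, 224], [bs]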

show_batch

Next up is probably the easiest of all the functions. All we want to do here is call show_batch on each DataLoader. It’s as simple as it sounds:

@patch
def show_batch(x:MixedDL):
    "Show a batch from multiple `DataLoaders`"
    for dl in x.dls:
        dl.show_batch()

For an example output, here is one for a recent (and ongoing) kaggle comp:

And that’s all that’s needed to start training and have all the functionalities of fastai while bringing in the various data types. The key that made this entire thing possible is how fastai does the shuffle_fn, and the fact that it operates on indices.

test_dl

The last thing I’ll show is how to do the test_dl. Ideally, you build the full Image and Tabular DataLoaders, which gives you access to the .test_dl function. From there, simply do something like:

im_test = vis_dl.test_dl(test_df)
tab_test = tab_dl.test_dl(test_df)
test_dl = MixedDL(tab_test, im_test)

And you’re good to go! The main reason we don’t have to worry about enabling shuffling, etc. is that it’s all handled at the level of the interior DataLoaders.
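
From there, inference should follow the usual fastai pattern; a hedged sketch, assuming a learn already trained on the train/valid MixedDLs (depending on your version you may also want the new patch contributed further down this thread):

preds, _ = learn.get_preds(dl=test_dl)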

I hope this helps you guys, let me know if there are any questions! (Or recommendations on how to improve this method further)


Some quick gotchas with this (I’ll update this as I find them):

  • Pay close attention to ensure you’re grabbing your y’s properly. For instance, if my y’s were linked with the Tab DL, I could instead write my __iter__ function like so:
tab = dl.dls[0].after_batch(b[0])
batch.extend(tab[:2]) # tabular cat and cont
batch.append(dl.dls[1].after_batch(b[1][0])) # image
batch.append(tab[2]) # y
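
For context, here’s a hedged sketch of what the full __iter__ looks like with those lines dropped in (same structure as the original patch above):

@patch
def __iter__(dl:MixedDL):
    "Iterate, taking the `y` from the tabular `DataLoader`"
    z = zip(*[_loaders[i.fake_l.num_workers==0](i.fake_l) for i in dl.dls])
    for b in z:
        if dl.device is not None:
            b = to_device(b, dl.device)
        tab = dl.dls[0].after_batch(b[0])
        batch = []
        batch.extend(tab[:2]) # tabular cat and cont
        batch.append(dl.dls[1].after_batch(b[1][0])) # image
        batch.append(tab[2]) # y
        yield tuple(batch)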

This is awesome!

Quick and, I guess, naive question: since this is a mixed dataset, what would be your go-to architecture, loss, etc. to train this model?


Amazing!!! One question, what’s the difference in your code between MixedDL and HybridDL?


Me not checking what I’m calling things :man_facepalming: :slight_smile: (I’ve fixed this)

That I do not know :slight_smile:

I’d start by looking at some Kaggle competitions for the various techniques they use. The first one off the top of my head is the pet adoption one.


It would be awesome to see an end-to-end example with one of these mixed dataloaders!


I’ll do one for the pet adoption Kaggle kernel when I have time; that may be the best approach. :slight_smile:

Hi muellerzr hope all is well!
Great work!
mrfabulous1 :smiley: :smiley:


This is really great work! I’m struggling with creating a learner object with the melanoma classification example though. I’m looking forward to trying this method out on a lot of different datasets. Hopefully, this can end up in the fastai2 library.

It can’t really, and I’ll try to explain why:

The methodology for the DataLoader varies depending on what your combination results in, specifically in the iterator. That particular function needs modification based on what your input is and is expected to be, hence the guide :slight_smile:

It’s more the technique than the code itself :slight_smile:


Were you able to train with this setup? I have the MixedDL setup, one_batch is working, etc. But when attempting to train or run the learning rate finder I have the following error message.

/opt/conda/lib/python3.7/site-packages/fastai2/learner.py in _do_epoch_train(self)
    173         try:
--> 174             self.dl = self.dls.train;                        self('begin_train')
    175             self.all_batches()

AttributeError: 'MixedDL' object has no attribute 'train'

I think it’s because there are no MixedDL.train and MixedDL.valid set anywhere. At first glance at the fastai2 codebase I’m not sure how to set them yet. Still investigating.


Ahhh, I see the confusion. The MixedDL is one DataLoader, so you need to make one for your train set and one for your valid set, and then wrap them inside a DataLoaders instance, i.e.:

dls = DataLoaders(mixedDL1, mixedDL2)

(I wasn’t very explicit about that, my apologies :slight_smile: )
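
Spelled out, a hedged sketch of that wiring (the loader names here are placeholders):

train_mixed = MixedDL(tab_dl.train, vis_dl.train)
valid_mixed = MixedDL(tab_dl.valid, vis_dl.valid)
dls = DataLoaders(train_mixed, valid_mixed) # now `dls.train` / `dls.valid` exist for the `Learner`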


Should have thought of that. :laughing:

Thanks.


I think the GeneratorExit error comes from the try-except block in __iter__. Removing the try-except block prevents the GeneratorExit error.

@patch
def __iter__(dl:MixedDL):
    "Iterate over your `DataLoader`"
    z = zip(*[_loaders[i.fake_l.num_workers==0](i.fake_l) for i in dl.dls])
    for b in z:
        if dl.device is not None: 
            b = to_device(b, dl.device)
        yield tuple([*dl.dls[0].after_batch(b[0])[:2], *dl.dls[1].after_batch(b[1])])

I think this should work for a test dataset too; I passed an unlabeled dataset to dl.dls[1] and one_batch returned a tuple of length three as expected.

If not, or if you need to do something a little more complicated, you can inherit and create a TestMixedDL.

For example I’m currently concatenating multiple labels into one tensor, so the yield line looks like this:

yield tuple([*dl.dls[0].after_batch(b[0])[:2], dl.dls[1].after_batch(b[1][0]),torch.stack(b[1][1:],dim=1)])

and then I have a TestMixedDL that doesn’t return any labels like so:

class TestMixedDL(MixedDL):
    def __init__(self, tab_dl:TabDataLoader, vis_dl:TfmdDL, device='cuda:0'):
        super().__init__(tab_dl, vis_dl, device)

@patch
def __iter__(dl:TestMixedDL):
    "Iterate over your `DataLoader`"
    z = zip(*[_loaders[i.fake_l.num_workers==0](i.fake_l) for i in dl.dls])
    for b in z:
        if dl.device is not None: 
            b = to_device(b, dl.device)
        yield tuple([*dl.dls[0].after_batch(b[0])[:2], dl.dls[1].after_batch(b[1][0])])

I have a separate issue which occurs whether I use the above code or the original with the try-except block. After a period of time, whether it’s a handful of batches or immediately at the start of the second epoch, memory will explode and I will get a CUDA out of memory error. It doesn’t seem to matter how low the batch size is and has happened with a batch size of two.

[Image: CUDA out of memory error]

Anyone else running into this issue?


I haven’t quite seen that yet with my text+tab experiments, but what are you using to track the memory usage?

Also how many num_workers are you running?

My images are large: they’re 3D and each is 14 MB on disk, which might have something to do with it. I’m using the same number of workers as CPU cores, so that varies with the machine I’m using.

The chart is from the Rapids GPU Dashboard:

But the same thing is visible when polling nvidia-smi (whether running in a Jupyter Notebook or JupyterLab), or when monitoring the Kaggle GPU usage chart.

It also happens across PyTorch 1.4, 1.5, and 1.5.1.


Hi @muellerzr! Awesome work, I’ll use this on my next project!

I’d like to make a small contribution for learn.show_results() to work:

@patch
def show_results(x:MixedDL, b, out, **kwargs):
    "Show results from each inner `DataLoader`, splitting the batch appropriately"
    for i, dl in enumerate(x.dls):
        if i == 0:
            dl.show_results(b=b[:2]+(b[3],), out=out, **kwargs)  # tabular: cat, cont, y
        else:
            dl.show_results(b=b[2:], out=out, **kwargs)  # vision: im, y

@patch
def new(x:MixedDL, *args, **kwargs):
    "Create a new `MixedDL` from new versions of the inner `DataLoaders`"
    new_dls = [dl.new(*args, **kwargs) for dl in x.dls]
    return MixedDL(*new_dls)

Thank you for sharing this!


Hey this is super helpful!

In terms of training a model with tabular data and images as inputs, I guess the two naive routes that I see would be:
  • use the embeddings of the vision model as features for the tabular model and train using the tabular model
  • use the features from the tabular data as inputs to a vision model, most likely concatenated at the end of the model architecture, and train using that modified vision architecture

Am I missing something? Has that been done before? I’ve seen a few threads on how to merge tabular and vision in a smart way, but I’m still not sure of the SOTA way of training such a model.

Thanks!
Elliot


That’s pretty much it. There was a paper published by Facebook about the struggles and how to train such ensemble-based models appropriately; I haven’t looked into it at all yet, though.
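
To make the second route concrete, here’s a minimal PyTorch sketch of a concat-style model; all layer sizes, module choices, and names are illustrative assumptions, not from this thread or the Facebook paper:

import torch
from torch import nn

class TabVisConcat(nn.Module):
    "Toy sketch: pooled CNN features concatenated with tabular features, one shared head"
    def __init__(self, vis_body, tab_body, n_vis, n_tab, n_out):
        super().__init__()
        self.vis_body, self.tab_body = vis_body, tab_body  # e.g. a CNN backbone and a small tabular MLP
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Sequential(nn.Linear(n_vis + n_tab, 128), nn.ReLU(),
                                  nn.Linear(128, n_out))

    def forward(self, cat, cont, im):
        # inputs arrive in the MixedDL batch order: cat, cont, im (y is the target)
        v = self.pool(self.vis_body(im)).flatten(1)  # [bs, n_vis]
        t = self.tab_body(cat, cont)                 # [bs, n_tab], e.g. fastai's TabularModel with out_sz=n_tab
        return self.head(torch.cat([v, t], dim=1))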
