How to best integrate both image and tabular data in a single model using fastai?

Suppose that for each example, we have both image and tabular data. For the image, we can use a CNN-based model, and for the tabular data, we can use embeddings and fully connected layers. With fastai, it is easy to build a separate model for each type of data.

But what if we want to build a single model? My thought is that, for each sample, we load its image and tabular data and feed them to a CNN and fully connected neural network, respectively. We then concatenate the outputs and feed them into another fully connected neural network to generate the final outcomes.
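To make the idea concrete, here is a minimal sketch of such a joint model in plain PyTorch. It is not fastai's API: the class name `ImageTabularNet`, the layer sizes, and the tiny conv stack (standing in for a pretrained backbone) are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class ImageTabularNet(nn.Module):
    """Toy joint model (hypothetical): CNN branch + tabular branch + shared head."""

    def __init__(self, n_cont, cat_sizes, emb_dim=4, n_out=1):
        super().__init__()
        # Image branch: a tiny stand-in for a pretrained CNN backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (batch, 8)
        )
        # Tabular branch: one embedding per categorical column, then a linear layer.
        self.embeds = nn.ModuleList([nn.Embedding(n, emb_dim) for n in cat_sizes])
        self.tab = nn.Sequential(
            nn.Linear(emb_dim * len(cat_sizes) + n_cont, 16), nn.ReLU()
        )
        # Head: consumes the concatenated features of both branches.
        self.head = nn.Sequential(
            nn.Linear(8 + 16, 16), nn.ReLU(), nn.Linear(16, n_out)
        )

    def forward(self, img, x_cat, x_cont):
        x_img = self.cnn(img)
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeds)]
        x_tab = self.tab(torch.cat(embs + [x_cont], dim=1))
        return self.head(torch.cat([x_tab, x_img], dim=1))


# Random inputs, only to check that the shapes line up.
model = ImageTabularNet(n_cont=2, cat_sizes=[5, 3])
img = torch.randn(4, 3, 32, 32)
x_cat = torch.stack([torch.randint(0, 5, (4,)), torch.randint(0, 3, (4,))], dim=1)
x_cont = torch.randn(4, 2)
out = model(img, x_cat, x_cont)
print(out.shape)  # torch.Size([4, 1])
```

In a real setup the conv stack would presumably be replaced by a pretrained body, and the feature sizes feeding the head would change accordingly.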

I was wondering how to best implement this using fastai. I imagine that we first need to build a custom Data type or DataBunch that can load both image and tabular data and then a model that integrates three neural networks. I would love to hear your thoughts. Thanks!
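For the data side, one possible shape for the loader is a plain PyTorch `Dataset` that returns the image and the tabular fields together for each sample. This is only a sketch under assumptions: `ImageTabularDataset` is a made-up name, and the images are in-memory tensors here rather than files read from disk.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class ImageTabularDataset(Dataset):
    """Hypothetical dataset yielding ((image, categorical, continuous), target)."""

    def __init__(self, images, x_cat, x_cont, targets):
        self.images, self.x_cat = images, x_cat
        self.x_cont, self.targets = x_cont, targets

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, i):
        # Both modalities are indexed by the same sample id i.
        return (self.images[i], self.x_cat[i], self.x_cont[i]), self.targets[i]


n = 10
ds = ImageTabularDataset(
    images=torch.randn(n, 3, 32, 32),   # stand-in for decoded, resized images
    x_cat=torch.randint(0, 3, (n, 2)),  # two categorical columns
    x_cont=torch.randn(n, 2),           # two continuous columns
    targets=torch.randn(n),
)
dl = DataLoader(ds, batch_size=4)
(img, x_cat, x_cont), y = next(iter(dl))
print(img.shape, x_cat.shape, x_cont.shape, y.shape)
```

The default collate function batches each element of the nested tuple separately, so the model receives one image batch and one tabular batch per step.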

I think this is a great question - there are many circumstances where I think we’d want to merge structured metadata with an image or text analysis. For example, in the recent Kaggle Quick Draw competition, it could have been useful to take country into account along with the image drawn, since people from different regions around the world may have different interpretations of a word (e.g., “football”).

I think the approach you suggest makes sense - I don’t know how to do it, but I bet it can be done.

Hi @ytian, hope you’ve been able to make some headway here! I’ve also worked on a similar problem and found some resources that other forum members have shared. In particular, the guide mentioned in this post was extremely helpful.

I’ve created a GitHub gist that lays out my own steps (although with specifics redacted), and the link is at the bottom of this post. Hopefully this is useful to you!

One of the best things about the fastai library is its support for layered learning rates (a different learning rate for each layer group), and these are absolutely crucial here. For example, you might create a model that has layer groups like:

     ┌───>    conv1 ───> conv2  ───┐
     │                             │
data ┤                             ├───> final ───> output
     │                             │ 
     └─────────────────> tab1 ─────┘

as is the case in my example. The final + tab1 linear layers are grouped together, and so appropriately can be trained using the same learning rate. For the pretrained conv1 and conv2 layer groups, we may want to use a substantially lower learning rate. Anyway, just my two cents.
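As a rough illustration of the layer-group idea outside of fastai, the same effect can be sketched with plain PyTorch optimizer parameter groups. The module names mirror the diagram, but the layer sizes and the two learning rates are assumptions, and this is not the code from the gist.

```python
import torch
import torch.nn as nn

# Stand-ins for the layer groups in the diagram (sizes are made up).
conv_body = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # conv1 + conv2
tab1 = nn.Linear(6, 16)
final = nn.Linear(8 + 16, 1)

conv_lr = 1e-5  # much lower rate for the (pretrained) conv groups
head_lr = 1e-3  # shared rate for the tab1 + final linear layers

optimizer = torch.optim.Adam([
    {"params": conv_body.parameters(), "lr": conv_lr},
    {"params": list(tab1.parameters()) + list(final.parameters()), "lr": head_lr},
])

group_lrs = [g["lr"] for g in optimizer.param_groups]
print(group_lrs)  # [1e-05, 0.001]
```

In fastai the same intent is usually expressed by passing a `slice` of learning rates to the fit call, which spreads the rates across the learner's layer groups.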

Awesome! Thank you for sharing your code, and I will take a closer look at it during the weekend.

It is really nice work. I’ve done something similar and trained my model, but the problem is that I can’t run predictions on anything other than the validation set. Do you have a solution for this?

A long overdue follow-up on this topic: I have written a package named image_tabular, built with nbdev, for integrating image and tabular data for deep learning and training a joint model on the integrated data. There is also a Jupyter notebook demonstrating the entire workflow for the SIIM-ISIC Melanoma Classification competition. The package was largely inspired by the code of @jwuphysics. Thank you!

Please check out my Medium post for details if you are interested:

Thanks a lot! I have tried adapting your code to my use case. I am using exactly the same architecture you illustrate. However, at fit_one_cycle(), I get the below error:

RuntimeError: stack expects each tensor to be equal size, but got [3, 4000, 6000] at entry 0 and [3, 1053, 1872] at entry 2

I get the same error if I run learn.data.one_batch() where learn is the TabConvLearner, like in your code. Not sure how I may have messed up the batch sizes, any ideas?

Thank you

Hm, does the issue arise in line 65?

x = torch.cat([x_tab, x_img], dim=1)

If so, I can imagine two possible “gotchas”. First, the problem is formulated as a regression task, but this can cause some weird behavior if you are trying to do classification. For the latter task you’ll want to remove label_cls=FloatList in the TabularList.from_df and ImageList.from_df constructors. Second, I didn’t allow lots of flexibility when joining the tabular and image models – you can see that the layers are hardcoded-in for resnet34 and for a tabular model with only a single hidden layer.
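To illustrate both points, here is a small sketch in plain PyTorch (not the gist's code). The feature widths 16 and 512 are assumptions; the point is that the head's input size depends on both branches, and that the output size and loss differ between regression and classification.

```python
import torch
import torch.nn as nn

batch = 4
x_tab = torch.randn(batch, 16)   # hypothetical tabular-branch features
x_img = torch.randn(batch, 512)  # hypothetical pooled CNN features

# dim=1 joins along the feature axis; the batch sizes must match.
x = torch.cat([x_tab, x_img], dim=1)
print(x.shape)  # torch.Size([4, 528])

# Regression: one output unit, float targets, e.g. MSE loss.
reg_head = nn.Linear(x.shape[1], 1)
reg_loss = nn.MSELoss()(reg_head(x).squeeze(1), torch.randn(batch))

# Classification: one output per class, integer targets, cross-entropy loss.
n_classes = 2
clf_head = nn.Linear(x.shape[1], n_classes)
clf_logits = clf_head(x)
clf_loss = nn.CrossEntropyLoss()(clf_logits, torch.randint(0, n_classes, (batch,)))
```

If the backbone or the tabular layers change, the input width of the head has to be updated to match, which is where hardcoded sizes tend to bite.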

I’m not sure how helpful this is, so sorry if that doesn’t mitigate your issue. I’ve been using fastai2 for the past six months (highly recommended!) and this kind of task is easier to do. For fastai version 1, I bet that the image_tabular package written by @ytian (see above) is also much better than my simple Gist. :wink:

What’s probably going on here is that your images are not all being resized to the same size via a transform before batching, so they can’t be stacked into a single batch. (Notice this only has to do with the data, not the model.) So ensure your pictures are all the same size (in the API) and you should be OK.
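A minimal way to see this failure in isolation, using plain PyTorch rather than the fastai data block API: stacking tensors of different sizes raises the same kind of error, and resizing each image first makes the batch stackable. The helper `resize` and the target size of 32x32 are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Two "images" with different heights and widths, as in the error message.
imgs = [torch.rand(3, 40, 60), torch.rand(3, 10, 18)]

try:
    torch.stack(imgs)  # fails: tensors must all have the same size
    stacked_raw = True
except RuntimeError:
    stacked_raw = False


def resize(img, size=(32, 32)):
    """Resize one (C, H, W) tensor to a fixed size (hypothetical helper)."""
    out = F.interpolate(img.unsqueeze(0), size=size, mode="bilinear", align_corners=False)
    return out.squeeze(0)


batch = torch.stack([resize(im) for im in imgs])
print(stacked_raw, batch.shape)  # False torch.Size([2, 3, 32, 32])
```

Calling something like `one_batch()` on the data object alone, before building the learner, is a quick way to check that the resize transform is actually being applied.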

Thank you @muellerzr and @jwuphysics for your quick responses.

@muellerzr: So, the image-learner-only version of this runs. For sizing, I use the data block API: .transform(img_tfms, size=size). Do I also need to manually resize all images with PIL beforehand? EDIT: I have manually checked, and the numbers in the error are indeed the image sizes (from the image folder), so I think you are absolutely right. So I suppose the .transform() for image_data is not working as intended.

@jwuphysics: The line 65 you quote (with dim=1) is the same in @ytian’s package, so I suspect it is not the issue. The tabular learner and CNN learner both train OK independently. I have made changes (like FloatList -> CategoryList, the output layer’s out_features size, etc.) to adapt the architecture to binary classification. I have also changed the layers (two hidden layers for tab_learner).

I have put my adaptation of John’s code under this link. run_experiment.py just calls the tabconvlearner.py.