How to combine a deep learning model with a shallow tabular model to utilize two vastly different kinds of feature input to improve modeling accuracy

Loved the course and learned a lot.

One area I would benefit a lot from is combining image, text, and tabular features.

The MVP approach i have tried is to ensemble an image model, text model, and tabular model, and it works, but intuitively I should be able to do way better by combining them in a nonlinear way.

Any common techniques for this?

I’m thinking to do an RF or dense layer or two on the embedding output of all the models if nothing else.

In attempt to convert my problem into an already solved problem and learn from existing techniques, it very much resembles an ad model. So i’m sure there must be plenty of research on this. Any tips/kaggles here?


  • paragraph of text (~1k lines of code in my case)
  • a picture of a product (not a picture in my case but a 2D transform of data that a cnn works decently on)
  • some tabular information about the audience, like male/female, age, etc


  • product they bought (classification)
  • how much they were willing to pay for it (regression)

Don’t worry, my actual topic isn’t advertisements, and there’s no societal bias affect. Just an example. My real problem is a bit more convoluted than this and involves too many technical things that aren’t relevant to getting the ML part off the ground better using the relationships between the inputs.

Ah all I needed to do is search a little harder. Seems the thing I’m trying to do is called a joint model.

And here is a perfect Kaggle example: images of cancer + patient meta data.

1 Like