Multimodal end-to-end deep learning

I’m thinking of working on a project that involves multiple models of data and wanted to share my thoughts to get some feedback. Think of problem of sentiment classification where the input contains both text and input (maybe Amazon reviews where some users share images of the products). I was thinking of designing a classifier in this way:

  1. Assuming each review is a datapoint, we represent that datapoint with a tensor that contains the concatenated values representing the image and the text preprocessed appropriately.
  2. We then input this data into a deep learning model where the tensor is correctly separated and the image part is fed into a CNN and the text part is fed to a pre-trained BERT model.
  3. The outputs of both the CNN and BERT are then concatenated and fed to a linear classifier with a couple of layers.
  4. The output of this linear classifier is the prediction of the model from which we derive the loss and the entire network is trained end-to-end.

I’m looking for thoughts and feedback on this task and approach. Specifically,

  1. Does this approach make sense and do you think its worth trying or is there a better approach that you can point me to?
  2. One issue I can think for some reviews there might be multiple images. How do I merge these multiple images to a single review to form a datapoint?
  3. Do you know of any datasets/tasks that conforms to this modality?
  4. Do you know of any previous work that has tackled this problem?

To point 4, I did some preliminary research and only came across a very limited number of papers:



I think this approach makes a lot of sense. There are some resources. E.g. You can very easily train a multi modal model with autokeras.

You could also just build up a two seperate models, eg. with Keras, and then use a concatenation layer, to combine them and add a classification/regression head.

Yeah I think that makes sense generally. Concatenation is a good baseline, however there are other fusing mechanism that you can you use too like a BiLinear layer.

I don’t work in exactly the same area but one of the things I’m researching is how to fuse meta-data (including images and text) with temporal data in order to improve time series forecasts and classification. So far I’ve generally used concatenation and bilinear layers. Anyways if you want to chat feel free to reach out to me.