I’m thinking of working on a project that involves multiple modalities of data and wanted to share my thoughts to get some feedback. Think of the problem of sentiment classification where the input contains both text and images (for example, Amazon reviews where some users share images of the products). I was thinking of designing a classifier in this way:
- Assuming each review is a datapoint, we represent that datapoint with a single tensor that concatenates the appropriately preprocessed image values and text values.
- We then input this tensor into a deep learning model, where it is split back into its two parts: the image part is fed into a CNN and the text part is fed into a pre-trained BERT model.
- The outputs of the CNN and BERT are then concatenated and fed to a small classifier head made of a couple of linear layers.
- The output of this classifier is the model’s prediction, from which we compute the loss, and the entire network is trained end-to-end.
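To make the steps above concrete, here is a minimal PyTorch sketch of what I have in mind. It is only an illustration under several assumptions: the class name `ImageTextClassifier`, the layer sizes, the 32x32 image size, and the vocabulary size are all made up, and the text branch is a plain embedding layer with mean pooling standing in for the pre-trained BERT model so the example runs without downloading weights.

```python
import torch
import torch.nn as nn


class ImageTextClassifier(nn.Module):
    """Hypothetical late-fusion model: CNN branch + text branch + small head."""

    def __init__(self, image_shape=(3, 32, 32), seq_len=16, vocab_size=1000,
                 image_dim=64, text_dim=64, num_classes=2):
        super().__init__()
        self.image_shape = image_shape
        self.seq_len = seq_len
        # Number of leading values in the flat input that belong to the image.
        self.n_image_values = image_shape[0] * image_shape[1] * image_shape[2]

        # Image branch: a tiny CNN (a real project would likely use a deeper one).
        self.cnn = nn.Sequential(
            nn.Conv2d(image_shape[0], 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (batch, 16, 1, 1)
            nn.Flatten(),             # (batch, 16)
            nn.Linear(16, image_dim),
        )

        # Text branch: ASSUMPTION, a stand-in for pre-trained BERT.
        # A BERT encoder producing one vector per review would replace this.
        self.token_embedding = nn.Embedding(vocab_size, text_dim)

        # Classifier head: a couple of linear layers on the fused features.
        self.classifier = nn.Sequential(
            nn.Linear(image_dim + text_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        # x: (batch, n_image_values + seq_len), image values then token ids.
        image_flat, token_ids = torch.split(
            x, [self.n_image_values, self.seq_len], dim=1
        )
        images = image_flat.reshape(-1, *self.image_shape)
        token_ids = token_ids.long()  # ids were stored as floats in the tensor

        image_features = self.cnn(images)                            # (batch, image_dim)
        text_features = self.token_embedding(token_ids).mean(dim=1)  # (batch, text_dim)

        fused = torch.cat([image_features, text_features], dim=1)
        return self.classifier(fused)  # unnormalized class scores (logits)


# One training step on random data, just to show the end-to-end flow.
model = ImageTextClassifier()
batch = 4
images = torch.rand(batch, 3, 32, 32)
token_ids = torch.randint(0, 1000, (batch, 16))
x = torch.cat([images.flatten(1), token_ids.float()], dim=1)  # one tensor per review
labels = torch.randint(0, 2, (batch,))

logits = model(x)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()  # gradients reach both branches and the classifier head
```

Packing token ids and pixel values into one float tensor follows the description above, but passing the image and the text as two separate inputs would probably be simpler in practice.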
I’m looking for thoughts and feedback on this task and approach. Specifically,
- Does this approach make sense, and do you think it’s worth trying, or is there a better approach that you can point me to?
- One issue I can think of is that some reviews might have multiple images. How do I combine multiple images from a single review into one datapoint?
- Do you know of any datasets/tasks that conform to this setup (text plus images)?
- Do you know of any previous work that has tackled this problem?
Regarding the last question, I did some preliminary research and only came across a very limited number of papers:
- Combining textual and visual representations for multimodal author profiling: Notebook for PAN at CLEF 2018
- Multimodal deep networks for text and image-based document classification