Classification using both image and text

I have a dataset of images and text descriptions, where each pair describes an object that I want to classify. In my specific case, each datapoint is a picture of a house plus a paragraph describing that house, and I want to classify whether the house needs renovation (i.e. binary classification).

I was wondering if anyone had any code they could share on how to combine an image learner and a text learner to provide one classification?

I’ve done something like this in Keras before, but it would be great to utilise fastai’s transfer learning capabilities individually for the image and text components.


Hey Edward,
The task you want to do is called Image Captioning. You can follow the topic on GitHub and find some good repositories there.
Topic - https://github.com/topics/image-captioning

One good repo that uses PyTorch for image captioning:
https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
This is usually based on encoder-decoder models, in which an encoder extracts features from the image and a decoder then uses those features to generate a text description of the image.
I would love to help and learn with you.

Thanks for your reply, @navneetkrch. I have edited my question to clarify that I want the text and the image to both be inputs to a binary classification problem.

This is something I am interested in as well. @ecatkins have you managed to make any progress?

This seems to be the relevant dataset; you will find a lot of kernels for it as well.


They provide image metadata and text descriptions of the pets that need to be adopted.
I hope this helps.


Hi @ecatkins,

Have you made any progress on this topic? I am interested too.

Thanks

Hey @ecatkins,
I am curious about the project. What approach did you finally take?
Thanks

@navneetkrch @eric3 @bow This has been on the back-burner for me for a while, but I’ve just started to look back into it.

It appears another fastai student, @etremblay, has had some success combining all three of the Vision, Tabular and Text modules; see here. So I am going to begin with this as my framework.
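
For anyone following along, here is a minimal sketch of the general idea in plain PyTorch: encode the image and the text separately, concatenate the two feature vectors, and feed them to a small head that outputs a single logit. This is only an illustration under my own assumptions, not @etremblay's code and not a fastai API. The class name `ImageTextClassifier`, the toy encoders, and all dimensions are made up for the example; in practice the two toy encoders would be replaced by the bodies of pretrained image and text models.

```python
import torch
import torch.nn as nn


class ImageTextClassifier(nn.Module):
    """Hypothetical late-fusion model: encode each input, concatenate, classify."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, hidden=64):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.head = nn.Sequential(
            nn.Linear(image_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, 1),  # one logit for the binary label
        )

    def forward(self, images, token_ids):
        img_feat = self.image_encoder(images)     # (batch, image_dim)
        txt_feat = self.text_encoder(token_ids)   # (batch, text_dim)
        fused = torch.cat([img_feat, txt_feat], dim=1)
        return self.head(fused).squeeze(1)        # (batch,)


# Toy stand-ins (assumptions) for pretrained encoders.
toy_image_encoder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                                 # -> (batch, 8)
)
# Mean of token embeddings per description -> (batch, 16)
toy_text_encoder = nn.EmbeddingBag(num_embeddings=1000, embedding_dim=16, mode="mean")

model = ImageTextClassifier(toy_image_encoder, toy_text_encoder, image_dim=8, text_dim=16)

# Random data in place of house photos, tokenised descriptions and labels.
images = torch.randn(4, 3, 32, 32)
token_ids = torch.randint(0, 1000, (4, 20))
labels = torch.tensor([0.0, 1.0, 1.0, 0.0])

logits = model(images, token_ids)
loss = nn.BCEWithLogitsLoss()(logits, labels)
loss.backward()  # gradients flow into both encoders and the head
```

Concatenating features is just the simplest fusion choice. One option with pretrained encoders is to freeze them at first and train only the head, then unfreeze, which is roughly how transfer learning is usually staged.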