I have a dataset of images and text descriptions that describe a certain object that I want to classify. In my specific case each datapoint is a picture of a house and a paragraph description of that house, and I want to classify whether that house needs renovation (i.e. binary classification).
I was wondering if anyone had any code they could share on how to combine an image learner and a text learner to provide one classification?
I’ve done something like this in Keras before, but it would be great to utilise fastai’s transfer learning capabilities individually for the image and text components.
Hey Edward,
The task you are describing is called image captioning. You can follow the topic on GitHub and find some good repositories there.
Topic - https://github.com/topics/image-captioning
One good repo using PyTorch for image captioning: https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
This is usually based on encoder-decoder models, in which an encoder extracts features from the image and a decoder then uses those features to generate a text description of the image.
I would love to help and learn with you.
Thanks for your reply @navneetkrch, but I have edited my question to clarify that I want the text and the image to both be inputs to a binary classification problem.
@navneetkrch @eric3 @bow This has been on the back-burner for me for a while, but I’ve just started to look back into it.
It appears another FastAI student, @etremblay, has had some success combining all three of the Vision, Tabular and Text modules, see here. So I am going to begin with this as my framework.
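For anyone following along, the general idea behind combining an image learner and a text learner is often called late fusion: each model produces a feature vector, the vectors are concatenated, and a small head makes the single binary prediction. Below is a minimal, framework-agnostic sketch in plain Python. It is not @etremblay's code and does not use fastai's API; the class `LateFusionHead` and its methods are hypothetical names, and the feature vectors are assumed to come from already-trained image and text encoders.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class LateFusionHead:
    """Hypothetical logistic head over concatenated image + text features."""
    weights: List[float]
    bias: float = 0.0

    def _fuse(self, image_feats: Sequence[float], text_feats: Sequence[float]) -> List[float]:
        # Late fusion: simply concatenate the two feature vectors.
        fused = list(image_feats) + list(text_feats)
        if len(fused) != len(self.weights):
            raise ValueError("fused feature length does not match head size")
        return fused

    def predict_proba(self, image_feats: Sequence[float], text_feats: Sequence[float]) -> float:
        """Probability that the house needs renovation (label 1)."""
        fused = self._fuse(image_feats, text_feats)
        z = sum(w * x for w, x in zip(self.weights, fused)) + self.bias
        return sigmoid(z)

    def sgd_step(self, image_feats: Sequence[float], text_feats: Sequence[float],
                 label: int, lr: float = 0.1) -> None:
        """One gradient step on binary cross-entropy for a single example."""
        fused = self._fuse(image_feats, text_feats)
        # For sigmoid + binary cross-entropy, d(loss)/d(z) = p - y.
        grad = self.predict_proba(image_feats, text_feats) - label
        self.weights = [w - lr * grad * x for w, x in zip(self.weights, fused)]
        self.bias -= lr * grad


# Usage: 3 image features and 2 text features (made-up numbers).
head = LateFusionHead(weights=[0.0] * 5)
image_feats, text_feats = [0.2, 0.7, 0.1], [0.9, 0.4]
for _ in range(20):
    head.sgd_step(image_feats, text_feats, label=1)
print(head.predict_proba(image_feats, text_feats))
```

In a real setup the head would be a small neural network trained on top of the two pretrained encoders (optionally fine-tuning them), but the concatenate-then-classify structure stays the same.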