Resources for generating text captions (sentence length) from images?


I have gone through the summaries to try and find whether Jeremy talks about generating text from an image. The closest I could find was the CLIP Interrogator in lesson 12. Simple image captioning does not work for my use case, because I am trying to caption a photo with a full sentence describing what is occurring in the image, which has arbitrarily high cardinality. While CLIP Interrogator comes closer than plain captioning, it still is not enough.

Can what we learn in part 2 be applied to generate captions? Do you have any resources for learning how to do so?

Thank you

There’s a competition happening on Kaggle where the goal is to predict the prompt of an image generated by stable diffusion. You may find some pointers in the discussions happening there.

I’m only on lesson 10 of the course so far, so I don’t know too much, but I suppose certain components of the diffusion pipeline could be run in reverse: an image is the input, and the model instead has to predict the feature vector of the caption that describes it. The model would learn the appropriate length of the vector/caption during training.
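To make that idea concrete, here is a minimal toy sketch of the "predict the caption's feature vector" step. Everything in it is a stand-in: the random arrays play the role of image and caption embeddings (real CLIP features would come from pretrained encoders), and the learned head is just a single linear layer trained with mean-squared error and plain gradient descent. It only illustrates the shape of the problem, not a working captioner.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, not real CLIP dimensions.
n, d_img, d_txt = 200, 64, 32

# Stand-ins for encoder outputs: pretend img_emb came from an image
# encoder and txt_emb is the embedding of each image's caption.
img_emb = rng.normal(size=(n, d_img))
W_true = rng.normal(size=(d_img, d_txt)) / np.sqrt(d_img)
txt_emb = img_emb @ W_true

# The "reverse" head: map an image embedding to a caption embedding.
W = np.zeros((d_img, d_txt))
lr = 0.3
for _ in range(300):
    pred = img_emb @ W
    grad = img_emb.T @ (pred - txt_emb) / n  # gradient of MSE w.r.t. W
    W -= lr * grad

mse = float(np.mean((img_emb @ W - txt_emb) ** 2))
print(mse)  # should be near zero after training
```

In a real system the predicted embedding would then have to be decoded back into text (e.g. by a language-model decoder conditioned on it), which is the hard part that plain captioning models handle end to end.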

Thank you. Unfortunately the captions are specialized for a particular industry. I trained a custom model, but I am now having trouble with CORS errors (which I had already fixed once) while hosting an endpoint for inference.
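For reference, the usual fix is to make the inference server send CORS headers on every response, including the `OPTIONS` preflight. Below is a minimal stdlib-only sketch of that pattern; the `"*"` origin, the route, and the stubbed JSON payload are all placeholders, and whatever framework actually serves the model (e.g. its own middleware) would be the idiomatic way to do this in practice.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class CORSHandler(BaseHTTPRequestHandler):
    # Attach the CORS header to every response the server sends.
    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")  # placeholder: restrict to your frontend origin
        super().end_headers()

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"caption": "stub"}')  # placeholder inference result

    def do_OPTIONS(self):
        # Answer the CORS preflight request the browser sends first.
        self.send_response(204)
        self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

# Quick self-check: start the server and confirm the header is present.
server = ThreadingHTTPServer(("127.0.0.1", 0), CORSHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
resp = urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/")
cors = resp.headers.get("Access-Control-Allow-Origin")
print(cors)
server.shutdown()
```

If the errors came back after being fixed once, it is worth checking that the preflight (`OPTIONS`) path still returns the headers, since some deployments only add them to the main response.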