Image caption generation with high cardinality


The dataset I am working with consists of Word documents containing images and captions describing what is going on in each image. These captions vary greatly, from “1234 Maple Lane North Elevation” to “2nd Floor drywall damage”, “Typical 2 bedroom apartment kitchen”, and “Window repair progress”.

I plan on building the dataset by scraping the Word documents for the images and captions. Is this okay to do given the high degree of cardinality?
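For what it’s worth, here is a rough sketch of how the scraping step could work using only the standard library. A .docx file is a zip archive: the body text lives in word/document.xml and the embedded pictures under word/media/. This assumes each caption is the text paragraph immediately following its picture, which you would need to verify against your own documents.

```python
# Sketch: pair each embedded picture in a .docx body with the caption
# paragraph that follows it. Assumes captions come right after images.
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
A = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"

def pair_images_with_captions(document_xml: bytes):
    """Return a list of (image_relationship_id, caption_text) tuples."""
    body = ET.fromstring(document_xml).find(f"{W}body")
    pairs, pending_image = [], None
    for para in body.iter(f"{W}p"):
        blip = para.find(f".//{A}blip")  # marks an embedded picture
        text = "".join(t.text or "" for t in para.iter(f"{W}t")).strip()
        if blip is not None:
            pending_image = blip.get(f"{R}embed")  # e.g. "rId7"
        elif text and pending_image is not None:
            pairs.append((pending_image, text))  # caption = next text para
            pending_image = None
    return pairs
```

The relationship id maps to an actual image file under word/media/ via word/_rels/document.xml.rels, which you can read with the zipfile module; from there you can write out (image file, caption string) pairs as your training set.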

Thank you.

Hi Cullen,

What do you mean by “the high degree of cardinality”?


Hi Malcolm,

As an example, classifying an image as either a dog or a cat would be a cardinality of two. In my case, the image captions have an almost infinite cardinality, and I am not sure whether I need to take a certain approach or scrub every piece of data to fit into a smaller number of categories.

Have a good weekend,

Hi Cullen,

I am not sure I fully understand your question, but will give it a try.

The issue is that there will be a very large number of images and their captions, and a generated caption could be almost any phrase. But image captioning is not a classification task. You are not selecting a particular caption out of a fixed set of captions, so the cardinality of the categories does not apply here. I think you are mixing up the nature of two different tasks.

Image captioning combines NLP and image processing. I think the most recent and effective models use Transformers with ResNets. There’s even one that does the reverse: caption to image. It’s a very active area of research.
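To make the shape of such a model concrete, here is a toy architectural sketch in PyTorch, not a tested or tuned model: a CNN (standing in for a pretrained ResNet backbone) encodes the image into a sequence of feature vectors, and a Transformer decoder generates the caption token by token by attending over them. All the sizes and layer choices here are illustrative placeholders.

```python
# Toy image-captioning skeleton: CNN encoder + Transformer decoder.
# The Conv2d is a stand-in for a real ResNet feature extractor.
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        # patchify the image into d_model-dim feature vectors
        self.encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),
            nn.Flatten(2),  # (B, d_model, num_patches)
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, tokens):
        mem = self.encoder(images).transpose(1, 2)  # (B, num_patches, d)
        tgt = self.embed(tokens)                    # (B, T, d)
        # causal mask so each position only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.decoder(tgt, mem, tgt_mask=mask)
        return self.out(h)  # per-position logits over the vocabulary
```

Training would minimize cross-entropy between these logits and the next caption token, so there is no fixed set of caption "classes" anywhere, only a vocabulary of words or subword tokens.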

That said, there is already extensive work on image captioning to be found on these forums and by an internet search for “image captioning”. You can probably even find an architecture in PyTorch that is already designed and tested.

HTH, and good luck with your project.