NLP for Associating Text with Images in a PDF

I am looking at a problem where text in the first half of a PDF describes images in the second half of the PDF.

I am trying to predict which text goes with which image based on the coordinates of the text and the image. However, I was wondering whether NLP could let me use the text content itself as an additional X variable.

Right now, I have formulated the problem as a response and a feature matrix, with one row per candidate (text, image) pair:

Y = (Y/N) (is the text associated with the image at position x, y?)

X = (text_pos_x, text_pos_y, text_page_no, img_pos_x, img_pos_y, img_page_no, etc…)
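To make the formulation above concrete, here is a minimal sketch of how I am building the rows, one per candidate (text, image) pair. The `Block` class, the `build_matrix` helper, and the coordinates are hypothetical and only for illustration; the real data comes from the PDF.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Block:
    """A text block or an image, reduced to its position (hypothetical type)."""
    x: float
    y: float
    page: int


def pair_features(text, img):
    # Same column order as X above: text position first, then image position.
    return [text.x, text.y, text.page, img.x, img.y, img.page]


def build_matrix(texts, images, links):
    """Build one row per (text, image) pair.

    links: set of (text_index, image_index) pairs known to be associated.
    """
    X, Y = [], []
    for (i, t), (j, im) in product(enumerate(texts), enumerate(images)):
        X.append(pair_features(t, im))
        Y.append(1 if (i, j) in links else 0)
    return X, Y


# Made-up example: two text blocks on page 1, two images on pages 5 and 6.
texts = [Block(72, 700, 1), Block(72, 650, 1)]
images = [Block(100, 400, 5), Block(100, 400, 6)]
X, Y = build_matrix(texts, images, links={(0, 0), (1, 1)})
```

With T text blocks and I images this gives T × I rows, most of them labelled 0.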

I would like to add the actual text and perhaps the image pixel values, but this would create a very large matrix. Also, a lot of the text is not associated with any image, so the positive Y responses are quite sparse.
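One way I could imagine keeping the text columns bounded is the hashing trick: map each token to one of a fixed number of buckets and store only the non-zero counts. This is just a sketch under my own assumptions (the bucket count of 1024, the simple regex tokenizer, and the `hashed_bag_of_words` name are all made up), not something I have validated on this data.

```python
import hashlib
import re

N_BUCKETS = 1024  # assumed width of the text feature block


def hashed_bag_of_words(text, n_buckets=N_BUCKETS):
    """Return a sparse {bucket_index: count} representation of the text."""
    counts = {}
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 is used only as a stable hash; Python's built-in hash() is
        # randomized between runs for strings.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


# Made-up caption text for illustration.
vec = hashed_bag_of_words("Figure 2: cross-section of the valve, see figure 2")
```

The number of text columns then stays at `N_BUCKETS` regardless of vocabulary size, at the cost of occasional collisions between unrelated tokens.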

Is this the right way to formulate this problem?

It doesn’t seem to capture the sequential order of the text and the images. For example, if it’s the second image, then it’s more likely to be associated with the first image’s text or the preceding text.
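One idea for getting order into the same matrix would be to add rank features: the position of each text block and each image in reading order, plus their difference. The sketch below assumes PDF-style coordinates with the origin at the bottom-left of the page (so a larger y is higher up); the function names are hypothetical.

```python
def reading_order_rank(blocks):
    """Rank blocks in reading order.

    blocks: list of (page, y, x) tuples.
    Assumes a bottom-left origin, so within a page a larger y comes first.
    Returns a list where ranks[i] is the 0-based rank of blocks[i].
    """
    order = sorted(
        range(len(blocks)),
        key=lambda i: (blocks[i][0], -blocks[i][1], blocks[i][2]),
    )
    ranks = [0] * len(blocks)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def order_features(text_rank, img_rank):
    # The difference is 0 when the n-th text block is paired with the n-th image.
    return [text_rank, img_rank, img_rank - text_rank]


# Made-up positions: (page, y, x)
text_ranks = reading_order_rank([(1, 650, 72), (1, 700, 72), (2, 700, 72)])
image_ranks = reading_order_rank([(6, 400, 100), (5, 400, 100)])
```

These three values could be appended to each row of X, so the model can learn that matching ranks make an association more likely.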

Any thoughts would be greatly appreciated.

Thank you,
negodfre