[Q] Best way to preprocess image + text sequence pairs for an image captioning model?

This is my first time doing image captioning (IC) modeling. I want to extract images and their corresponding text sequences, but I don’t know the best way to organise those files for training an image captioning model. What would you recommend?