Hi everyone,
I would like to create an image captioning model using fastai. I haven't found many resources on doing this, apart from a short thread from last year, so I expect I'll be asking a lot of questions here. COCO 2014 seems to be the standard dataset for this, but if anyone has other suggestions, that would be helpful. For now I have two questions:
How should I go about transfer learning for this? Should I use a regular pretrained CNN and a pretrained RNN and merge them (I'll figure out how to go about this as I go along), or is there a full pretrained network that I can use out of the box?
How should I load the data? I have images and a JSON file with captions, so how should I go about getting those into a DataBunch? If there is a tutorial out there that walks through this, that would also be appreciated.
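For context, here is a minimal sketch (plain Python, not fastai-specific) of how I'm currently pairing image files with their captions, assuming the standard COCO caption layout: an `images` list with `id`/`file_name` entries and an `annotations` list with `image_id`/`caption` entries. The function name `load_caption_pairs` is just my own.

```python
import json

def load_caption_pairs(annotation_json: str):
    """Pair each image file name with its captions from a COCO-style JSON string."""
    data = json.loads(annotation_json)
    # Map image id -> file name from the "images" list
    id_to_file = {img["id"]: img["file_name"] for img in data["images"]}
    # Each annotation carries an image_id and one caption
    return [(id_to_file[ann["image_id"]], ann["caption"])
            for ann in data["annotations"]]

# Tiny example in COCO's format
sample = json.dumps({
    "images": [{"id": 1, "file_name": "cat.jpg"}],
    "annotations": [{"image_id": 1, "caption": "a cat on a mat"}],
})
print(load_caption_pairs(sample))  # [('cat.jpg', 'a cat on a mat')]
```

From a list of (file name, caption) pairs like this, I'm hoping there's a straightforward way to build a DataBunch.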
I'm working my way through the code there, and I'm confused by the values they use for normalizing the images: `transforms.Normalize([0.5238, 0.5003, 0.4718], [0.3159, 0.3091, 0.3216])`
My understanding is that each value in the first list is the mean we would like that channel to have in the normalized image, and the second list holds the standard deviations we would like each channel to have. If that isn't correct, please let me know.
Where do the actual values come from and how would I compute them myself?