My goal is to caption images with a sentence. Rather than only checking whether the model guessed the caption exactly right or not, is there a way to give it a percent grade based on how many words it got correct, or how many are in the right order?
As an example, for a certain image input,
It predicted: “The windows are being installed from top to bottom”
But the correct answer was: “The windows are being installed from bottom to top”
From what I have learned, this would be scored as 100% incorrect, but in my use case it still saves a human a lot of work: they can just go back and make a quick correction. Instead of simply marking it wrong, I am curious if there is a way to derive more detail about how close the response is to the correct answer.
One approach you can take is to use a similarity metric such as the Levenshtein distance or cosine similarity to compare the predicted caption with the correct answer. These metrics produce a numerical score indicating how close the prediction is, which you can then convert into a percent grade.
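As a sketch of the cosine-similarity option: you can turn each caption into a bag-of-words count vector and measure the angle between the two vectors. This is pure standard-library Python (no external packages assumed). Note that a bag-of-words representation ignores word order entirely, so for your example (where only the order of "top" and "bottom" differs) it scores the prediction as a perfect match:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors of two sentences (0.0 to 1.0)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over the words the two captions share
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

pred = "The windows are being installed from top to bottom"
truth = "The windows are being installed from bottom to top"
print(cosine_similarity(pred, truth))  # 1.0 -- same words, order ignored
```

That order-blindness can be a feature or a flaw depending on whether "bottom to top" vs. "top to bottom" matters for your labels; if word order matters, an edit-distance metric is the better fit.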
For example, you can calculate the Levenshtein distance between the predicted caption and the correct answer. The Levenshtein distance is the minimum number of single-element edits (insertions, deletions, or substitutions) required to transform one sequence into the other; a lower distance means higher similarity. Dividing the distance by the length of the longer sequence and subtracting from 1 gives you the percent grade you are after.
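Since you asked about grading per word, here is a minimal sketch that applies Levenshtein distance at the word level (treating each caption as a list of words rather than a string of characters) and normalizes it into a percent grade. It is a standard dynamic-programming implementation in pure Python; libraries like `python-Levenshtein` or `nltk` offer the same computation if you prefer a dependency:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

pred = "The windows are being installed from top to bottom".split()
truth = "The windows are being installed from bottom to top".split()

dist = levenshtein(pred, truth)
grade = 1 - dist / max(len(pred), len(truth))
print(dist, round(grade * 100, 1))  # 2 edits out of 9 words -> 77.8
```

For your example the model gets 7 of 9 words right in the right place (two substitutions: "top" and "bottom" swapped), so it earns roughly a 77.8% grade instead of a flat zero.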