Using meta-data with object detection in training only

I am using RetinaNet for binary object detection with Christian Marzahl's fastai object detection repo. I trained the model and got a decent Pascal VOC score. However, the model still reports some objects as false positives. I tried various transformations, but I still get the same mistakes. The likely cause is that only a few images contain these misleading objects or are of low quality.
Unfortunately, my dataset of 3000 images is hard to expand (they are medical images). And although the misleading objects are rare, they are inevitable in real-life use of the model, so I cannot simply delete those images and trim my data.

Therefore, based on the validation results, I have created a CSV table of the parameters that I thought would mislead the model, and manually entered the existence or degree of each parameter for each image, for example image quality as 1, 2 or 3, or existence of “x” as 0 or 1 (false or true).
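
To make the table concrete, here is a minimal sketch of what such a CSV could look like and how it could be read into a per-image lookup. The column names (`image`, `quality`, `has_x`) and the file names are hypothetical placeholders, not my real columns.

```python
import csv
import io

# Hypothetical CSV layout: one row per image, one column per parameter.
# Column names and values are illustrative placeholders only.
CSV_TEXT = """image,quality,has_x
img_0001.png,3,0
img_0002.png,1,1
img_0003.png,2,0
"""


def load_metadata(handle):
    """Map each image file name to a dict of its integer-valued parameters."""
    rows = {}
    for row in csv.DictReader(handle):
        name = row.pop("image")
        rows[name] = {key: int(value) for key, value in row.items()}
    return rows


# In practice the handle would be an open file instead of an in-memory string.
metadata = load_metadata(io.StringIO(CSV_TEXT))
print(metadata["img_0002.png"])
```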

Now I have three potential ways to benefit from the CSV:

  1. Use a customised datablock and add a flat layer, probably in the classifier layer group of the model, that takes the data for each image and trains weights on it. This sounds very interesting. However, the data that I made manually exists only for training; in testing or real-life use of this model it would not be available, and the input would be just images. So even if I declare the data input as optional, I am not sure whether the model would perform well after deployment, when there is no such meta-data. Right?

  2. Append the data to the image without changing the model. This possibly solves the previous option’s problem for test images, but it might introduce unhelpful noise and not perform well in testing.

  3. Use the data to split a test set from the training set so that images with specific criteria are distributed proportionally across all sets. The problems with this are that there might be fewer than 20 images with a specific rare parameter, and that I would have to write a fairly complex multi-criteria decision algorithm to split the images.
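
For option 3, the multi-criteria split may be simpler than it sounds if every combination of parameter values is treated as one stratum. Below is a minimal sketch of that idea using only the standard library. The function name `stratified_split`, the column names and the synthetic data are all hypothetical, and the rule that a stratum with a single image stays in the training set is my own assumption, not an established recipe.

```python
import random
from collections import defaultdict


def stratified_split(metadata, valid_frac=0.2, seed=0):
    """Split image names into (train, valid), stratum by stratum.

    A stratum is one combination of parameter values, so every combination
    with at least two images ends up in both sets.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for name, fields in metadata.items():
        # Composite key: all (column, value) pairs, ordered by column name.
        groups[tuple(sorted(fields.items()))].append(name)

    train, valid = [], []
    for key in sorted(groups):
        names = sorted(groups[key])
        rng.shuffle(names)
        if len(names) >= 2:
            # Keep at least one image of this stratum on each side.
            n_valid = round(len(names) * valid_frac)
            n_valid = min(max(n_valid, 1), len(names) - 1)
        else:
            # Assumption: a lone image is more useful for training.
            n_valid = 0
        valid.extend(names[:n_valid])
        train.extend(names[n_valid:])
    return train, valid


# Synthetic stand-in for the real table: 100 images, "x" present in only 4.
metadata = {
    f"img_{i:04d}.png": {"quality": 1 + i % 3, "has_x": int(i % 25 == 0)}
    for i in range(100)
}
train, valid = stratified_split(metadata)
print(len(train), len(valid))
```

With many parameters the combinations get sparse and most rare strata collapse to one or two images, so it may be necessary to stratify on only the few parameters that matter most, or to merge rare combinations into a single "rare" stratum before splitting.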

What do you think, forum? I have no experience with the first two solutions, so I am not sure whether they can work for my case at all (using meta-data to train a model that would not have any meta-data in testing or deployment).