Object Localization without being given labels

Hello people! I have a problem where I am given a set of images captured in grocery stores (shelf images) and a set of close-up images of products sold in
those stores (product images). There are two tasks:

  1. For every product image, find the location of that product in all shelf images in which it appears.
  2. For every shelf image, locate all products and assign each one a name from the given set of product images.

Neither set is labelled. I think I need to create a vector that represents each product by training an encoder-decoder model and using its latent space representations.
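To make the vector-matching idea concrete, here is a minimal sketch of nearest-product lookup by cosine similarity. It assumes the latent vectors already exist; the toy vectors, the product ids, and the helper names `cosine_similarity` and `nearest_product` are made up for illustration and are not part of the task.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest_product(query_vec, product_vecs):
    """Return (product_id, similarity) for the product vector closest to query_vec."""
    best_id, best_sim = None, float("-inf")
    for product_id, vec in product_vecs.items():
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_id, best_sim = product_id, sim
    return best_id, best_sim

# Toy latent vectors (assumed values, standing in for encoder outputs).
product_vecs = {
    "product_id_21": [0.9, 0.1, 0.0],
    "product_id_7": [0.0, 1.0, 0.2],
}
crop_vec = [0.8, 0.2, 0.1]  # embedding of some region cut out of a shelf image

match_id, match_sim = nearest_product(crop_vec, product_vecs)
print(match_id, round(match_sim, 3))
```

With a real encoder, the similarity score would probably also need a threshold so that regions matching no known product can be rejected.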

However, they also want me to locate each product on the shelf with a bounding box.
So for a shelf image containing 20 products, they want 20 entries of the form:
e.g. shelf_image 32 - product_id_21 (identified purely by vector similarity) - bounding box
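A rough sketch of what producing those entries could look like, assuming a naive sliding-window search over the shelf image. Everything here is hypothetical: the `Detection` record, the `embed_crop` callback (standing in for the encoder applied to a cropped region), the window size, and the threshold. It is one possible baseline, not a known solution.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    shelf_image: str
    product_id: str
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels
    similarity: float

def sliding_windows(width, height, win_w, win_h, stride):
    """Yield candidate boxes (x_min, y_min, x_max, y_max) that fit inside the image."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)

def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def detect(shelf_name, size, embed_crop, product_vecs, win, stride, threshold):
    """Match every candidate window to its most similar product and keep strong matches."""
    width, height = size
    detections = []
    for box in sliding_windows(width, height, win[0], win[1], stride):
        vec = embed_crop(box)  # hypothetical encoder call, assumed to return a unit vector
        product_id, sim = max(
            ((pid, dot(vec, pvec)) for pid, pvec in product_vecs.items()),
            key=lambda pair: pair[1],
        )
        if sim >= threshold:
            detections.append(Detection(shelf_name, product_id, box, sim))
    return detections

# Toy setup: two unit-length product vectors and a fake encoder keyed on box position.
product_vecs = {"product_id_21": [1.0, 0.0], "product_id_7": [0.0, 1.0]}

def fake_embed_crop(box):
    # Pretend the left half of the shelf shows product 21 and the right half product 7.
    return [1.0, 0.0] if box[0] == 0 else [0.0, 1.0]

detections = detect("shelf_image_32", (4, 2), fake_embed_crop, product_vecs,
                    win=(2, 2), stride=2, threshold=0.8)
for d in detections:
    print(d.shelf_image, d.product_id, d.box)
```

With overlapping windows (stride smaller than the window), the same product would be reported several times, so some form of non-maximum suppression would be needed on top of this.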

Any ideas on how I can do this? Thanks! I'm confident about the vector creation part, but I'm not sure how to use those vectors to take a shelf image and produce both a classification and a bounding box for each product.

[Attached images: an example product image and an example shelf image]