A unet learner is for segmentation, which needs a mask that assigns a class to every pixel. You seem to want to do object detection instead, since you have bounding box coordinates. There are plenty of topics on this forum discussing it, such as Working notebook for Object Detection.
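To make the difference in targets concrete, here is a rough sketch in plain NumPy (not fastai code). The helper `boxes_to_mask`, the box layout `(x_min, y_min, x_max, y_max)`, and the sample values are all made up for illustration. It only shows what a per-pixel mask looks like next to a list of boxes; filling boxes like this gives very crude masks and is not a substitute for real segmentation labels.

```python
import numpy as np

def boxes_to_mask(height, width, boxes, labels, background=0):
    """Hypothetical helper: paint each box's class id into a per-pixel mask.

    Assumes boxes are (x_min, y_min, x_max, y_max) in pixel coordinates,
    with the max edges exclusive.
    """
    mask = np.full((height, width), background, dtype=np.int64)
    for (x_min, y_min, x_max, y_max), label in zip(boxes, labels):
        # Rows are indexed by y, columns by x.
        mask[y_min:y_max, x_min:x_max] = label
    return mask

# Object detection target: a few coordinates plus one class per box.
boxes = [(1, 1, 4, 3)]
labels = [2]

# Segmentation target: one class id for every pixel of the image.
mask = boxes_to_mask(5, 6, boxes, labels)
print(mask.shape)  # (5, 6)
print(mask)
```

A unet learner expects targets shaped like `mask`, while your data is shaped like `boxes` and `labels`, which is why an object detection model is the better fit.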