Serializing Classifier and Regressor heads in Yolo models

Hi folks,
I hope you guys are doing great.
I need some help: if anyone doing research or working in computer vision could share their thoughts on the following problem, that would be great.
As everyone knows, Yolo is an object detection algorithm that works by running a regressor (which gives you the bboxes) and a classifier (which gives you the class labels for those boxes) in parallel.

However, the problem I am having with this kind of execution is the accuracy of the classifier's predictions. Even when Yolo predicts the bboxes accurately, it quite often gets the label wrong, and I suspect the culprit is the image quality.

Basically, I have an image of size 4000 by 6000 pixels, taken with a good camera. I am currently working with YoloV4, and for training the image is resized to 640 by 640 (as usual, because of limited VRAM), with the bboxes scaled accordingly before training starts. My feeling is that because of this, even though the bbox accuracy stays the same, the objects sometimes become too small and blurry to classify, and the model confuses one class with another.
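To put a rough number on how much detail is lost, here is a small sketch of the arithmetic. It assumes the image is 6000 wide by 4000 high and that the resize is a plain stretch to 640 by 640 (no letterboxing); the function name and the 300 by 300 pixel object are made up for illustration.

```python
def resized_object_size(obj_w, obj_h, orig_w, orig_h, net_w=640, net_h=640):
    """Return the (width, height) in pixels of an object after a plain
    stretch-resize of the whole image to net_w x net_h.

    Assumption: no letterboxing or padding, so each axis is scaled
    independently by net / orig.
    """
    return obj_w * net_w / orig_w, obj_h * net_h / orig_h


# Hypothetical object that is 300 x 300 px in the 6000 x 4000 original.
small_w, small_h = resized_object_size(300, 300, orig_w=6000, orig_h=4000)
print(small_w, small_h)  # 32.0 48.0
```

So under these assumptions an object that covered 90,000 pixels in the original is down to about 1,500 pixels by the time the classifier sees it, and its aspect ratio is distorted as well.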

One solution I had in mind was to run the regressor and the classifier in series instead of in parallel, and do the whole object detection in the following steps:

  1. Get the bbox from the regressor for the 640 by 640 image
  2. Scale the bbox back to the original image size
  3. Crop the bbox patch from the original image and give it to the classifier
  4. Continue the training, hoping the results might improve :smiley:
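A rough sketch of steps 2 and 3, just to make the idea concrete. It assumes boxes come out of the detector as `(x1, y1, x2, y2)` in pixels of the 640 by 640 input, that the resize was a plain stretch, and that the original image is 6000 wide by 4000 high. `scale_box` and `crop_patch` are hypothetical helper names, not part of any Yolo codebase, and the box values are made up.

```python
import numpy as np


def scale_box(box, net_size, orig_size):
    """Map an (x1, y1, x2, y2) box from network-input pixels to
    original-image pixels. Sizes are given as (width, height).

    Assumption: the image was stretch-resized, so x and y have
    independent scale factors.
    """
    x1, y1, x2, y2 = box
    sx = orig_size[0] / net_size[0]
    sy = orig_size[1] / net_size[1]
    return x1 * sx, y1 * sy, x2 * sx, y2 * sy


def crop_patch(image, box):
    """Cut the box out of an H x W x C image array, clamping the box
    to the image borders and rounding outwards to whole pixels."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    x1 = int(max(0, np.floor(x1)))
    y1 = int(max(0, np.floor(y1)))
    x2 = int(min(w, np.ceil(x2)))
    y2 = int(min(h, np.ceil(y2)))
    return image[y1:y2, x1:x2]


# Placeholder for the full-resolution photo (height 4000, width 6000).
original = np.zeros((4000, 6000, 3), dtype=np.uint8)

# Made-up detector output on the 640 x 640 input.
box_640 = (64.0, 160.0, 128.0, 320.0)

box_full = scale_box(box_640, net_size=(640, 640), orig_size=(6000, 4000))
patch = crop_patch(original, box_full)

print(box_full)     # (600.0, 1000.0, 1200.0, 2000.0)
print(patch.shape)  # (1000, 600, 3)
```

The `patch` here is what would go to the classifier in step 3. Note this only covers the inference-side plumbing; how to train the two parts together (step 4) is exactly the part I am unsure about.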

I am not sure whether this is an overly complicated way to go about it, but I would like to know whether something like this can be implemented.