About bounding box localization


I just finished this week an implementation of SSD with Keras.
I can send you the wiki page I made explaining how it is working in details (I need to review it one time to make sure there are no mistakes).

(Edwin) #22

Thanks Ben!

Just found this as well, did a search bbox in the new repo and found this as well.

(Edwin) #23

@jeremy a little bit stuck. How do I look at the bounding predicted bounded boxes after using learn.fit? Also, how do I pass in an individual image using this method.

Tried this method that worked on the classification task, but didn’t seem to work here.

trn_tfms, val_tfms = tfms_from_model(arch, sz)

img = Image.open(PATH+fn)
im = trn_tfms(np.array(img))
preds = learn.predict_array(im[None])
y = np.argmax(preds)

(Edwin) #24
  • Update
  1. The from the learn.predict() method are bounding box positions that come out.
  2. This worked for me as from previous answer on forum to get individual prediction.
io_img = io.imread(img_url)
im = self.trn_tfms(np.array(io_img)/255.0)
preds = to_np(self.learn.models.model(V(T(im[None]))))
  • using it an api so mt look a little different.


Here is the document SSD-Description.pdf (2.4 MB).

Please note that it is not a tutorial on how to implement SSD but a summary of the information I collected while studying the model. I hope it can be still helpful.

Regarding the implementation, I encourage everyone to look at the following repo:

It is extremely useful to understand the details.

(Ching June Hao) #26

Hi there, I would like to ask, as I have implemented SSD too to detect custom object, but how should I crop the image out that detected by the box? Is there a way?


@chingjunehao: The output of the SSD detector is a tensor containing N_box “box tensors”. Each “box tensor” contains the data relative to one box generated by the model, the data being (your data order might be different) [x_center, y_center, width, height, x_center_variance, y_center_variance, width_variance, height_variance, class1_score, …, classN_score, x_center_offset, y_center_offset, width_offset, height_offset].
The scores and offsets are the parameters that SSD predicts and the other parameters are fixed.
So in order to crop your image, you need to follow the steps below:

  1. Have a decoder function to compute the predicted position of the boxes in pixels.
  2. Filter the boxes to keep only the ones with the highest score (non-max suppression)
  3. Finally use the positions of the remaining boxes to crop your image.

(Divya Sivasankaran) #30

Interesting thread!

I’m using a combination of inception + ssd to train my own custom dataset. After weeks of painful annotations, I finally got to a good level of accuracy on the object categories.

Strangely though, I can see a consistent pattern in the bounding box predictions among all classes. It’s like they are all expanded to the right side alone to include extra space, yet a tight bound in the left (esp left -bottom). check out the images below… and I’m not exaggerating - but EVERY prediction comes out this way… so I’m pretty sure it has something to do with my data/ a possible bug.


Just wondering if anyone has faced a similar issue or has any ideas on why this might be happening? Any pointers would be great!

(Alex) #31

Yes, this implementation is amazing. Reading the documentation and the code for the ground truth encoding process helped my understanding a lot. It’s actually a lot better (more comprehensive, better documentation, all original models provided) than the one linked to further up in this thread:

(Pawan) #32

hey @arnaud can you please provide the labelImg and RectLabel functions.

(Morten Punnerud Engelstad) #33

Something like this you are looking for to save part of the image?

from PIL import Image
img = Image.open('saved_image.png')
croppedIm = img.crop((490, 980, 1310, 1217))

(late answer)


Hello @arnaud do you have the function that changes json annotations to pickle ?

(Prabhat) #35

Hello All
So after reading this I tried object localization on my own data containing single object per image without any class i.e. I’ve to predict only bounding boxes of object without classifying them as object are very different. Most of the objects are grocerry item on a white background. I tried adding a simple regressor head ahead resnet and mobilenet but I’m in vain as the results aren’t good. Also, some of my test images contain two adjecent objects like a pair of shoes whereas in my training data there’s only one shoe. So, this is one thing I’m not able to tackle. Can somebody help?