Object Detection & Identification

Hi all,

I have been implementing an algorithm for object detection and tracking.

I wanted to enable object re-identification, so I have read some papers on it. But while watching the lecture in which Jeremy explained the principle behind style transfer, I started to wonder whether it would be possible to uniquely identify an object by observing the output of a certain group of kernels or dense layers, without additional mechanisms such as path prediction, explicit pose estimation or measuring the distance between objects on consecutive frames.

For example, I looked at a video in which a girl walked around a store, picked out some shoes and then sat down to put them on. At some point while she was sitting, she was classified as a dog (probably because, from that camera angle, her long hair covered most of her body). When she was classified as a person again, the simple people-counting algorithm I currently use incremented its counter. Intuitively the problem seems trivial, yet the papers I have read so far suggest adding complex extra mechanisms. Perhaps some values before the final layer contain information that is not much affected by pose in this case (e.g. hair color, skin color, dress type) and could serve to re-identify the object throughout a frame sequence or across cameras.
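
To make the idea concrete, here is a minimal sketch of the kind of counter I have in mind. It assumes each detection already comes with a feature vector taken from some layer before the final one. The class name `PeopleCounter`, the cosine-similarity comparison and the 0.8 threshold are placeholders I made up for illustration, not something taken from a paper or a library.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

class PeopleCounter:
    """Hypothetical counter: a detection only counts as new if it matches no stored vector."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # assumed value, would need tuning on real data
        self.gallery = []           # one stored feature vector per identity seen so far

    def observe(self, feature):
        """Return the identity index for this detection, adding a new identity if nothing matches."""
        best_id, best_sim = None, self.threshold
        for idx, known in enumerate(self.gallery):
            sim = cosine_similarity(feature, known)
            if sim >= best_sim:
                best_id, best_sim = idx, sim
        if best_id is None:
            self.gallery.append(np.asarray(feature, dtype=float))
            best_id = len(self.gallery) - 1
        return best_id

    @property
    def count(self):
        return len(self.gallery)
```

With something like this, the girl reappearing as a "person" after a few misclassified frames would match her stored vector and the count would stay the same, provided the features really are stable across poses.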

Has anyone implemented anything relevant to object identification based on the outputs of a particular layer, or come across any interesting articles on the topic?

Thank you.

I haven’t seen object trackers that work without specialized algorithms like Kalman filters.

It might be possible to use the activations of the layer just before the final one. That would be a 3D tensor (HxWxN, where N is the number of feature maps). You could extract the feature map values from the cell that corresponds to the bounding box where an object was found and treat them as a feature vector. One complication is that the vector can encode more than one object if multiple objects were detected in the same cell. There could also be some useful information in the neighboring cells.
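
As a rough sketch of that extraction step, assuming the activations are already available as a NumPy array of shape (H, W, N) and the bounding box is given in image pixels: the function name `box_feature_vector` and the choice to average over all cells the box overlaps (rather than taking a single cell) are my own assumptions for illustration.

```python
import numpy as np

def box_feature_vector(feature_map, box, image_size):
    """Average the activations of the feature-map cells covered by a bounding box.

    feature_map: array of shape (H, W, N)
    box: (x_min, y_min, x_max, y_max) in image pixels
    image_size: (image_height, image_width) in pixels
    Returns a vector of length N.
    """
    fm_h, fm_w, _ = feature_map.shape
    img_h, img_w = image_size
    x_min, y_min, x_max, y_max = box

    # Map pixel coordinates to cell indices on the feature-map grid.
    col0 = int(np.floor(x_min / img_w * fm_w))
    row0 = int(np.floor(y_min / img_h * fm_h))
    col1 = int(np.ceil(x_max / img_w * fm_w))
    row1 = int(np.ceil(y_max / img_h * fm_h))

    # Clamp to the grid and make sure at least one cell is selected.
    col0 = min(max(col0, 0), fm_w - 1)
    row0 = min(max(row0, 0), fm_h - 1)
    col1 = min(max(col1, col0 + 1), fm_w)
    row1 = min(max(row1, row0 + 1), fm_h)

    cells = feature_map[row0:row1, col0:col1, :]
    return cells.mean(axis=(0, 1))
```

Averaging over the covered cells also picks up some of the information in neighboring cells, but it does nothing about two objects sharing a cell, so overlapping detections would still produce mixed vectors.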

If you work with video, maybe it makes more sense to build a model that takes multiple consecutive frames as input and detects objects in the last frame? That seems more straightforward to train, and it may produce better results than post-processing detections from individual frames.
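
One simple way to feed several frames to a model is to stack them along the channel axis. This is only a sketch of the input preparation under that assumption; the helper name `stack_frames` and the window of 3 frames are hypothetical, and other designs (3D convolutions, recurrent layers) are equally possible.

```python
import numpy as np

def stack_frames(frames, window=3):
    """Yield one stacked input per frame, starting at index window - 1.

    frames: array of shape (T, H, W, C)
    Each yielded array has shape (H, W, C * window); its last C channels
    belong to the most recent frame, the one objects would be detected in.
    """
    frames = np.asarray(frames)
    for t in range(window - 1, len(frames)):
        clip = frames[t - window + 1 : t + 1]       # (window, H, W, C)
        yield np.concatenate(list(clip), axis=-1)   # (H, W, C * window)
```

The training labels would then be the boxes of the last frame in each window, so the earlier frames only provide context.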