TensorFlow Object Detection API - add metadata input tensor, e.g. one-hot encoded IDs/tags


I have had some success implementing TensorFlow’s Object Detection API, and I’m getting some pretty good results from training the SSD Inception V2 architecture on our own dataset, with no fine-tuning/checkpoints.

I have so far managed to make some modifications to the API and respective pipeline/config (decoder, exporter, feature extractor and so on) to accept our specific data (single channel, very small resolution <= 30x90 images).

However, I’m now struggling to work out how (or whether it’s even possible) to add metadata inputs to the API to improve accuracy. Our dataset has a huge variety of data within a particular class, and sometimes strong similarity between the classes themselves, which creates some ambiguity. I believe that adding metadata (for example specific IDs/tags) may improve the overall performance, since it would give the network context that should help reduce this ambiguity.

To that end, I have some specific questions/ideas:

  1. Is it possible to modify the current input, and add a second one-hot encoded input tensor alongside image_tensor? If so, where should I begin to look?

  2. If not, would it perhaps be possible to modify the image_tensor input itself, and merge the one-hot encoded data, perhaps as a separate pseudo channel?

  3. Would it perhaps be better to build a standalone network from the ground up, and if so, which parts would I need to “extract” from the API code in order to replicate the SSD-Inception V2 model?

The input tensor modification would be required for both training and inference, and no output tensor modifications/additions are required.
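
To make idea 2 concrete, here is a minimal sketch of merging the one-hot data as pseudo channels, written in plain NumPy rather than against the API itself. The function name `append_one_hot_channels`, the tag count of 5 and the batch shape are hypothetical and only for illustration; in the real pipeline the same step would need to happen wherever the image tensor is decoded and preprocessed.

```python
import numpy as np


def append_one_hot_channels(images, tag_ids, num_tags):
    """Append one constant plane per tag to each image (hypothetical helper).

    images:  float array of shape (N, H, W, C)
    tag_ids: int array of shape (N,), one tag index per image
    Returns an array of shape (N, H, W, C + num_tags).
    """
    # One-hot encode the tag IDs: shape (N, num_tags).
    one_hot = np.eye(num_tags, dtype=images.dtype)[tag_ids]
    n, h, w, _ = images.shape
    # Repeat each one-hot vector at every pixel position.
    planes = np.broadcast_to(one_hot[:, None, None, :], (n, h, w, num_tags))
    # Stack the planes behind the original image channels.
    return np.concatenate([images, planes], axis=-1)


# Assumed example: two single-channel 30x90 images with tags 0 and 3.
batch = np.zeros((2, 30, 90, 1), dtype=np.float32)
merged = append_one_hot_channels(batch, np.array([0, 3]), num_tags=5)
print(merged.shape)  # (2, 30, 90, 6)
```

With this layout the first convolution of the feature extractor would need to accept C + num_tags input channels instead of C.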

In the interest of openness, I have also asked this question on SO: https://stackoverflow.com/questions/47908222/tensorflow-object-detection-add-metadata-input-tensor-e-g-one-hot-encoded-id

Thank you so much for any help/pointers in the right direction.

Have you had a look at the following? https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/defining_your_own_model.md#detectionmodels-object_detectioncoremodelpy


Thanks for the reply. Yes I did read that document, but unfortunately it’s very high level and seems to (understandably) assume that the only input to the feature extractor will be an image tensor. I’m also currently trying to wade through the underlying core code to try and make some sense of how it all knits together, for example the base Inception V2 network architecture: https://github.com/tensorflow/models/blob/master/research/slim/nets/inception_v2.py, and the Mobilenet architecture: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.py

I guess I’ll possibly have to create a new network architecture based on one of those nets, which accepts one-hot encoded data along with the image tensor, then work up from there. I’m just having difficulty trying to work out how to actually integrate this metadata into the network - for example, whether to create a parallel small network to handle the one-hot encoded data (using a simple conv2d + relu architecture), then merge/concatenate it with the ConvNet.
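
As a rough illustration of the parallel-branch idea, the sketch below embeds the one-hot vector with a single dense layer + ReLU (swapped in here for the conv2d + relu mentioned above, since the metadata has no spatial extent), tiles the result over the spatial grid and concatenates it with a ConvNet feature map. It is plain NumPy with random weights standing in for learned parameters; `fuse_metadata`, the feature map shape and the embedding size of 8 are all assumptions, not anything taken from the API.

```python
import numpy as np


def fuse_metadata(feature_map, one_hot, weights, bias):
    """Concatenate an embedded metadata vector onto a feature map (sketch).

    feature_map: (N, H, W, C) output of some ConvNet layer
    one_hot:     (N, K) one-hot encoded metadata
    weights:     (K, D) dense-layer weights, bias: (D,)
    Returns an array of shape (N, H, W, C + D).
    """
    # Small parallel branch: dense layer followed by ReLU, shape (N, D).
    embedded = np.maximum(one_hot @ weights + bias, 0.0)
    n, h, w, _ = feature_map.shape
    # Tile the embedding so every spatial location sees the same context.
    tiled = np.broadcast_to(
        embedded[:, None, None, :], (n, h, w, embedded.shape[1]))
    return np.concatenate([feature_map, tiled], axis=-1)


rng = np.random.default_rng(0)
# Assumed shapes: a 4x12 feature map with 32 channels, 5 possible tags.
features = rng.standard_normal((2, 4, 12, 32)).astype(np.float32)
tags = np.eye(5, dtype=np.float32)[[1, 4]]
w = rng.standard_normal((5, 8)).astype(np.float32)
b = np.zeros(8, dtype=np.float32)

fused = fuse_metadata(features, tags, w, b)
print(fused.shape)  # (2, 4, 12, 40)
```

The layers that follow the merge point would then have to be built for the wider channel count, and the branch weights trained jointly with the rest of the network.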

Another possibility is to perhaps modify the feature map generators: https://github.com/tensorflow/models/blob/master/research/object_detection/models/feature_map_generators.py - but again, there is the issue of where to actually add the metadata inputs.

From what I understand, the SSD net sits after the ConvNet, so trying to understand where to fit the metadata has become somewhat of a major challenge, especially considering the differing input types.

Thanks again