👐 Training a Network to Detect Hands

Hello everyone! My name is Thomas, I am an independent game developer, currently based in Berlin. I have previously built a game named Kingdom, and am now working on a few smaller projects including a public space (outdoor) game. For an impression of what that is like, this is a game I did previously.


For games like the above, I am currently trying to train a network that can detect hands in (close to) real time from a webcam feed. My plan is to train the network to emit high values for regions that contain hands, and then post-process that output with non-maximum suppression / mode-finding to find the actual locations.


I have recorded and annotated some data of myself waving my hands around. This data is the source for generating training targets. I have currently opted for a simple blob around the hand center with a radius corresponding to the hand size. I recorded some data with me sitting in front of the webcam, and some data from further away. I am planning to record much more data, but before investing a lot of time in that, I want to have a better grasp of the requirements, so I started experimenting with a quick and dirty dataset first.

Fig 1: Examples of annotated data:

Fig 1

Fig 2: Example of target image with blobs where the hands are:

Fig 2
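In case it is useful, generating such a target image can be sketched in a few lines of NumPy. (The Gaussian falloff is just one choice of soft blob shape, not necessarily exactly what I used.)

```python
import numpy as np

def render_target(shape, hands, sigma_scale=0.5):
    """Render a target heatmap with one soft blob per (cx, cy, radius) annotation."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    target = np.zeros(shape, dtype=np.float32)
    for cx, cy, radius in hands:
        sigma = max(radius * sigma_scale, 1.0)
        blob = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        target = np.maximum(target, blob)  # overlapping hands: keep the max
    return target
```

The blob peaks at 1.0 on the hand center and falls off with the annotated hand size.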


To start simple and naively, I took the first two Conv Blocks with weights from VGG16, and added another trainable Conv Block on top of that, finishing with a 1x1 Conv2D layer to collapse all the filter weights into a 2D output image.

conv2d_1 (Conv2D)            (None, 64, 240, 426)
conv2d_2 (Conv2D)            (None, 64, 240, 426)
max_pooling2d_1 (MaxPooling2 (None, 64, 120, 213)
conv2d_3 (Conv2D)            (None, 128, 120, 213)
conv2d_4 (Conv2D)            (None, 128, 120, 213)
max_pooling2d_2 (MaxPooling2 (None, 128, 60, 106)
conv2d_18 (Conv2D)           (None, 512, 60, 106)
conv2d_19 (Conv2D)           (None, 512, 60, 106)
conv2d_20 (Conv2D)           (None, 512, 60, 106)
max_pooling2d_8 (MaxPooling2 (None, 512, 30, 53)
conv2d_21 (Conv2D)           (None, 1, 30, 53)

Fig 3: Using this method on the ‘nearby’ dataset, the results are actually decent:
Fig 3

It looks like the network is really marking the hands, and the output would certainly be good enough to localise hands with a proper mode-finding method. I am also excited that this works with relatively few layers, increasing the possibility of running this in real-time.

Question: What is an appropriate loss function?

For the nearby dataset, a significant portion of the pixels in the image consists of hands, but for the ‘faraway’ data the distribution is really skewed: as Fig 2 shows, only a few pixels carry positive values.
What is an appropriate loss function for such a skewed target distribution, i.e. a black (zero) image with sparse white (positive) areas? Since only a few pixels have positive values, I suspect that MSE will encourage the network to play it safe and output low values everywhere.
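One idea I have been toying with (untested, just a sketch) is to weight the loss per pixel, so that the rare positive pixels count for more than the abundant background:

```python
import numpy as np

def weighted_mse(y_true, y_pred, pos_weight=10.0):
    """MSE where positive (hand) pixels are up-weighted relative to background.

    pos_weight is a hypothetical knob; one heuristic would be roughly
    (number of background pixels) / (number of positive pixels).
    """
    weights = np.where(y_true > 0, pos_weight, 1.0)
    return np.mean(weights * (y_true - y_pred) ** 2)
```

With pos_weight = 1 this reduces to plain MSE; with a larger weight, the all-zeros output stops being a cheap local minimum.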

The second part of the question is about “ambiguous” areas. In annotating the data, I have distinguished between “Open and Facing” hands, and “Other hands” (hands that are far away, fists, or other angles, see image below).

Fig 4: “Other Hands”
Fig 4

I do not necessarily need the network to recognise these, but I also do not want to ‘punish’ the network for highlighting them. I suspect that training will be easier if the network is allowed to also give (smaller) positive outputs for Other Hands, because they will carry some of the features that also characterise Open Hands (e.g. skin color). In the final network output, I would be satisfied if the Open Hands just light up more brightly. Since I have already distinguished the classes in my dataset, I figured I could somehow tell the optimiser that it doesn’t matter what it outputs for these “Other Hands”, as long as it gets the Open Hands right.


Very cool project!! :blush:

Your current approach is interesting - is it on github / would you mind sharing the code?

Personally I’d approach it as a regression problem where you try to predict the (x, y) co-ordinates of each hand. There’s also a large amount of literature on pose estimation that you should check out. I think the cutting-edge approach is Mask R-CNN.

@brendan had a forum thread about implementing it here (not sure how far they got):

You should also check out lecture 11 of 2017’s CS231n:

Good luck!

I have looked at some of those papers, it seemed to me that those multi-stage architectures (candidate selection + classification) might be too slow for real time. I only need to identify a single object type, so I figured I could just directly classify each pixel. Isn’t it elegant to perform the convolutions on the whole image, instead of overlapping sub-windows?

Human pose estimation via Convolutional Part Heatmap Regression does something similar for each of the body parts contributing to the pose.

I believe the innovation from R-CNN to Fast R-CNN is to do convolutions over the whole image then slice into the region of the convolution (as opposed to redoing the convolutions for each window). That’s fairly fast but I believe the bottleneck was then a traditional image-processing region proposal method which was replaced in Faster R-CNN by an end-to-end region proposal network.

I think Faster R-CNN is probably fast enough for real-time. Or you could try training a YOLO network on your annotated hands? I think YOLO is definitely fast enough for real-time:


BTW please can you post the code for your approach - keen to check it out if you’re open to posting it :blush:

Ah yes, I remember reading that! It makes a lot of sense to at least combine the convolutions for the whole image. :thinking: Though I still like the idea of combining “what” and “where”, by letting VGG immediately output an “objectness” for each pixel (like in this paper: Fully Convolutional Networks for Semantic Segmentation). Especially since I only have one object class!

So to reiterate my initial question, how do you think I should approach the loss function?

EDIT: I’ve extracted the relevant parts from some python files and a Jupyter notebook: :grin: https://gist.github.com/noio/ac64dcdaf51104677cd628189c98299e
As you can see, it’s really just a few layers of VGG and then a Conv Block on top, trained in the most naive way.

I’ve done some work with Open Pose for a client recently. Open Pose is way overkill for what you’re trying to do, but is similar in some ways. It computes heatmaps for body parts and at the same time some other stuff that relates these body parts to each other. In the paper they mention using “an L2 loss between the estimated predictions and the groundtruth maps”.

They also add a binary mask W to the loss function, “with W(p) = 0 when the annotation is missing at an image location p. The mask is used to avoid penalizing the true positive predictions during training.”

The paper is Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, and the loss function is explained in sections 2.1 and 2.2.
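In NumPy terms, that masked L2 loss is roughly the following. (The normalisation by the mask sum is my own choice here; the paper just sums over spatial locations.)

```python
import numpy as np

def masked_l2(y_true, y_pred, mask):
    """L2 loss with a binary mask W: pixels where mask == 0 contribute nothing,
    so the network is not penalized for whatever it outputs there."""
    return np.sum(mask * (y_true - y_pred) ** 2) / np.maximum(np.sum(mask), 1.0)
```

For the “Other Hands” case, the mask would be zero over those regions and one everywhere else.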

1 Like

Oh wow, OpenPose is really cool! (What am I even doing! :wink: ) Those “Part Affinity Fields” are really smart. Though it might be overkill for my envisioned task, it is probably also a lot more robust. I will think about being pragmatic and employing their solution :older_man: .

:pray: Thanks for your hint regarding a Masked Loss Function. That is exactly the thing I was looking for.

For my own education in the mean time, I have another question:

:arrow_right: Could the scaling of the target data have an influence on training performance?

I discovered that I was not rescaling my target images (Fig 2), so they had values between 0 and 255. Someone else recommended I use binary crossentropy, so I figured I would have to scale the targets to 0 - 1.0. It seems that this change has really hurt training performance. (Without changing the loss function!) The loss immediately jumps to a low value and stays there for the rest of training, while the network doesn’t learn much. You can see that the predicted output is empty on the bottom right of the image below.

Fig 5: Learning does not progress when rescaling the target images to a 0 - 1.0 range
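For intuition, a quick back-of-the-envelope check shows why the loss can drop immediately while the network learns nothing: with targets in 0 - 1.0 and only a few positive pixels, predicting (near) zero everywhere already gives a small mean loss. (The blob size and constant prediction below are made up for illustration.)

```python
import numpy as np

# Sparse target: 45x80 map with a single 5x5 positive blob (about 0.7% of pixels).
target = np.zeros((45, 80))
target[20:25, 40:45] = 1.0

# A "lazy" network that predicts a small constant everywhere.
pred = np.full_like(target, 0.01)

mse = np.mean((target - pred) ** 2)
eps = 1e-7
bce = -np.mean(target * np.log(pred + eps) + (1 - target) * np.log(1 - pred + eps))

# Both losses are already tiny without the network having learned anything.
```

Which suggests the low plateau in Fig 5 is the all-background local minimum, not actual progress.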

Note that Open Pose does not detect the positions of the hands, only of the wrists. If that is good enough for you, you could take their trained network and simply throw away all the layers and filters that compute the other things.

I think that the idea of the “Convolutional Pose Machine” used in OpenPose and its predecessors is that the confidence maps for the different joints help the network attain more certainty. E.g.: a wrist is more likely to be near an elbow than near a foot. So my guess is that throwing away the subnets and confidence maps for “everything but wrists” would be detrimental (or catastrophic) for performance.

You could try looking at the SSD architecture or MobileNet. If you are looking to detect moving hands, you could perhaps just extract the moving areas in the frame and use a shallow, fast CNN to classify whether each is a hand. A neural-network-only approach will probably be a little slow.

re:extracting areas: I would prefer to stick with my current one-pass method because of simplicity and possibly even speed. Output (as shown below) is getting pretty decent for what I need, and might already fall in the category of “Shallow, Fast CNN”. (With 10 Layers?)

re:MobileNet: What exactly is the core concept behind MobileNets? Is it to structure the network in a way that is easy to process on a CPU? At first glance: does that mean only 3x3 “depthwise separable” convolutions? I might be able to reshape my current network to satisfy such criteria. It is currently built on a few layers of VGG, but could it be built on a few layers of MobileNet instead?

re:motion I have thought about feeding the optical flow into the network, to assist it with a prior (moving parts are more likely to be hands). I also thought about feeding it with the previous hand locations as a prior. But those are all just bells & whistles.

Fig 6: Results are very decent. Architecture is still quite simple

You can use MobileNet to replace your VGG layers. It’s a lot smaller and faster and has the same accuracy. That’s all, you can just think of it as a faster version of VGG. Of course, you’ll have to retrain your network when you do this since it uses totally different weights.

(MobileNet+SSD is a specific network for doing object detection. It puts the SSD stuff on top of MobileNet, just like you’ve put your own layers on top of VGG.)
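To see why it is faster, here is a rough multiply-add count for a standard convolution versus a depthwise separable one (the arithmetic from the MobileNet paper, plugged into one of your layer sizes as an example):

```python
def conv_cost(h, w, c_in, c_out, k=3):
    """Multiply-adds for a standard k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

def separable_cost(h, w, c_in, c_out, k=3):
    """Multiply-adds for a depthwise k x k pass plus a pointwise 1 x 1 pass."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Example: a 120 x 213 feature map with 128 -> 128 channels (like conv2d_4 above).
standard = conv_cost(120, 213, 128, 128)
separable = separable_cost(120, 213, 128, 128)
# The ratio is 1 / (1/c_out + 1/k^2), so roughly 8-9x fewer multiply-adds here.
```

The saving grows with the number of output channels, which is why the deeper layers benefit most.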

1 Like

Cool project! Have you thought about incorporating this into ARKit? Out of curiosity, how did you label your data?

Very cool, thanks for sharing your solution. I would like to use it to “segment” out license plates and smoke, and compare the results of different solutions in the future.

I remember that GANs can help us find an appropriate loss function. Have you tried this out?

Image-to-Image Translation with Conditional Adversarial Networks

Maybe it can generate better results? I do not know.

I built a small tool that lets me quickly move markers around while scrubbing through the video (Fig 1.).
The tool also uses Optical Flow to automatically move the markers along with the moving video, so for most frames, no adjustment is needed once the marker is put into place. Then I built another tool that converts the object locations into target frame activations (Fig 2.)

I hand-annotated about 4000 frames in a few days.

1 Like

Here is a small animation of tracking results on a bunch of frames. After obtaining the output “confidence map”, I did a simple non-maximum suppression on the 80x45 output image. That is also why the tracking is a bit jittery: it is currently naively selecting the maximum pixels out of an 80x45 image, with no “sub-pixel” accuracy. I should probably apply some kind of Kalman filter directly to the position confidence map to get smoother tracking and to factor in the prior.

Fig 7: Results in motion
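For reference, the naive peak-picking I am using amounts to something like this (the threshold and window size here are arbitrary, not tuned values):

```python
import numpy as np

def find_peaks(heatmap, threshold=0.5, radius=2):
    """Return (row, col) of local maxima above threshold on a 2D confidence map."""
    h, w = heatmap.shape
    peaks = []
    for r in range(h):
        for c in range(w):
            v = heatmap[r, c]
            if v < threshold:
                continue
            window = heatmap[max(r - radius, 0):r + radius + 1,
                             max(c - radius, 0):c + radius + 1]
            if v >= window.max():  # keep only the neighbourhood maximum
                peaks.append((r, c))
    return peaks
```

Selecting whole pixels this way is exactly what causes the jitter: the position can only change in steps of one output pixel.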


Thomas this is excellent! It’s clear your combination of game design and ML skills is paying off!

1 Like

Wow, I would never have guessed that! Seems non-trivial to build a video annotation tool, right?

I’m curious because I’m working on an image labeling tool myself. I have a load of questions, but I’ll send them over chat to avoid distracting from this awesome thread!


I’ve smoothed out the tracking — gaining sub-pixel accuracy — by simply taking a weighted average of the coordinates in the 3x3 region around each peak:

Fig 8: Smoothing / sub-pixel accuracy
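In code, the refinement step is roughly this (a sketch of the weighted-average idea, clamped at the map borders):

```python
import numpy as np

def refine_peak(heatmap, r, c):
    """Refine an integer peak (r, c) to sub-pixel accuracy via the
    confidence-weighted average of coordinates in the 3x3 patch around it."""
    h, w = heatmap.shape
    rows = np.arange(max(r - 1, 0), min(r + 2, h))
    cols = np.arange(max(c - 1, 0), min(c + 2, w))
    patch = heatmap[np.ix_(rows, cols)]
    total = patch.sum()
    if total <= 0:
        return float(r), float(c)
    rr = (patch.sum(axis=1) * rows).sum() / total
    cc = (patch.sum(axis=0) * cols).sum() / total
    return rr, cc
```

When the confidence mass is split between two adjacent pixels, the refined position lands between them instead of snapping to either one.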

Now I’m looking for a way to deploy this in a C++ ‘production’ app (openFrameworks) that runs both on OSX and Win (and CPU only). My first thought: somehow export to Caffe/Caffe2 ?? (Also looked at tiny-dnn :two_hearts: but unfortunately the import/export options there are limited still)…

Setting it up in C++ seems daunting. :sweat_smile:, so advice very welcome! :pray:


Maybe this post can help you. I created a cross-platform app with Qt5 and OpenCV 3.3 (open source) which can run on Windows, Linux, Android, and Mac. Why no iOS? Because I do not have any iOS device, so I cannot test it; in theory both Qt5 and OpenCV 3.3 can run on iOS.

The link comes with an Android binary. If you have any questions (e.g. how to build on other platforms like Windows, Mac, or Linux), please leave me a message.

Please let me join your project if possible; I am quite familiar with C++ and know how to create cross-platform apps with Qt5 and OpenCV.

I suggest you port to Caffe, TensorFlow, or Torch, because models described by these libraries are supported by the dnn module of OpenCV 3.3.