Presence score in keypoint detector

So I want to train a network that, when fed a cropped RGB image of a hand, gives back both the keypoints and a score for how confident it is that a hand is actually there. Getting the keypoints is a relatively simple regression/classification problem, but my issue is what my data should look like for the presence score.

Would it mean that I would have to feed in images that have no hands in them? It seems like that could cause problems, but I’m not sure how else I would attack this problem.
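
To make the question concrete, here is a rough sketch of the kind of loss I imagine if hand-free images are included. It is only an assumption on my part, not something I have tested: each sample carries a 0/1 has_hand label, the presence output is trained with binary cross-entropy on every sample, and the keypoint error is skipped for samples without a hand so that negatives have no keypoint targets to fit. All names here (combined_loss, has_hand, presence_pred, kp_pred, kp_true, kp_weight) are hypothetical.

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def combined_loss(batch, kp_weight=1.0):
    """Presence loss on every sample, keypoint loss only on samples with a hand.

    batch: list of dicts with hypothetical fields
      'presence_pred' : float in [0, 1], predicted probability of a hand
      'has_hand'      : 1 if the crop contains a hand, else 0
      'kp_pred'       : flat list of predicted keypoint coordinates
      'kp_true'       : flat list of target coordinates (unused when has_hand == 0)
    """
    presence_total = 0.0
    kp_total = 0.0
    n_pos = 0
    for s in batch:
        # Every sample, positive or negative, supervises the presence score.
        presence_total += bce(s["presence_pred"], s["has_hand"])
        if s["has_hand"]:
            # Mean squared error over this sample's keypoint coordinates.
            sq = [(a - b) ** 2 for a, b in zip(s["kp_pred"], s["kp_true"])]
            kp_total += sum(sq) / len(sq)
            n_pos += 1
    presence_loss = presence_total / len(batch)
    # Average the keypoint loss over positives only; zero if the batch has none.
    kp_loss = kp_total / n_pos if n_pos else 0.0
    return presence_loss + kp_weight * kp_loss
```

With this layout, whatever the keypoint output produces on a hand-free image does not affect the loss, which is the behaviour I would hope for, but I don't know if this is the standard way to do it.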

Thanks for the help!