There are a bunch of papers on arXiv about building density estimation maps and counts, for example: https://arxiv.org/pdf/1608.06197v1.pdf

Suppose I wanted to extend this so that instead of counting the number of *people* in the crowd, I estimate the number of adults and the number of children separately.

My training dataset would consist of the raw input RGB images, as well as not one, but two density maps, and two Σ counts – one for children and one for adults. Jeremy’s already shown us how to do multi-output in lesson 7.
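To make the targets concrete, here is a minimal sketch of how I picture building them, assuming each person is annotated as a single (row, col) point with a class label. The `make_density_map` helper, the `sigma` value, and the example coordinates are all made up for illustration and are not taken from the paper. Each point becomes a Gaussian blob normalized to sum to 1, so the sum over each map gives that class's count.

```python
import numpy as np

def make_density_map(points, shape, sigma=4.0):
    """Sum one Gaussian blob per annotated (row, col) point.

    Each blob is normalized to sum to 1 over the image, so the
    total of the returned map equals the number of points.
    """
    h, w = shape
    rows = np.arange(h)[:, None]   # column vector of row indices
    cols = np.arange(w)[None, :]   # row vector of column indices
    density = np.zeros(shape, dtype=np.float32)
    for r, c in points:
        blob = np.exp(-((rows - r) ** 2 + (cols - c) ** 2) / (2.0 * sigma ** 2))
        density += (blob / blob.sum()).astype(np.float32)
    return density

# Hypothetical annotations for one small image (illustrative values only)
adult_points = [(20, 30), (50, 80), (90, 10)]
child_points = [(40, 40), (70, 100)]
shape = (128, 128)

adult_map = make_density_map(adult_points, shape)
child_map = make_density_map(child_points, shape)

# Two-channel target: channel 0 = adults, channel 1 = children
targets = np.stack([adult_map, child_map])
counts = targets.sum(axis=(1, 2))  # one count per class
print(targets.shape, counts)
```

The two-channel `targets` array and the `counts` vector would then be the two outputs of the multi-output model.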

The problem is that if the input is a huge image like 3x2000x2000 (RGB), even a very shallow net will be too large to fit on the GPU, since we need to hold the 3x2000x2000 input plus two single-channel density maps (two 1x2000x2000). Is there any other technique for doing multiclass counts without generating intermediate density maps? I’ve been experimenting with lower-resolution maps such as 200x200, but was hoping there was a more elegant strategy.

There was another paper published about 5 years back where they literally had the network come up with everything given just the input image and the true counts, but that seems like giving the net too much work to do… especially if one has annotated training data.
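For reference, the lower-resolution experiment mentioned above looks roughly like this. It is only a sketch under my own assumptions: `sum_pool` is a hypothetical helper, and the random points stand in for real annotations. The idea is to shrink a 2000x2000 map to 200x200 by summing each 10x10 block rather than averaging or interpolating, so the total count is preserved.

```python
import numpy as np

def sum_pool(density, factor):
    """Downsample a 2-D density map by summing non-overlapping blocks.

    Summing (rather than averaging) keeps the total count unchanged.
    """
    h, w = density.shape
    if h % factor or w % factor:
        raise ValueError("map size must be divisible by factor")
    blocks = density.reshape(h // factor, factor, w // factor, factor)
    return blocks.sum(axis=(1, 3))

# Stand-in for a real full-resolution map: 50 random unit impulses
rng = np.random.default_rng(0)
full = np.zeros((2000, 2000), dtype=np.float32)
ys = rng.integers(0, 2000, size=50)
xs = rng.integers(0, 2000, size=50)
np.add.at(full, (ys, xs), 1.0)  # handles repeated coordinates correctly

small = sum_pool(full, 10)
print(small.shape, full.sum(), small.sum())
```

The low-resolution map loses spatial detail inside each 10x10 block, but the per-class count the network is trained to reproduce stays the same.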