I am working on a project that uses satellite images, and in which I am hoping to combine three of the techniques taught in Lesson 3. In particular, I plan to break my problem down into three different learners:
Learner1: Input: Satellite Image >>> Output: [XY coord pairs of all buildings] (Similar to BIWI example)
Learner2: Input: [Satellite Image, XY coord of one building] >>> Output: Building use multi-classification (similar to satellite example)
Learner3: Input: [Satellite Image, XY coord of one building] >>> Output: Segmentation mask of building footprint (similar to camvid example)
My most burning question is this:
In the BIWI example, we used regression to output a single coordinate pair. How do I build a regression model that will output an unknown number of coordinate pairs (one for each building in the picture)?
My secondary question is:
How could my Learner1/2/3 breakdown be improved? Is there a more efficient way to solve this problem? A better ordering of Learners?
Maybe a different framing of the question could be the following:
Is there a way to count instances of an object in an image using a resnet?
Sure! We can frame it as a regression problem where y is the total number of objects. The assumption is that the model will learn the trend that when five objects are present, the answer is 5. This is a single-value regression problem. You can look at how Rossmann is set up for labeling the columns when passing in a y.
Why have you broken it down into these 3 problems?
It can indeed be faster at inference time to use some coarse detection like image classification (is a building present) or object detection (where is the building) before narrowing down into costly pixel segmentation on a subset of the data. Or maybe you are just inquisitive and want to understand how using the three techniques will work out on your project.
But I would recommend starting with the segmentation problem if your GPU time allows. With a segmentation result in hand, it's usually easy, and just as accurate as object detection, to use a tool like opencv to convert the binary segmentation mask into discrete objects. In my experience, throwing classification and detection into the mix as pre-tasks does not improve accuracy significantly.
Thank you for the thoughtful reply, Zachary!
I agree that you can use regression to extract a single number from a picture (in this case the number of object instances), which I asked about in my reframing question, but my original question is then one step deeper. That is: can you extract an arbitrary number of numbers from a picture (in this case a pair of coords for each object instance)?
To clarify the difference, imagine two images, one containing 1 object and the other containing 5. In the structure you've suggested, the output is a single value: either 1 or 5. In the problem I'm trying to solve, the outputs would be different shapes: either [[x1,y1]] or [[x1,y1],[x2,y2],[x3,y3],[x4,y4],[x5,y5]]. Do you have any insight into this challenge?
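One common workaround for the variable-shape problem (an assumption on my part, not something the BIWI notebook does) is to fix a maximum number of buildings K, have the model emit K coordinate pairs plus a per-slot "present" score, and mask padded slots out of the loss. The variable-length targets then become fixed-shape tensors:

```python
import torch
import torch.nn.functional as F

K = 8                                         # assumed max buildings per image
batch = 2

# Fixed-shape targets: pad unused slots with zeros, track validity separately.
target_xy = torch.zeros(batch, K, 2)
valid = torch.zeros(batch, K)
target_xy[0, 0] = torch.tensor([0.3, 0.7])    # image 0: one building
valid[0, 0] = 1
target_xy[1, :5] = torch.rand(5, 2)           # image 1: five buildings
valid[1, :5] = 1

# Stand-ins for model outputs (a real model would produce these from the image).
pred_xy = torch.rand(batch, K, 2, requires_grad=True)
pred_logits = torch.zeros(batch, K, requires_grad=True)

# Coordinate loss only where a building exists; presence loss on every slot.
per_slot = F.mse_loss(pred_xy, target_xy, reduction="none").sum(-1)
coord_loss = (per_slot * valid).sum() / valid.sum()
presence_loss = F.binary_cross_entropy_with_logits(pred_logits, valid)
loss = coord_loss + presence_loss
loss.backward()
```

This is essentially a stripped-down version of how object-detection heads handle a variable number of boxes per image.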
Thanks for your response; it helps me think about the problem more deeply. Here are a few of my current thoughts:
• I split it into multiple smaller learning processes because a class I took on Google Brain left me with the impression that breaking learning tasks into smaller, more specialized chunks leads to better performance, easier unit testing, and smaller training sets. For instance, a dataset of single dots on objects is far cheaper to create by hand than the same dataset colored in by segmentation (Learner1). And given a dot, segmenting just the pixels immediately around it to find the edges of the building feels computationally cheaper and seems to need at least an order of magnitude less training data (Learner3). This intuitively feels right, but perhaps you could share more details of experience to the contrary?
• The idea of binary presence-detection to reduce computation load is super helpful!
• I (perhaps naively) believed that the BIWI-style coordinate placement would be less computationally expensive than segmentation, but you seem to suggest they would be roughly equivalent. Am I just misunderstanding what the BIWI project is doing?
Again, I really appreciated your contribution!
Has anyone heard of RepSet?
It seems adjacent to the challenge I put forth here, and I was wondering if anyone would be interested in chatting about how its findings might be useful in this particular instance, or for FastAI in general?