I am wondering if there are more sophisticated and correct methods than simple averaging of individual predictions for image recognition using small patches (tiles).
I have a collection of high-resolution images (4096x4096) and am solving a simple binary classification problem with a CNN. Due to the large size of the images, I sliced each one into 16 tiles (1024x1024 each) and treat them as ‘independent’ smaller images. I deliberately want to keep the high resolution because tiny details are important (I don’t want to interpolate pixels and lose information).
Once my model is trained, I make predictions on each small tile from the same original image (16 predictions) and then want to draw a final conclusion about the whole image (whether it belongs to class 0 or 1).
A naive approach would be majority voting: if 9 or more tiles out of 16 vote for class 0, then the full-sized image belongs to class 0. However, the tiles may contradict each other (8 vote for class 0 and 8 vote for class 1), so one needs to impose specific rules for handling such ties.
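For concreteness, here is a minimal sketch of that naive vote, with the tie case left unresolved (the per-tile predictions are hypothetical):

```python
import numpy as np

def majority_vote(tile_preds):
    """Return 0 or 1 by majority over per-tile class predictions,
    or None on a tie, which needs an extra rule."""
    votes_for_1 = int(np.sum(tile_preds))
    votes_for_0 = len(tile_preds) - votes_for_1
    if votes_for_0 == votes_for_1:
        return None  # 8 vs 8: the ambiguous case described above
    return 0 if votes_for_0 > votes_for_1 else 1

print(majority_vote([0] * 9 + [1] * 7))   # 0
print(majority_vote([0] * 8 + [1] * 8))   # None
```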
Any better ideas on how to deal with this situation?
Interesting topic. I can’t speak for everyone, but my thought process about this would be:
Time constraints: for inference, do I need predictions immediately after requesting analysis? If not, deferred batch processing can be an option for inference.
Target constraints: relative to the image size, are the targets really small? If not, resizing the image down will do just fine.
Hardware limitations: the time constraints will let you choose from CPU up to large GPU options for inference. Now, for the original image size: if your architecture supports it, can you fit it in memory during training? If not, there are two options: resizing or slicing.
If you are indeed memory-limited and the batch size cannot be reduced any further, you can either switch to CPU (if you have the patience for training), go for your tiling solution, or consider gradient accumulation (if you can fit at least 1 image per batch but, for performance purposes, want a larger effective batch size). The latter splits one optimizer step across several small batches.
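A minimal PyTorch sketch of gradient accumulation (the tiny model, optimizer, and random data here are hypothetical stand-ins for your CNN and loader):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

accum_steps = 4  # effective batch size = per-step batch size * accum_steps

optimizer.zero_grad()
for step in range(16):
    x = torch.randn(1, 8)            # batch of 1: all that fits in memory
    y = torch.randint(0, 2, (1,))
    loss = criterion(model(x), y)
    (loss / accum_steps).backward()  # scale so gradients average correctly
    if (step + 1) % accum_steps == 0:
        optimizer.step()             # update only every accum_steps batches
        optimizer.zero_grad()
```

The division by `accum_steps` makes the accumulated gradient equal the mean-loss gradient of one large batch, so the learning rate keeps its usual meaning.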
Back to your tiling solution, the easiest solution (but imperfect) I can see is:
rather than averaging, take the maximum score for each class across the tiles (don’t convert the scores to probabilities at the end of your model; keep the raw logits). In your case it’s binary, so this concerns only one class.
then apply a softmax (or any other squashing operation you like) to get your probabilities
Positive side: it takes into account the strongest activation of each class in any tile. Converting to probabilities per tile first would have normalized the scores, and you would have lost the ability to compare raw activation values.
Negative side: by tiling, you reduced the receptive field of your model relative to the full image. So even with the above technique, this will not match the performance of fitting everything in a single batch.
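The max-then-softmax aggregation described above can be sketched like this (the logits are hypothetical per-tile outputs for one image):

```python
import numpy as np

def aggregate_max_logit(tile_logits):
    """Combine per-tile logits into one image-level prediction.

    tile_logits: array of shape (n_tiles, n_classes), raw model
    outputs (pre-softmax scores), e.g. (16, 2) for this problem.
    """
    # Step 1: max logit per class across all tiles.
    max_per_class = tile_logits.max(axis=0)        # shape (n_classes,)
    # Step 2: softmax over the pooled logits to get probabilities.
    shifted = max_per_class - max_per_class.max()  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return probs

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 2))     # 16 tiles, 2 classes
probs = aggregate_max_logit(logits)
print(probs, probs.argmax())
```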
In conclusion, not knowing the type of images and targets you have, I would recommend comparing the performance of the resizing option (the best option if your targets are not tiny, since you can then fit a sample in your batch), gradient accumulation (the next best option if the only problem is that you cannot fit more than 1 sample per batch in memory), and tiling with the maximum score of each class across the tiles.
If there is another piece of information that would change the problem framing, or if this isn’t clear, let me know. Happy to hear other people’s takes on this matter as well.
Many thanks for your detailed response, that’s quite interesting! These are medical (microscopy) images, so attention to tiny details is quite important. I’m using an array of 10 RTX 2080 Ti GPUs (12 GB memory each). Depending on the CNN architecture, I can fit at most 2048x2048 resolution (batch size 1), but definitely not 4096x4096. With 1024x1024 I can go deeper (use a larger model).
I will give your suggestion a go: taking the maximum score per class across the tiles and then softmaxing. Of course it’s not ideal, but it’s at least a fairly reasonable approach. I’ll keep you posted, and once it’s solved I’ll post my solution here.