Why do cropping in u-net segmentation?

I have a theoretical question about u-net segmentation. To my understanding, the u-net does down-sampling and then up-sampling, and feeds cropped feature maps from the early layers across to the later layers, hence the ‘copy and crop’ operation in the diagram at the top of page 2 of the original u-net paper https://arxiv.org/pdf/1505.04597.pdf.

My question is: why do any cropping at all? Why not just resize instead? Or at least up-scale until the feature maps in the upper layers are the same size as the ones being copied across from the lower layers. Surely some valuable information at the edges of the image is lost. We are talking about significant cropping here (the 568 by 568 output of the first block is cropped to 392 by 392 when copied to the beginning of the last block).
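For reference, this is roughly what the crop half of ‘copy and crop’ amounts to in PyTorch, using the sizes from the paper's figure (the `center_crop` helper is just my illustration, not code from the paper):

```python
import torch

def center_crop(feature_map, target_hw):
    """Crop a (N, C, H, W) feature map to target_hw around its centre."""
    _, _, h, w = feature_map.shape
    th, tw = target_hw
    top = (h - th) // 2
    left = (w - tw) // 2
    return feature_map[:, :, top:top + th, left:left + tw]

# Encoder output of the first block (paper sizes: 568x568, 64 channels)
enc = torch.randn(1, 64, 568, 568)
# The matching decoder feature is only 392x392, so the encoder map is
# centre-cropped before the two are concatenated.
cropped = center_crop(enc, (392, 392))
print(cropped.shape)  # torch.Size([1, 64, 392, 392])
```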


It looks like they are using the cropping to deal with the smaller output of unpadded convolutional layers (ignoring dimensionality, a convolution with a kernel of size n loses n-1 pixels in each dimension, e.g. a 3x3 conv trims one pixel from each border). So you either accept a smaller output or you use padding on the convolutions. They chose to proceed with the smaller output.
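To make the shrinkage concrete, here's a quick PyTorch check of the two options (the sizes and channel counts are just example values):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 64, 64)

# Unpadded ("valid") 3x3 convolution: loses n-1 = 2 pixels per dimension
valid = nn.Conv2d(1, 8, kernel_size=3, padding=0)
print(valid(x).shape)   # torch.Size([1, 8, 62, 62])

# Padded 3x3 convolution: output size matches the input
padded = nn.Conv2d(1, 8, kernel_size=3, padding=1)
print(padded(x).shape)  # torch.Size([1, 8, 64, 64])
```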

They tile the original input image, using overlapping input tiles to produce a full segmentation in spite of this. Each 572x572 input tile only predicts a 388x388 output region. So it just increases the amount of work needed, as the tiles must now overlap. No edge information is lost; it's just processed in other tiles (with reflection padding applied to the whole image to handle tiles at the border of the original image). This tiling would likely be needed anyway for the original problem, to fit the full input images in available GPU memory.
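A rough sketch of the overlap-tile bookkeeping (the tile sizes are from figure 1 of the paper; the 1024px image width is a made-up example):

```python
import math

# Paper's figure 1: a 572x572 input tile predicts a 388x388 output region.
tile_in, tile_out = 572, 388
margin = (tile_in - tile_out) // 2   # 92px of extra context per side

def tiles_needed(image_px):
    """How many overlapping input tiles cover one image dimension."""
    return math.ceil(image_px / tile_out)

image_px = 1024                       # hypothetical full-image width
n = tiles_needed(image_px)            # 3 tiles of 388px output cover 1024px
# Each tile's input window extends `margin` pixels beyond its output region,
# so neighbouring input windows overlap, and tiles at the image border rely
# on the mirror/reflection padding of the full image.
print(n, margin)                      # 3 92
```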

Note that the fastai UNet does not follow this: it uses padding to deal with the lost border pixels, so no cropping is done and tiling is not required.
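This isn't fastai's actual implementation, just a minimal sketch of why padded convolutions remove the need for cropping: the encoder map and the upsampled decoder map come out the same size, so they can be concatenated directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# With padded 3x3 convs the skip connection needs no cropping.
conv_enc = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv_dec = nn.Conv2d(16, 16, kernel_size=3, padding=1)

x = torch.randn(1, 3, 64, 64)
enc = conv_enc(x)                                     # 1 x 16 x 64 x 64
down = F.max_pool2d(enc, 2)                           # 1 x 16 x 32 x 32
dec = F.interpolate(conv_dec(down), scale_factor=2)   # back to 64 x 64
skip = torch.cat([enc, dec], dim=1)                   # concat works directly
print(skip.shape)                                     # torch.Size([1, 32, 64, 64])
```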

You could try to compensate for the lost border pixels with resizing, but that seems to have drawbacks of its own. In their architecture they use a simple 2x2 up-convolution (fastai uses PixelShuffle for its 2x upsizing). If you were to resize to compensate, you'd need a more complicated method such as bilinear interpolation, because the scale factors stop being clean powers of two: the two 3x3 convs between upsizings trim 4px at each level, so (taking the paper's numbers) the 28px bottleneck map would have to be resized straight to 64px to match the contracting-path map at that level, rather than simply doubled to 56px. That arbitrary resizing could introduce artifacts/information loss. It would also increase the required processing, though it's not obvious to me how this would compare to the extra processing required by the overlap (given that the overlapping pixels are only processed by the initial layers, as they are dropped by successive convolutions).
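A minimal sketch of that difference, assuming the paper's 28px bottleneck and 64px contracting-path map (this is just my illustration, not anything from the paper or fastai):

```python
import torch
import torch.nn.functional as F

bottleneck = torch.randn(1, 1024, 28, 28)   # paper's lowest-resolution map

# A plain 2x upsample, standing in for the paper's learned 2x2 up-convolution
# (the 64px encoder map then gets cropped to 56px before concatenation).
up_2x = F.interpolate(bottleneck, scale_factor=2, mode="nearest")       # 56 x 56

# Hypothetical "no cropping" alternative: resize straight to 64px so the
# concatenation needs no crop; 64/28 is not a power of two, so bilinear
# (or similar) interpolation is required.
up_match = F.interpolate(bottleneck, size=(64, 64), mode="bilinear",
                         align_corners=False)                           # 64 x 64
print(up_2x.shape, up_match.shape)
```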
So while resizing might work, I'd suspect that either the original method of compensating with overlapping tiles or the fastai approach of padded convolutions would be better. But I could be wrong (and compared to tiling in particular, padded convolutions as in fastai seem like a good alternative).