I don't think upsampling alone can give good results. After ResNet/VGG, the information is heavily compressed, and the upsampling layers simply don't have enough information to reconstruct the mask.
My guess is that VGG, having more parameters, can memorize the patterns of each part better, or it may simply have more features at the end of its layers than ResNet does.
Anyway, without a U-Net-like architecture, there isn't much point in benchmarking.
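To make the U-Net point concrete, here is a minimal sketch in PyTorch of how skip connections hand full-resolution encoder features back to the decoder, so the upsampling path is not working from the compressed bottleneck alone. This is only an illustration: `TinyUNet`, `conv_block`, the two-level depth, and the channel sizes are all made up for the example and are not anyone's actual competition model.

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions; padding=1 keeps the spatial size unchanged.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class TinyUNet(nn.Module):
    """Hypothetical two-level U-Net, only to illustrate skip connections."""

    def __init__(self, in_ch=3, base=16):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)
        self.enc2 = conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(base * 2, base * 4)
        # Transposed convolutions double the spatial size on the way back up.
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, kernel_size=2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)  # input: upsampled + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, kernel_size=2, stride=2)
        self.dec1 = conv_block(base * 2, base)      # input: upsampled + skip
        self.head = nn.Conv2d(base, 1, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        b = self.bottleneck(self.pool(e2))   # 1/4 resolution (most compressed)
        # Skip connections: concatenate encoder features with the upsampled
        # decoder features along the channel dimension.
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                 # per-pixel mask logits
```

The input height and width need to be divisible by 4 here, otherwise the upsampled and skip tensors won't line up for the concatenation. A plain "encoder + upsampling" model is what you get if you drop the two `torch.cat` calls, and that is the setup I'd expect to struggle.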
I'm still interested in using ResNet, though my gut feeling is that the images are quite uniform and the goal is simple, so the extra layers in ResNet won't help much.
BTW, you won't get a good score without using full resolution. (I think somebody posted an analysis in the forum of the theoretical score limit with scaled-down images.) And to train on full-resolution images, you'll need extra tricks.
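I don't have that forum analysis at hand, but here is a rough pure-Python sketch of how one could estimate such a ceiling yourself. It assumes a Dice-style overlap score, block-majority downsampling, and nearest-neighbour upsampling, all of which are my assumptions rather than the method from that post. The idea: take a ground-truth mask, shrink it, blow it back up, and score it against the original. Even a model that is perfect at the low resolution can't do much better than that.

```python
def make_disk(size, cx, cy, radius):
    # Synthetic binary mask (a filled disk) standing in for a real ground truth.
    return [[1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
             for x in range(size)]
            for y in range(size)]


def downsample(mask, factor):
    # Majority vote over each factor x factor block (ties count as foreground).
    n = len(mask) // factor
    out = []
    for by in range(n):
        row = []
        for bx in range(n):
            total = sum(mask[by * factor + dy][bx * factor + dx]
                        for dy in range(factor) for dx in range(factor))
            row.append(1 if 2 * total >= factor * factor else 0)
        out.append(row)
    return out


def upsample(mask, factor):
    # Nearest-neighbour upsampling: repeat every pixel factor x factor times.
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def dice(a, b):
    # Dice coefficient between two binary masks of equal size.
    inter = sum(x & y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    total = sum(map(sum, a)) + sum(map(sum, b))
    return 2 * inter / total if total else 1.0


def dice_ceiling(mask, factor):
    # Score of a "perfect" low-resolution prediction scaled back to full size.
    return dice(mask, upsample(downsample(mask, factor), factor))


if __name__ == "__main__":
    gt = make_disk(128, 64, 64, 40)
    for f in (1, 2, 4, 8):
        print(f, dice_ceiling(gt, f))
```

Running the same thing on the real training masks instead of a synthetic disk would give a more honest number, and smoother upsampling (e.g. bilinear on soft predictions) would likely raise the ceiling a bit.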
Check my earlier post for a summary of the approaches used by the winning teams: