Objects of varying sizes in image segmentation


I am curious whether `unet_learner`, with its default loss function, is reliable for learning the same object at different scales.
So, for example, say you have a bike that appears at different sizes in your images.
You would like your U-Net to learn to segment it easily regardless of size.
Even in a simpler setup without perspective effects, I've seen that the U-Net doesn't automatically generalize to segmenting objects at scales it hasn't seen.
Could this be because I haven't trained it on enough smaller examples, or is there something more inherent to the U-Net architecture that limits it across scales?
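For context, one thing I've been considering before blaming the architecture is scale-jitter augmentation, i.e. randomly resizing each image/mask pair during training so the network sees the object at many sizes. Here is a minimal numpy sketch of the idea (the function name, scale range, and pad/crop policy are my own choices, not fastai's API; fastai's `aug_transforms` has `min_zoom`/`max_zoom` arguments that serve a similar purpose):

```python
import numpy as np

def random_scale_pair(img, mask, scales=(0.5, 1.0, 2.0), rng=None):
    """Resize image and mask by the same random factor (nearest neighbour,
    so mask labels stay discrete), then zero-pad or centre-crop back to the
    original size so batch shapes still match."""
    rng = rng if rng is not None else np.random.default_rng()
    s = rng.choice(scales)
    h, w = img.shape[:2]
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    # nearest-neighbour index maps for the resize
    ys = np.arange(nh) * h // nh
    xs = np.arange(nw) * w // nw
    img_s, mask_s = img[ys][:, xs], mask[ys][:, xs]
    # pad with zeros (downscale) or centre-crop (upscale) back to (h, w)
    out_i, out_m = np.zeros_like(img), np.zeros_like(mask)
    ch, cw = min(h, nh), min(w, nw)
    oy, ox = (h - ch) // 2, (w - cw) // 2   # offset in the output
    iy, ix = (nh - ch) // 2, (nw - cw) // 2 # offset in the scaled input
    out_i[oy:oy + ch, ox:ox + cw] = img_s[iy:iy + ch, ix:ix + cw]
    out_m[oy:oy + ch, ox:ox + cw] = mask_s[iy:iy + ch, ix:ix + cw]
    return out_i, out_m
```

Applying this per batch would at least tell me whether the failure is a data-distribution issue (it goes away with jitter) or something deeper in the U-Net itself.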

Thanks in advance.
PS: I know this question sounds broad, and a broad answer is fine with me.