Article or paper regarding the size increase during training (lesson 3)

In the satellite images problem in lesson 3, Jeremy starts off training on size 64 images, then moves to size 128, and finally to size 256. I don’t understand the theory behind this. Is there a source or explanation for why this method works?
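To make the question concrete, here is a minimal sketch of the schedule as I understand it. It is not the lesson's actual code: `progressive_resize`, `train_one_epoch` and `fake_epoch` are hypothetical names, and the number of epochs per size is an assumption.

```python
from typing import Callable, Iterable, List, Tuple


def progressive_resize(
    sizes: Iterable[int],
    epochs_per_size: int,
    train_one_epoch: Callable[[int], float],
) -> List[Tuple[int, float]]:
    """Train at each image size in turn, smallest first.

    `train_one_epoch` is a hypothetical callback that trains the SAME model
    for one epoch on images resized to `size` and returns the loss. The
    weights carry over between sizes; only the input resolution changes.
    """
    history = []
    for size in sizes:
        for _ in range(epochs_per_size):
            loss = train_one_epoch(size)
            history.append((size, loss))
    return history


def fake_epoch(size: int) -> float:
    # Placeholder standing in for real training; returns a made-up loss.
    return 1.0 / size


# Sizes from the lesson: 64, then 128, then 256. Two epochs each is an assumption.
history = progressive_resize([64, 128, 256], epochs_per_size=2, train_one_epoch=fake_epoch)
```

In a real setup the callback would rebuild the data loader at the new size and keep training the existing weights.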

The closest thing I found is the VGG paper, which uses multi-scale training: the training image size is randomly sampled between sz=256 and 512. This leads to a performance improvement.

Delving Deep into Rectifiers applies the same technique and also shows a performance gain.
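Here is a rough sketch of what that multi-scale sampling could look like for a single training image. It only computes the geometry (resized dimensions and crop box), not the actual image resize. The function name `jittered_crop_box` is hypothetical, and the 224 crop size and the "rescale the shorter side to the sampled size" rule are my reading of the VGG setup, so treat them as assumptions.

```python
import random


def jittered_crop_box(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Pick a random training scale and a random crop for one image.

    Returns (s, (new_w, new_h), (left, top, right, bottom)) where `s` is the
    sampled shorter-side length and the box is in resized-image coordinates.
    """
    # Sample the target length of the shorter side; randint is inclusive.
    s = rng.randint(s_min, s_max)

    # Rescale so the shorter side becomes `s`, preserving aspect ratio.
    scale = s / min(width, height)
    new_w = max(crop, round(width * scale))
    new_h = max(crop, round(height * scale))

    # Take a fixed-size crop at a random position inside the resized image.
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return s, (new_w, new_h), (left, top, left + crop, top + crop)


# Example: a 640x480 image with a seeded generator for reproducibility.
s, (new_w, new_h), box = jittered_crop_box(640, 480, rng=random.Random(0))
```

Because the crop size stays fixed while the scale varies, the network sees objects at different apparent sizes. That is related to, but not the same as, the fixed 64 → 128 → 256 schedule from the lesson, where the input size itself changes between stages.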