In the satellite images problem in lesson 3, Jeremy starts off training on size 64 images, then size 128, and finally size 256. I don’t understand the theory behind this — is there a source or explanation for why this method works?
The VGG paper (https://arxiv.org/pdf/1409.1556v6.pdf) uses multi-scale training, randomly sampling the training scale between sz=256 and sz=512, which leads to a performance improvement. That was the closest thing I found.
Delving Deep into Rectifiers (https://arxiv.org/pdf/1502.01852.pdf) applies the same technique and also reports a performance gain.
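To make the two schedules concrete, here's a minimal pure-Python sketch (function names are my own, not from the lesson or the papers): one generator producing the progressive-resizing plan from the lesson (64 → 128 → 256), and one function doing VGG-style scale jittering, where the training scale is sampled uniformly per batch.

```python
import random

def progressive_sizes(schedule):
    """Yield (epoch, size) pairs for a progressive-resizing schedule.

    `schedule` is a list of (size, n_epochs) pairs: train n_epochs
    at each size before moving to the next, larger one.
    """
    epoch = 0
    for size, n_epochs in schedule:
        for _ in range(n_epochs):
            yield epoch, size
            epoch += 1

def vgg_multiscale_size(s_min=256, s_max=512, rng=random):
    """VGG-style multi-scale training: sample the training scale S
    uniformly from [s_min, s_max] (a fresh draw per batch)."""
    return rng.randint(s_min, s_max)

# Progressive resizing as in the lesson: two epochs per size, say.
plan = list(progressive_sizes([(64, 2), (128, 2), (256, 2)]))
print(plan)  # sizes only ever grow: 64, 64, 128, 128, 256, 256

# VGG-style jittering: every draw lands in [256, 512].
sizes = [vgg_multiscale_size() for _ in range(5)]
print(sizes)
```

The intuition in both cases is the same: small images make early epochs cheap and act as a form of augmentation, while later large-image epochs let the network refine fine-grained features; the progressive version simply orders the scales from small to large instead of sampling them randomly.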