Thanks! But I have another thought on this.
In lesson 6 Jeremy first trains the model with low-resolution images and then does a second round with higher-resolution ones.
He also mentions this approach of incrementally raising the resolution in lesson 1 or 2 as a way to achieve better training.
If I understood convolutions correctly, the deeper we go into the layers, the more complex the structures that are "recognized" and generate activations: we start with simple edges, and these features then become parts of more complex patterns, like circles, repeating rectangles and so on.
At the end of the network, each channel of the final tensor (at a coarse resolution) contains the activations of one of these particular sets of complex features. We average pool or max pool it down to size 1 in height and width to get the final global level of activation of each feature, independently of its position.
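To make that pooling step concrete, here is a minimal sketch in plain Python (the tensor values and channel meanings are invented for illustration; in practice this would be a single `AdaptiveAvgPool2d`-style call on a real activation tensor):

```python
# Minimal sketch of global average pooling: collapse each channel's
# H x W activation map to a single number, discarding position.
# Toy tensor of shape (C=2, H=2, W=2) as nested lists; values invented.
activations = [
    [[1.0, 3.0],
     [0.0, 4.0]],   # channel 0: activations spread across the map
    [[0.0, 0.0],
     [0.0, 8.0]],   # channel 1: fires strongly in one corner only
]

def global_avg_pool(tensor):
    # Average every value in each channel -> one activation per channel.
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0]))
            for ch in tensor]

pooled = global_avg_pool(activations)
print(pooled)  # [2.0, 2.0]
```

Note that both channels pool to the same value even though their activations sit in different places, which is exactly the "independently of position" property.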
The classification is then done by a final fully connected layer, which estimates the best combination of complex features that uniquely defines the target class.
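A toy sketch of that last step, assuming a pooled vector of per-channel activations feeding a single linear layer (all weights and values are invented, one weight vector per class):

```python
# Toy sketch of the final classification step: a fully connected layer
# is just a weighted sum of the pooled per-channel activations plus a
# bias, one weight vector per class. All numbers are invented.

pooled = [0.9, 0.1, 0.7]  # hypothetical global activations of 3 features

class_weights = {
    "car":   [1.0, -0.5, 0.8],  # features 0 and 2 favor "car", 1 counts against
    "plane": [-0.2, 1.2, 0.1],
}

def score(weights, features, bias=0.0):
    # Linear combination of pooled features, i.e. one FC-layer output unit.
    return sum(w * f for w, f in zip(weights, features)) + bias

scores = {cls: score(w, pooled) for cls, w in class_weights.items()}
print(max(scores, key=scores.get))  # prints "car"
```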
But in my understanding, this structure recognition is not invariant with respect to the pixel size of the structures themselves, due to the fixed size of the convolution kernels.
Therefore (I know the following may be an over-simplification), the set of complex features that defines a car occupying 80% of a 640x480 image may be some black round things of about 20 pixels (the tires), some trapezoid glassy-looking things of around 30 pixels (the car windows), and some smaller round things (the front lights).
But a car that occupies 80% of a 2048x1300 or whatever 4k image is probably defined by a different set of features, maybe the structure of the door handles or the form factor of a Ford logo. And maybe the features trained to be relevant for a car in the 640x480 pass will instead match a smaller car in a 4k image.
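A quick back-of-the-envelope check of this scale argument (all numbers are made-up assumptions, not measurements from a real network): a structure covering a fixed fraction of the scene grows linearly in pixels with image width, while a unit's receptive field at a given layer stays fixed in pixels.

```python
# Back-of-the-envelope sketch of the scale mismatch.
# TIRE_FRACTION and RECEPTIVE_FIELD are illustrative assumptions.

def feature_px(image_width, fraction_of_image):
    # Pixel size of a structure covering a fixed fraction of the image width.
    return round(image_width * fraction_of_image)

TIRE_FRACTION = 0.03    # assume a tire spans ~3% of the image width
RECEPTIVE_FIELD = 32    # assume a mid-layer unit "sees" ~32 px

for width in (640, 2048):
    tire = feature_px(width, TIRE_FRACTION)
    print(f"{width}px image: tire is ~{tire}px, "
          f"receptive field covers {RECEPTIVE_FIELD / tire:.1f}x the tire")
```

With these made-up numbers the same tire fits comfortably inside the receptive field at 640px but spills well outside it at 2048px, so the same mid-layer unit cannot respond to it the same way.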
Therefore, training at different resolutions in different iterations matters not because it teaches the network a better visual concept of a generic car, but because it lets the network recognize cars of different sizes relative to the kernel height and width. In that sense it is not so different from the zoom transformation used for data augmentation.
Thanks for reading this long post; please share your thoughts and let me know if I am right or wrong.
F