I am unable to understand how the layers closest to the image/pixels come to detect simple features like edges or straight lines. The weights of these initial layers are chosen randomly, so what makes them adjust to these simple features?
The reason is that they cannot learn to detect features more complex than edges or lines.
Indeed, the first layer is most of the time made up of 3x3 filters (sometimes larger, like 5x5, but this doesn't change the intuition much). It means that the filters it can learn are very low resolution, so the features they extract must be very simple.
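To make this concrete, here is a minimal NumPy sketch of what a single 3x3 filter can respond to. The filter weights are hand-written for illustration (a Prewitt-style vertical-edge detector), not taken from a trained network, and `conv2d_valid` is a hypothetical helper name, not a library function.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` without padding (cross-correlation,
    which is what deep learning libraries usually call convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value only "sees" a 3x3 patch of the input.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 6x6 image: dark left half (0), bright right half (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Assumed, hand-picked weights: negative on the left, positive on the right.
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

response = conv2d_valid(image, vertical_edge)
print(response)  # non-zero only where the 3x3 window straddles the edge
```

With only nine weights looking at a 3x3 patch, there is not much more a filter can encode than "brighter on one side than the other", which is exactly an edge.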
To extract more complex features directly, we would need larger, higher-resolution filters, but that would be too costly for our network. This is why we stack several layers: to increase the complexity and expressiveness of the network. Deeper layers combine the simple features from earlier layers, so they are able to extract increasingly complex features. This hierarchical architecture is what makes neural networks extremely efficient.
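A rough way to see the cost argument is to count weights. The sketch below compares one 7x7 convolution with three stacked 3x3 convolutions, which cover the same 7x7 input area at stride 1. The channel count of 64 is an arbitrary assumption for illustration, biases are ignored, and `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one conv layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

channels = 64  # assumed channel count, for illustration only

one_7x7 = conv_weights(7, channels, channels)
three_3x3 = 3 * conv_weights(3, channels, channels)

print(one_7x7, three_3x3)
```

The stack of small filters needs roughly half the weights of the single large one here, and it gets extra non-linearities between the layers.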
It is described visually, and better than I did, in an old fastai notebook: https://github.com/fastai/courses/blob/master/deeplearning1/nbs/convolution-intro.ipynb
Take a look at this fantastic resource for an interactive introduction to filters: http://setosa.io/ev/image-kernels/
Maybe this paper can also help you dig a little deeper into CNNs: https://arxiv.org/pdf/1311.2901.pdf
The visual field of the initial layer is very small… It can't see more, so it learns simple patterns. The deeper you go, the more the receptive field grows, thanks to max pooling, strided convs… You get to see the big picture… HTH.
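If it helps, the growth of the receptive field can be worked out with a few lines of Python. This is a sketch using the standard recurrence for stacked conv/pool layers; the layer list is a made-up example, not any particular architecture, and `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first.
    Returns the receptive field size (in input pixels) after each layer."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

# Assumed toy stack: conv3x3, conv3x3, maxpool2, conv3x3, maxpool2, conv3x3
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (2, 2), (3, 1)]
sizes = receptive_field(layers)
print(sizes)  # [3, 5, 6, 10, 12, 20]
```

The first layer only sees 3x3 pixels, while the last conv in this toy stack sees a 20x20 patch, and each pooling or strided layer makes the following layers grow faster.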
Thanks a lot for this detailed explanation. I now understand that it is the low resolution of these filters (because of the small filter or kernel size) that limits them to resolving simple shapes like horizontal or vertical edges.
This is also in line with how human vision processes information early on. This is a very interesting and accessible paper on the subject: The Plenoptic Function and the Elements of Early Vision
In the paper you will find the same kind of filters that the initial layers of your network learn.
Thanks for this reference