I understand how CNNs learn image features hierarchically. But how would a deep neural network without convolutions learn them? Since each neuron in a layer looks at every neuron in the previous layer, the weights are spread across the whole previous layer. In what way would such a network learn to recognise objects or features?
In theory, an FC layer could learn the exact same features as a conv layer. (You can implement a conv layer as an FC layer, and vice versa.) However, it’s unlikely that it would.
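To make the "conv layer as an FC layer" point concrete, here is a minimal NumPy sketch. The 6x6 image, the 3x3 kernel, and the helper names `conv2d_valid` and `conv_as_fc_matrix` are assumptions made up for illustration; it only covers a single channel with no padding or stride.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Direct sliding-window cross-correlation (what frameworks call convolution)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv_as_fc_matrix(kernel, H, W):
    """Build the dense FC weight matrix that reproduces conv2d_valid on an H x W image."""
    kh, kw = kernel.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    weights = np.zeros((out_h * out_w, H * W))
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j                      # index of the output unit
            for di in range(kh):
                for dj in range(kw):
                    col = (i + di) * W + (j + dj)    # index of the flattened input pixel
                    weights[row, col] = kernel[di, dj]
    return weights

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6))     # hypothetical input
kernel = rng.normal(size=(3, 3))    # hypothetical filter

fc_weights = conv_as_fc_matrix(kernel, 6, 6)
fc_out = fc_weights @ image.reshape(-1)      # the FC layer sees a 1D vector
conv_out = conv2d_valid(image, kernel)

print(fc_weights.shape)
print(np.allclose(fc_out.reshape(conv_out.shape), conv_out))
print(np.count_nonzero(fc_weights), "of", fc_weights.size, "weights are non-zero")
```

The FC matrix is mostly zeros and repeats the same nine kernel values in every row. An FC layer trained from scratch is free to arrive at that pattern, but nothing pushes it there.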
The convolution applies certain restrictions to the learning process in order to guide it to learn more effectively (each output only looks at a small group of neighbouring pixels, and the same weights are reused at every position), whereas an FC layer has total freedom to learn anything it wants.
The FC layer also does not look at the image data as a 2D matrix but as a 1D vector, so it has no built-in notion of which pixels are neighbours and has to discover any spatial relationships from the data.
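One way to see what those restrictions buy is to count parameters. The sketch below is only a rough illustration: the 32x32 RGB input, the 16 filters, and the two helper functions are assumed values and names, and the FC layer is sized to produce the same number of outputs as a "same"-padded conv layer.

```python
def conv_layer_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights are shared across positions, so the count ignores image size."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels

def fc_layer_params(in_features, out_features):
    """Every output unit has its own weight for every input value, plus a bias."""
    return in_features * out_features + out_features

# Hypothetical sizes: 32x32 RGB image, 16 output feature maps.
height, width, in_channels, out_channels = 32, 32, 3, 16

conv_count = conv_layer_params(3, 3, in_channels, out_channels)
fc_count = fc_layer_params(height * width * in_channels,
                           height * width * out_channels)

print(conv_count)  # 448
print(fc_count)    # 50348032
```

With these assumed sizes the FC layer has roughly 100,000 times more parameters to fit for the same output shape.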
But the FC layer still picks up interesting relationships between the pixels from the input image. It’s just not as effective as a convolutional layer at picking up the relationships that really matter.
So if there are 5 convolutional layers, the first layer would learn simple features like edges, the second more complex features, and so on. What about the FC network? How would it learn? What would the first layer of an FC network learn?
I don’t know if the FC layers learn something that makes sense to a human being. All I can say is that they learn some kind of (non-linear) transformation.
One way to find out what an FC layer has learned is to do visualizations, i.e. see what neurons get activated when. But since an FC layer is a 1D vector it won’t make as much sense as the 2D activations that you see in a convolutional network.
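As a rough sketch of what such a visualization could look like for the first FC layer only: each neuron there has one weight per input pixel, so its weight row can be folded back into the image grid, and you can also check which inputs activate it most. The weights and inputs below are random stand-ins, not a trained model, and all sizes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, hidden_units = 28, 28, 8   # hypothetical sizes

# Stand-in for trained first-layer weights, shape (hidden_units, height * width).
first_layer_weights = rng.normal(size=(hidden_units, height * width))

# Weight view: fold each neuron's 1D weight row back into a 2D image.
weight_images = first_layer_weights.reshape(hidden_units, height, width)

# Activation view: which input in a batch excites each neuron the most?
batch = rng.normal(size=(100, height * width))                 # stand-in inputs
activations = np.maximum(batch @ first_layer_weights.T, 0.0)   # ReLU
top_input_per_neuron = activations.argmax(axis=0)

print(weight_images.shape)
print(top_input_per_neuron)
```

This only works directly for the first layer. Deeper FC layers take the previous layer's activations as input, which have no 2D layout to fold back into.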
Makes sense. Thanks
To add a bit to @machinethink’s answer: a fully connected neural network (FCNN) could learn the same things as a CNN, but it is computationally much heavier.
It is possible to use only an FCNN for some tasks. I have done this in an application, specifically human age and gender detection, and you can follow the learning process through the learning curves (accuracy and loss).
To build one, use only the FC layers of your favorite framework (TensorFlow, Keras, PyTorch, etc.).
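For illustration, here is a minimal sketch of the forward pass of an FC-only image classifier, written in plain NumPy so it does not depend on any one framework. The layer widths, the 32x32 RGB input, and the two output classes are assumptions, not the setup used in the age and gender application; in a framework you would stack its dense/linear layers instead.

```python
import numpy as np

def init_mlp(layer_sizes, rng):
    """Create (weights, bias) pairs for consecutive fully connected layers."""
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]

def mlp_forward(params, images):
    """Flatten each image to 1D, then apply FC layers with ReLU in between."""
    x = images.reshape(images.shape[0], -1)
    for index, (weights, bias) in enumerate(params):
        x = x @ weights + bias
        if index < len(params) - 1:      # no activation after the last layer
            x = np.maximum(x, 0.0)
    return x                             # raw class scores (logits)

rng = np.random.default_rng(0)
images = rng.random(size=(4, 32, 32, 3))            # hypothetical batch of 4 RGB images
params = init_mlp([32 * 32 * 3, 256, 64, 2], rng)   # hypothetical widths, 2 classes
logits = mlp_forward(params, images)

print(logits.shape)  # (4, 2)
```

Training would add a loss and gradient updates on top of this, which is what the accuracy and loss curves would track.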
From my point of view, a fully connected network would be one of the ideal things to use if we had the required computational resources, since it is heavy, because it keeps all the information about the input available inside its layers throughout the learning process.
This is why an FC layer is often used at the end of a CNN to reconnect the learned features together before prediction.
I’d like to add my 2 cents.
It seems to me that what makes the CNN architecture work so well at learning a compact, low-dimensional representation, one that can be used to solve different non-trivial image tasks (classification, detection, …), is its inductive bias: a hierarchical structure (a limited receptive field plus spatial pooling, so that the highest conv layer covers the full field of view) and weight sharing, both of which are biologically inspired by the animal visual cortex.
If nature has selected this kind of model for visual perception instead of an “unbiased” fully connected one, there might be a reason.
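To illustrate the "limited receptive field plus spatial pooling" point above, here is a small sketch of how the receptive field of one unit grows as conv and pooling layers are stacked. The helper `receptive_fields` and the particular stack (3x3 conv with stride 1, 2x2 pooling with stride 2, repeated three times) are assumptions chosen for illustration.

```python
def receptive_fields(layers):
    """Return the receptive field (in input pixels) after each layer.

    layers: list of (kernel_size, stride) pairs, applied in order.
    """
    field, jump = 1, 1   # jump = distance in input pixels between adjacent units
    sizes = []
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(field)
    return sizes

conv3, pool2 = (3, 1), (2, 2)     # hypothetical layer types: (kernel_size, stride)
stack = [conv3, pool2] * 3

print(receptive_fields(stack))  # [3, 4, 8, 10, 18, 22]
```

Each unit starts out seeing only a 3x3 patch, and the view widens layer by layer, whereas a unit in the first FC layer sees the whole image from the start.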