Does anyone have a good understanding of what goes on under the hood when training a convolutional layer? How do the filter weights actually get optimized? Any hints/links will be really helpful.

What I am unable to wrap my head around is that while each filter has a very small number of parameters (say 3x3), it ends up talking to all the pixels of the input image to produce the new filtered/convolved image. I am thinking of the mapping between these two images (with the convolutional layer in between them) as one big matrix acting on the flattened input, much bigger than 3x3. Because the same filter is applied at every position, every element of this matrix should be either zero or one of the original filter parameters. What does this matrix then look like?
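
To make the matrix I have in mind concrete, here is my current guess written out as a minimal pure-Python sketch. It assumes stride 1, no padding, a single channel, and the cross-correlation convention (no kernel flip); the names `conv2d_valid` and `conv_matrix` are my own and not from any library.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding, no kernel flip)."""
    H, W = len(image), len(image[0])
    k = len(kernel)
    out = []
    for r in range(H - k + 1):
        row = []
        for c in range(W - k + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k)))
        out.append(row)
    return out


def conv_matrix(kernel, H, W):
    """Build the big matrix M so that M @ flatten(image) == flatten(conv output).

    Each row of M corresponds to one output pixel. It is zero everywhere except
    at the k*k input pixels under the filter, where it holds the filter weights.
    """
    k = len(kernel)
    out_h, out_w = H - k + 1, W - k + 1
    M = [[0.0] * (H * W) for _ in range(out_h * out_w)]
    for r in range(out_h):
        for c in range(out_w):
            for i in range(k):
                for j in range(k):
                    # Output pixel (r, c) reads input pixel (r + i, c + j).
                    M[r * out_w + c][(r + i) * W + (c + j)] = kernel[i][j]
    return M


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
kernel = [[1.0, 0.0, -1.0],
          [2.0, 0.0, -2.0],
          [1.0, 0.0, -1.0]]

M = conv_matrix(kernel, 4, 4)                      # 4 output pixels x 16 input pixels
flat = [p for row in image for p in row]           # flatten the image row by row
via_matrix = [sum(m * x for m, x in zip(row, flat)) for row in M]
direct = [v for row in conv2d_valid(image, kernel) for v in row]
print(via_matrix == direct)
```

If this is right, the matrix is mostly zeros, and the same nine numbers repeat in every row, just shifted to a different set of columns.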

Alternatively, how do you train a neural network when the same parameters are shared between different inputs and outputs?
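
My tentative understanding is that a shared weight simply collects a gradient contribution from every position where it is used, and those contributions are summed before the update. Below is a minimal sketch of that idea, assuming a squared-error loss, a single 2x2 filter, stride 1 and no padding; `kernel_grad` and the toy data are hypothetical, and the finite-difference check is only there to test my guess.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding, no kernel flip)."""
    k = len(kernel)
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(len(image[0]) - k + 1)]
            for r in range(len(image) - k + 1)]


def loss(image, kernel, target):
    """Half the sum of squared errors between conv output and target."""
    out = conv2d_valid(image, kernel)
    return 0.5 * sum((o - t) ** 2
                     for ro, rt in zip(out, target) for o, t in zip(ro, rt))


def kernel_grad(image, kernel, target):
    """Gradient of the loss w.r.t. each shared filter weight."""
    out = conv2d_valid(image, kernel)
    k = len(kernel)
    grad = [[0.0] * k for _ in range(k)]
    for r, row in enumerate(out):
        for c, o in enumerate(row):
            delta = o - target[r][c]          # dLoss/dOut at output pixel (r, c)
            for i in range(k):
                for j in range(k):
                    # Weight (i, j) touched input pixel (r + i, c + j) here,
                    # so it accumulates one contribution per output pixel.
                    grad[i][j] += delta * image[r + i][c + j]
    return grad


image = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6],
         [0.7, 0.8, 0.9]]
target = conv2d_valid(image, [[1.0, 0.0], [0.0, -1.0]])  # output of a "true" filter
kernel = [[0.0, 0.0], [0.0, 0.0]]                         # starting guess

grad = kernel_grad(image, kernel, target)

# Finite-difference check of the summed gradient.
eps = 1e-5
numeric = [[0.0] * 2 for _ in range(2)]
for i in range(2):
    for j in range(2):
        plus = [row[:] for row in kernel]
        minus = [row[:] for row in kernel]
        plus[i][j] += eps
        minus[i][j] -= eps
        numeric[i][j] = (loss(image, plus, target) - loss(image, minus, target)) / (2 * eps)

# One plain gradient-descent step on the four shared weights.
lr = 0.01
loss_before = loss(image, kernel, target)
kernel = [[kernel[i][j] - lr * grad[i][j] for j in range(2)] for i in range(2)]
loss_after = loss(image, kernel, target)
print(loss_before, loss_after)
```

So in this sketch only four numbers are ever updated, even though each one is used at every output position.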

Hoping my question makes sense, and somebody out there has an answer for me.