I was watching @jcjohnss' new videos on Computer Vision and now I'm confused about Fixup init. If the last conv layer in a residual branch is initialized to zero, then there would be no symmetry breaking amongst the filters, and all the filters learned in that layer would be identical.
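To make the setup concrete, here is a minimal sketch of the kind of block I have in mind, a toy PyTorch residual branch (not the reference Fixup code) whose last conv is zero-initialized so the block starts out as the identity:

```python
import torch
import torch.nn as nn

class ResidualBranch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)  # default random init
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)  # last conv in the branch
        nn.init.zeros_(self.conv2.weight)  # Fixup-style: zero-init the last conv
        nn.init.zeros_(self.conv2.bias)
        self.relu = nn.ReLU()

    def forward(self, x):
        # at init, conv2 outputs zeros, so the block output is just the identity x
        return x + self.conv2(self.relu(self.conv1(x)))
```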
To clear another doubt, would all elements of a single filter receive the same gradients as well? No, because the local gradient on each weight equals the corresponding activation from the previous layer, and those activations differ, so there is symmetry breaking within a filter.
As long as you don't have two consecutive layers both initialized at zero, and as long as there is at least one randomly initialized layer on any path from the input to the output, the model will be able to learn.
But here each filter would receive the same gradients, since every filter slides over the same input activations. So in the last conv layer that was initialized with zeros, all the learned filters should be identical. Are redundant features being learnt in the residual layers when using Fixup?
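One way to probe this question empirically: run a single backward pass through the toy block sketched above and compare the per-filter gradients on the zero-initialized conv (this reuses the hypothetical ResidualBranch class from the sketch, not anything from the Fixup paper):

```python
# reuses ResidualBranch from the sketch above
block = ResidualBranch(channels=4)
x = torch.randn(2, 4, 8, 8)

loss = block(x).pow(2).mean()  # arbitrary scalar loss, just to get gradients flowing
loss.backward()

g = block.conv2.weight.grad    # shape: (out_channels, in_channels, 3, 3)
# one gradient norm per output filter, to see whether the filters
# of the zero-initialized conv really receive the same gradient
print(g.view(g.size(0), -1).norm(dim=1))
```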