We do not initialize weight matrices with zeros because the symmetry between neurons is never broken: every neuron receives the same gradient during the backward pass, and therefore the same parameter update.

But it is safe to initialize the bias vectors with zeros; they still get updated properly during training.

Why is it safe to do so, and not the opposite?

Why can’t we initialize bias vectors with random numbers and weight matrices with zeros? (Question-1)

My initial thought is that a vector has shape $(n, 1)$ where $n \in \mathbb{N}$, which is not true for a matrix, and thus symmetry does not really come into play in the case of vectors.

But that does not resolve the point that each layer of a deep neural network has its own weight matrix, and there is no need for symmetry across different layers.

So, when we talk about symmetry are we talking about symmetry across different rows of the same matrix? (Question-2)

Column-wise symmetry should not matter much, as the columns correspond to different training examples (for the first hidden layer). Does column-wise symmetry disturb the training process much in the case of hidden layers other than the first one? (Question-3)

The symmetry is easier to understand when you don’t think of W as a matrix, but instead as vectors stacked on top of each other. Each vector is multiplied with the input, and that product is the output of a single neuron in the layer. If all the vectors of W are the same (zero or anything else), then each vector times the input will always give the same value. So you are essentially calculating the output of a single neuron multiple times, and this will not help your network learn; you basically have a layer with just one hidden unit. This is the symmetry you are trying to break: you want each neuron to calculate a different output, hence you want the stacked vectors to be different. This is why random initialization for weights is common.
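A minimal NumPy sketch of this (the layer sizes, tanh activation, and upstream gradient are made up for illustration): when every row of W is identical, every neuron computes the same output and receives the same gradient row, so a gradient step keeps the rows identical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy layer: 3 hidden units, 4 inputs. Every row of W is identical,
# so every neuron is the same function of the input.
W = np.zeros((3, 4))          # zeros, or any constant: same effect
b = np.zeros((3, 1))
x = rng.standard_normal((4, 1))

z = W @ x + b                 # all three entries of z are identical
a = np.tanh(z)                # activations are identical too

# Backward pass with a made-up upstream gradient dL/da:
dL_da = np.ones((3, 1))       # each neuron gets the same signal
dL_dz = dL_da * (1 - a**2)    # tanh derivative; entries still identical
dL_dW = dL_dz @ x.T           # every row of the gradient is identical

print(np.allclose(z, z[0]))         # True
print(np.allclose(dL_dW, dL_dW[0])) # True: rows of W stay equal forever
```

Since the rows of `dL_dW` are equal, `W - lr * dL_dW` again has equal rows, and the layer keeps behaving like a single neuron no matter how long you train.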

The problem I see with initializing the bias to random values and the weights to zero to break symmetry is the following:
the output of a layer is y = Wx + b. If the weights are zero and the bias is random, each neuron computes an output that is independent of the input. It could work in the long run, but the training time to get this to converge would be very long. You want the network to learn from the inputs straight away. This is my guess.
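You can see the "independent of the input" part directly in a small sketch (shapes and seed chosen arbitrarily): with W = 0, the layer returns the same output b for any two different inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Zero weights, random bias: the layer output ignores the input entirely.
W = np.zeros((3, 4))
b = rng.standard_normal((3, 1))

x1 = rng.standard_normal((4, 1))
x2 = rng.standard_normal((4, 1))

y1 = W @ x1 + b   # = b
y2 = W @ x2 + b   # = b as well

print(np.allclose(y1, y2))   # True: same output for different inputs
```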

It’s not OK to initialize the weight matrices to zero because the output of the model will be constant for every input. The derivative of the loss with respect to the weights will be zero, therefore it cannot learn by gradient descent.
(Edit: Sorry - that’s wrong. It’s the derivative of the loss with respect to the input x that will be zero, so no gradient flows back to the earlier layers.)
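To illustrate the corrected claim with a quick sketch (the upstream gradient here is made up): for a linear layer y = Wx + b, the gradient that flows back to the input is dL/dx = Wᵀ dL/dy, which is zero when W is zero, even though dL/dW = (dL/dy) xᵀ can itself be nonzero.

```python
import numpy as np

rng = np.random.default_rng(2)

W = np.zeros((3, 4))
x = rng.standard_normal((4, 1))
dL_dy = rng.standard_normal((3, 1))   # made-up upstream gradient

dL_dx = W.T @ dL_dy    # all zeros: nothing propagates to earlier layers
dL_dW = dL_dy @ x.T    # generally nonzero: this layer itself could update

print(np.allclose(dL_dx, 0))   # True
```

So a single zero-initialized linear layer can still learn, but in a deep network the layers behind it receive no gradient signal.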

What is the best way to initialize the bias? I don’t know! It may have something to do with shifting the outputs into a “sweet spot” of the activation function, but I am just making that up.

I haven’t read much about this, but what I have seen suggests that not starting with zeros for the biases may actually be worse. I don’t think it’s been studied a great deal, AFAICT.

So, it is more about redundancy than “symmetry” per se: when multiple neurons have the same weights, they become essentially redundant.

I was reading the LeCun paper on backprop, and this is the insight that I gained.

Yeah, that is essentially it. You could call it redundancy, or the neurons being symmetrical. The terminology isn’t that important; as long as you know why it happens, you can explain it with either term.