Init Bias with Kaiming

Can someone clarify the following things about bias init?

  1. Why do we init the bias as zero? Can't we init it in the same way we init the weights, using Kaiming?

  2. Similarly, in Batch Norm we init the parameter gamma as ones and beta (which is the bias) as zeros. Why don't we init them with Kaiming like we do for the weights?
    In a nutshell, every learnable parameter (weight or bias) should be init using Kaiming or other init methods, yet we init biases as zeros. So I'm curious to understand this.
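
For concreteness, here is a minimal PyTorch sketch of the scheme being described (the layer sizes are arbitrary choices of mine): Kaiming for the weights, zeros for the biases, and ones/zeros for the BatchNorm gamma/beta (which are also PyTorch's defaults for BatchNorm).

```python
import torch.nn as nn

# A linear layer followed by batch norm; the sizes (512 -> 256) are arbitrary.
layer = nn.Linear(512, 256)
bn = nn.BatchNorm1d(256)

# Weights: Kaiming init.
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
# Bias: zeros, which is the convention being asked about.
nn.init.zeros_(layer.bias)

# BatchNorm: gamma (bn.weight) -> 1, beta (bn.bias) -> 0.
nn.init.ones_(bn.weight)
nn.init.zeros_(bn.bias)
```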

It is very simple arithmetic … you should write down what kind of operations take place in a layer. Then you will be able to answer your question by yourself :wink:


tl; dr: As long as you initialize the weights properly (randomly), the output from each of the neurons will be different. Thus, the gradients that are propagated back would be different, and the network will be trained properly. On the other hand, if every weight (in addition to the biases) was also set to 0, the output from each of the neurons, and thus the gradients, will be identical.

Since the purpose of having each neuron generate a different output is solved by setting weights to small, random values, the biases can be set to 0.
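
Here is a minimal PyTorch sketch of that symmetry argument (the tiny 3 → 2 → 1 network and the constant value 0.5 are arbitrary illustrative choices; a constant init is used instead of exact zeros so the identical gradients are visible rather than trivially zero under ReLU):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3)          # a small batch of 3-feature inputs
target = torch.randn(8, 1)

def hidden_weight_grad(init_fn):
    """Build a tiny 3 -> 2 -> 1 ReLU net, apply init_fn to both weight
    matrices (biases stay at 0), and return the gradient of the hidden weights."""
    hidden, out = nn.Linear(3, 2), nn.Linear(2, 1)
    for layer in (hidden, out):
        init_fn(layer.weight)
        nn.init.zeros_(layer.bias)
    loss = ((out(torch.relu(hidden(x))) - target) ** 2).mean()
    loss.backward()
    return hidden.weight.grad

# Constant (symmetric) init: both rows of the gradient are identical,
# so the two hidden neurons can never become different.
print(hidden_weight_grad(lambda w: nn.init.constant_(w, 0.5)))

# Kaiming (random) init: the rows differ, so the neurons learn different features.
print(hidden_weight_grad(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu')))
```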

I’m not sure how the answer will be derived from layer operations. @fabris can you please clarify?


Simplified, the operation is:
w * input + b

My guess would be that the multiplication is the problematic part of the formula, since repeated multiplication is what can cause vanishing or exploding gradients.

In a deep neural network, the output of the simplified operation will be fed to the next layer, and thus will invariably be involved in multiplication.

output1 = w1 * input + b

output2 = w2 * (w1*input + b)

The network is thus a sequence of multiplications, and that’s how the problems of vanishing and exploding gradients arise (more details).
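
A small sketch of that effect on the forward pass (the 512-unit width, 20-layer depth, and the particular weight standard deviations are arbitrary choices of mine; the same reasoning applies to the gradients flowing backward): unless the weight scale is chosen Kaiming-style, the chain of multiplications shrinks or inflates the signal.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 512)   # a batch of activations

def final_std(weight_std, depth=20):
    """Push x through `depth` Linear+ReLU layers whose weights are drawn from
    N(0, weight_std^2) and report the std of the final activations."""
    h = x
    for _ in range(depth):
        layer = nn.Linear(512, 512, bias=False)
        nn.init.normal_(layer.weight, std=weight_std)
        h = torch.relu(layer(h))
    return h.std().item()

print(final_std(0.01))               # too small: activations shrink towards 0 (vanishing)
print(final_std(0.10))               # too large: activations blow up (exploding)
print(final_std((2 / 512) ** 0.5))   # Kaiming std keeps the scale roughly constant
```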

OK, thank you…
My thought process was that since both w and b are learnable parameters of the network, initializing b to a non-zero value could give the network an even better head start than initializing only the weights.
But as per the note given in the link, it looks like this doesn't work the way I was perceiving it.

I think it would be more accurate to use
w2 * (relu(w1*input+b))

By adding bias you are directly manipulating how easily features make it through the activation layer, regardless of the input. We want 0 bias to begin with because we want w1 contributing to the activations, not the bias. The bias is a way for us to give more weight to smaller activations. For example, if we detect feet, legs, eyes and a tail, but then we also detect a few low-resolution scales, then what we are working with is probably not a cat, but maybe a lizard. Bias gives us a way to raise the importance of activations that were “weak.”

Or that is my understanding of it anyway. Only way to really ever have a good grasp is to experiment yourself though :slight_smile:

For example, if we detect feet, legs, eyes and a tail, but then we also detect a few low-resolution scales, then what we are working with is probably not a cat, but maybe a lizard. Bias gives us a way to raise the importance of activations that were “weak.”

Whatever effect bias has will be seen by all the classes, since bias is independent of the features (or you can consider bias to be a regular weight node, with the feature always set to 1). You may want to see this link for details.
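
A quick sketch of the “bias is just a weight on a feature that is always 1” view (the shapes here are arbitrary):

```python
import torch

torch.manual_seed(0)
W = torch.randn(4, 3)   # weights: 4 outputs, 3 input features
b = torch.randn(4)      # one bias per output
x = torch.randn(3)      # a single input

# The usual affine layer.
y1 = W @ x + b

# The same computation with the bias folded in as an extra weight column
# that multiplies a constant feature of 1.
W_aug = torch.cat([W, b.unsqueeze(1)], dim=1)   # shape (4, 4)
x_aug = torch.cat([x, torch.ones(1)])           # shape (4,)
y2 = W_aug @ x_aug

print(torch.allclose(y1, y2))   # True
```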

Yes, it was a bad example I realize in retrospect. My example would more pertain to weights.

wx + b = y
b = 0.5
w = 0.5
0.5x + 0.5 = y

So you would need a very negative activation (x ≤ -1) to cancel out the effect of the bias. It should instead be: assume it is not a cat if there is not a lot of fur.
w_f = fur weight
b_f = fur bias

w_f x + b_f = y
b_f = -0.5
w_f = 0.5
0.5x - 0.5 = y
So you need a lot of fur (x > 1) for the output to make it through the ReLU.
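
Plugging a few values in (a minimal sketch; the numbers just mirror the example above):

```python
import torch

x = torch.linspace(-2, 2, 5)       # "amount of fur" feature: [-2, -1, 0, 1, 2]

# Positive bias: even weak fur evidence makes it through the ReLU.
print(torch.relu(0.5 * x + 0.5))   # -> [0.0, 0.0, 0.5, 1.0, 1.5], nonzero for x > -1

# Negative bias: only strong fur evidence survives the ReLU.
print(torch.relu(0.5 * x - 0.5))   # -> [0.0, 0.0, 0.0, 0.0, 0.5], nonzero only for x > 1
```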

Thank you for the correction.
