I used myModel.summary() to find out the number of trainable parameters that the first Convolution2D has, and this number is 1792.

The input has 3 channels (r,g,b) (I am using theano configuration, therefore channels_first), the layer uses 64 filters, and each filter has the spatial size 3x3.

Therefore there should be 64x3x3x3 = 1728 “filter-matrix-elements”. Since I did not specify otherwise, bias=True applies, which is the default.

I would suspect that the remaining 1792-1728 = 64 parameters come from the bias. Now my question: What exactly are the degrees of freedom for the bias? Looking at the number 64, I would guess that a constant is added to every output channel (all 64 channels from the 64 filters), and that this constant differs from channel to channel but is the same for all 224x224 elements of a single channel.

Is that correct? And if yes, is there a special reason why that makes more sense than adding a different number to each of the 64x224x224 output values?
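For reference, the arithmetic in the question can be checked with a short sketch. The helper `conv2d_param_count` is a hypothetical name made up for illustration, not a Keras function; it simply assumes one bias per filter, which is what the 1792 reported by summary() suggests.

```python
def conv2d_param_count(in_channels, out_channels, kernel_h, kernel_w, use_bias=True):
    """Count trainable parameters of a 2D convolution, assuming one bias per filter."""
    # Each filter spans all input channels and the full kernel window.
    weights = out_channels * in_channels * kernel_h * kernel_w
    # Assumption: the bias is a single scalar per filter (per output channel).
    biases = out_channels if use_bias else 0
    return weights + biases


total = conv2d_param_count(3, 64, 3, 3)
without_bias = conv2d_param_count(3, 64, 3, 3, use_bias=False)
print(total, without_bias, total - without_bias)  # 1792 1728 64
```

The difference of 64 matches the number of filters, which is consistent with a single bias value per output channel.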

In practice, biases are associated with filters (one bias per filter), and not with the individual elements of the activation maps.

But if we go your way, then after convolving the incoming data with the weights, we will need another layer of biases matching the output dimensions, which will need to be added to the convolution to produce the activation maps/output channels.

Now, what’s wrong with that?

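a) Your model gets far too many free parameters. Instead of 64 biases you would have 64x224x224 = 3,211,264 of them, none of them shared across positions. That pushes the model toward the overfitting (high-variance) end of the bias-variance tradeoff: the independent biases can memorise position-specific quirks of the training data instead of letting the shared filters learn features that generalize. (Note that the "bias" of a layer and the "bias" in bias-variance are different things.)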

b) Training time (and memory requirements) will increase. Now we need to adjust all these biases along with the weights. Remember that every additional parameter needs its own gradient computed and stored during the backward pass, plus any per-parameter optimizer state.

c) This is my guess: weight sharing becomes difficult. An activation map is the result of convolving the same set of filter weights across the incoming layer. So the weights are shared, and they need to settle on one set of values that generalizes over all the training data. I think it will be difficult for them to do this (i.e. it will take a long time to converge, or they may not find the right set of values at all) if the upstream layer has these independent biases.
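To put numbers on the points above, here is a rough sketch comparing the two schemes for the layer in the question. `bias_count` is a hypothetical helper written for this comparison, and the 224x224 output size is taken from the question.

```python
def bias_count(out_channels, out_h, out_w, per_position=False):
    """Number of bias parameters for one conv layer under the two schemes."""
    if per_position:
        # One independent bias for every element of every activation map.
        return out_channels * out_h * out_w
    # Usual scheme: one bias per filter, broadcast over the whole map.
    return out_channels


shared = bias_count(64, 224, 224)
independent = bias_count(64, 224, 224, per_position=True)
filter_weights = 64 * 3 * 3 * 3  # weights of the same layer, from the question

print(shared)       # 64
print(independent)  # 3211264
print(independent > 1000 * filter_weights)  # True
```

With per-position biases, the bias terms alone would outnumber the layer's filter weights by more than a factor of a thousand.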

Otherwise, biases increase the representational power of neural networks. For example, what should your network output be if all inputs are 0? Without biases, a convolution can only produce 0 there; the biases take care of that case.
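The all-zeros case can be illustrated with a minimal sketch of a single output value of a convolution. The function `conv_at` and the kernel values are made up for illustration; no activation function is applied.

```python
def conv_at(patch, kernel, bias):
    """One output value: sum of elementwise products of patch and kernel, plus bias."""
    total = 0.0
    for patch_row, kernel_row in zip(patch, kernel):
        for x, w in zip(patch_row, kernel_row):
            total += x * w
    return total + bias


# Arbitrary example weights (hypothetical values).
kernel = [[0.2, -0.5, 0.1],
          [0.7, 0.0, -0.3],
          [0.4, 0.9, -0.8]]
zero_patch = [[0.0] * 3 for _ in range(3)]

print(conv_at(zero_patch, kernel, bias=0.0))  # 0.0
print(conv_at(zero_patch, kernel, bias=1.5))  # 1.5
```

Whatever the weights are, a zero input yields exactly the bias, so the bias is the only way for the filter to produce a non-zero response there.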

You can check this page and look at the interactive visualisations as well. When the bias is changed, the input-output curve is shifted while its shape stays the same. If you instead change the weights, the shape (the slope) of the curve itself changes, and that is not the main idea behind a bias. So there is one bias element connected to each output neuron to shift that neuron’s value. The number of biases is thus the number of output neurons (for a convolutional layer, one per filter).
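A quick way to check how bias and weight each affect the input-output curve is a single neuron with one input. The sigmoid activation and the names `neuron` and `slope` are assumptions chosen for this sketch; the linked page may use a different setup.

```python
import math


def neuron(x, w, b):
    """Single-input neuron with a sigmoid activation (assumed for illustration)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def slope(x, w, b, eps=1e-6):
    """Numerical slope of the input-output curve at x (central difference)."""
    return (neuron(x + eps, w, b) - neuron(x - eps, w, b)) / (2 * eps)


# Changing the bias shifts the curve sideways by b / w without reshaping it:
# the curve with w=2, b=3 is the w=2, b=0 curve moved 1.5 units to the left.
x = 0.3
print(math.isclose(neuron(x, 2.0, 3.0), neuron(x + 1.5, 2.0, 0.0)))  # True

# Changing the weight changes the steepness of the curve at its midpoint.
print(round(slope(0.0, 2.0, 0.0), 3))  # 0.5
print(round(slope(0.0, 4.0, 0.0), 3))  # 1.0
```

For a sigmoid the midpoint slope is w / 4, so doubling the weight doubles the steepness, while the bias only moves where the midpoint sits.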