Generally, a CNN (convolutional neural network) is composed of two parts:

convolutional layers

fully connected layers

My question is:
should the initialization method be the same for both parts of a CNN (convolutional layers + fully connected layers)?

In my opinion,
there may be one initialization method that works well for the convolutional layers,
and a different one that works well for the fully connected layers.
(I don't know if this is the case.)

Should I use the same method for both parts, or different ones?
And which initialization method is good for a CNN?

Kaiming initialization should be fine for both. It scales the distribution your initial weights are drawn from by each layer's fan-in (in_channels × kernel height × kernel width for a convolution, in_features for a linear layer), so the convolutional kernels and the linear layers will automatically draw from different distributions.
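To make that concrete, here is a minimal sketch of applying one Kaiming scheme to both layer types in PyTorch. The toy network, its layer sizes, and the helper name init_kaiming are made up for illustration; zeroing the biases is just one common choice, not something the thread prescribes.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

def init_kaiming(module):
    # Hypothetical helper: same Kaiming (He) scheme for conv and linear layers.
    # The resulting std differs per layer because fan_in differs.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)  # assumption: start biases at zero

# Hypothetical toy network for 3-channel 8x8 inputs
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # fan_in = 3 * 3 * 3 = 27
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),                   # fan_in = 4096
)
model.apply(init_kaiming)

conv, fc = model[0], model[3]
conv_std = conv.weight.std().item()
fc_std = fc.weight.std().item()

# Each empirical std should be roughly sqrt(2 / fan_in) for its own layer
print(conv_std, math.sqrt(2 / 27))
print(fc_std, math.sqrt(2 / 4096))
```

The same function handles both parts of the network, and the two layers still end up with very different weight scales because their fan-in values differ.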

Hmm… despite everything said in Lesson 9, PyTorch still uses a=math.sqrt(5) in pytorch/conv.py, as follows:

    def reset_parameters(self):
        n = self.in_channels
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)

which would give the activation outputs a variance much lower than 1 if a ReLU follows the convolutional layer. I wonder whether we need to replace a=math.sqrt(5) with a=0 manually?
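A rough way to see the size of the gap is to work out the weight variance by hand. The sketch below assumes the usual Kaiming-uniform formula, with gain = sqrt(2 / (1 + a^2)) and bound = sqrt(3) * gain / sqrt(fan_in); the function names and the 64-channel 3x3 layer are hypothetical, chosen only for the arithmetic.

```python
import math

def kaiming_uniform_bound(fan_in, a=0.0):
    # Leaky-ReLU gain for negative slope a
    gain = math.sqrt(2.0 / (1.0 + a ** 2))
    std = gain / math.sqrt(fan_in)
    # A uniform U(-bound, bound) has std = bound / sqrt(3)
    return math.sqrt(3.0) * std

def uniform_variance(bound):
    # Variance of U(-bound, bound)
    return bound ** 2 / 3.0

fan_in = 64 * 3 * 3  # hypothetical conv layer: 64 input channels, 3x3 kernel

default_var = uniform_variance(kaiming_uniform_bound(fan_in, a=math.sqrt(5)))
relu_var = uniform_variance(kaiming_uniform_bound(fan_in, a=0.0))

print(default_var * fan_in)    # about 1/3
print(relu_var * fan_in)       # about 2
print(relu_var / default_var)  # about 6
```

Under these assumptions, a=math.sqrt(5) gives a weight variance of 1 / (3 * fan_in), six times smaller than the 2 / fan_in that a=0 gives for ReLU, which is consistent with the shrinking activations described above.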

I find that when I build CNN architectures with residual connections, the default initialization does not work so well. The loss will sometimes start at 10+, or even in the hundreds, instead of around 3. Any advice on how to initialize when you are adding layers together like that?