In the Fully Connected notebook from lesson 8, Jeremy discusses Kaiming initialisation. In the paper, the standard deviation is given as

$$\text{std} = \sqrt{\frac{2}{(1 + a^2) \times \text{fan\_in}}}$$

I don’t understand why we multiply by sqrt(2) to correct the mean and standard deviation after the ReLU. Couldn’t we use sqrt(3) instead of sqrt(2)? When I tried sqrt(3), the standard deviation of `t = relu(lin(x_valid, w1, b1))` came out even closer to 1 than with sqrt(2). Can someone please explain this to me?
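The comparison described above can be reproduced with a minimal sketch like the one below. It is not the notebook's code: it uses only the Python standard library, standard-normal random inputs stand in for the normalised `x_valid`, the bias is assumed to be zero, and the names `relu_layer_stats`, `fan_in`, `n_hidden` and `n_samples` are hypothetical. Besides the mean and standard deviation, it also reports the mean of the squared activations, since the derivation in the paper is stated in terms of that second moment rather than the standard deviation of the ReLU output.

```python
import math
import random


def relu_layer_stats(numerator, fan_in=100, n_hidden=50, n_samples=500, seed=0):
    """Return (mean, std, mean of squares) of relu(x @ w) for random x and w.

    Weights are drawn with std = sqrt(numerator / fan_in), so numerator=2
    corresponds to the formula above with a = 0 (plain ReLU).
    """
    rng = random.Random(seed)
    w_std = math.sqrt(numerator / fan_in)
    # One row of weights per hidden unit (bias assumed to be zero).
    w = [[rng.gauss(0.0, w_std) for _ in range(fan_in)] for _ in range(n_hidden)]

    acts = []
    for _ in range(n_samples):
        # Assumption: inputs are roughly zero-mean, unit-variance.
        x = [rng.gauss(0.0, 1.0) for _ in range(fan_in)]
        for row in w:
            pre = sum(wi * xi for wi, xi in zip(row, x))
            acts.append(max(pre, 0.0))  # ReLU

    n = len(acts)
    mean = sum(acts) / n
    mean_sq = sum(a * a for a in acts) / n
    std = math.sqrt(mean_sq - mean * mean)
    return mean, std, mean_sq


results = {}
for numerator in (2, 3):
    results[numerator] = relu_layer_stats(numerator)
    mean, std, mean_sq = results[numerator]
    print(f"numerator={numerator}: mean={mean:.3f} std={std:.3f} mean_sq={mean_sq:.3f}")
```

Comparing the `std` and `mean_sq` columns for the two numerators shows which statistic each choice brings close to 1 under these assumptions.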

Thanks.