Why you need a good init - calibrating to zero mean, unit variance after ReLU

Hi all,

I’m vaguely aware that having zero mean and unit variance for the post-ReLU values in a neural network is important. Sylvain @sgugger showed in the 02b_initializing notebook that the magic number for calibrating to unit variance can be easily derived. I started experimenting with how to achieve zero mean and unit variance together. I found something empirically close to what I wanted, but I would struggle to justify it in a mathematical sense.
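For context, here is a rough sketch of the kind of sanity check I have in mind (the exact code in 02b_initializing may differ; the shapes and the standard-normal input here are just illustrative). With the usual He/Kaiming scaling of `sqrt(2/fan_in)`, the post-ReLU standard deviation is in the right ballpark, but the mean is clearly not zero:

```python
import math
import torch

fan_in = 512
x = torch.randn(512)                                  # standard-normal input (illustrative)
w = torch.randn(fan_in, 512) * math.sqrt(2 / fan_in)  # He/Kaiming scaling
y = (x @ w).clamp(min=0)                              # linear layer followed by ReLU
print(y.mean().item(), y.std().item())                # mean is noticeably positive
```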

I let `a = torch.randn(512,512) * math.sqrt(3/512)` and tried to debias the ReLU function manually, eventually settling on `y = y.clamp(min=0) - (math.sqrt(2.85) - 1)`. It yielded a mean and standard deviation of (-0.0011456546001136303, 1.0209426134824753).
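In case it helps to see the whole thing in one place, here is how I'd reconstruct the experiment; I'm assuming `y` comes from multiplying a standard-normal input by the scaled matrix `a`, as in the notebook, since I didn't paste that line above:

```python
import math
import torch

x = torch.randn(512)                          # assumed standard-normal input, as in the notebook
a = torch.randn(512, 512) * math.sqrt(3/512)  # weights scaled by sqrt(3/fan_in)
y = x @ a                                     # pre-activation values
y = y.clamp(min=0) - (math.sqrt(2.85) - 1)    # ReLU shifted down to pull the mean back toward 0
print(y.mean().item(), y.std().item())        # roughly (0, 1) on a given draw
```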

I know this is naive. However, the more principled methods can be intimidating, so why not start naive? Does this make sense to anyone else, from a more rigorous point of view? If you start accounting for fan-in terms when calculating the unit-variance calibration number (i.e. the full He init), does this monkey-level programming start to fall apart?