In lesson 8, Jeremy makes some adjustments so that after relu has been applied, the activations (the input to the next layer) still have mean 0 and standard deviation 1.
Since the weights were initialized from a normal distribution, the weighted sum feeding each unit is again approximately Gaussian, and so after the relu we get the following distribution:
| Normal(mu, sigma) | with probability 0.5, and 0 with probability 0.5
For mu = 0 and sigma = 1, this would have a mean and standard deviation of:
expected_std = np.sqrt(0.5 - 1 / (2 * np.pi))   # sqrt(E[relu(x)**2] - E[relu(x)]**2), where E[relu(x)**2] = 0.5
expected_mean = 1 / np.sqrt(2 * np.pi)
Admittedly, expected_std and expected_mean come out close to what Jeremy was using: roughly 0.5838 and 0.3989422804014327.
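As a rough numerical check (not something from the lesson), drawing a large batch of standard-normal samples and clamping them at zero should reproduce these two constants; the sample size below is arbitrary.

import numpy as np

x = np.random.randn(10_000_000)    # standard-normal pre-activations
y = np.maximum(x, 0.)              # plain relu, no shift or rescale yet

print(y.mean())   # ~0.3989, i.e. 1 / sqrt(2 * pi)
print(y.std())    # ~0.5838, i.e. sqrt(0.5 - 1 / (2 * pi))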
I would make both of these adjustments after the relu has been applied, and make no adjustment to the weights beyond the division by sqrt(m). That way the expected mean of the result is 0 and the expected standard deviation is 1.
def relu(x): return (x.clamp_min(0.) - expected_mean) / expected_std   # shift and rescale right after the clamp
w1 = torch.randn(n_input, n_hidden) / np.sqrt(n_input)                 # only the 1/sqrt(m) scaling, no extra sqrt(2)
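Putting the pieces together, here is a minimal sketch of what I mean for a single hidden layer; n_input, n_hidden, the batch size, and the standard-normal fake input are made-up stand-ins for the notebook's MNIST setup rather than Jeremy's actual code.

import numpy as np
import torch

expected_mean = 1 / np.sqrt(2 * np.pi)          # post-relu mean for a Normal(0, 1) pre-activation
expected_std = np.sqrt(0.5 - 1 / (2 * np.pi))   # post-relu standard deviation for the same

def relu(x): return (x.clamp_min(0.) - expected_mean) / expected_std

n_input, n_hidden = 784, 50                     # made-up sizes standing in for the MNIST layer
w1 = torch.randn(n_input, n_hidden) / np.sqrt(n_input)   # only the 1/sqrt(m) scaling on the weights
b1 = torch.zeros(n_hidden)

x = torch.randn(10_000, n_input)                # fake input already normalised to mean 0, std 1
a1 = relu(x @ w1 + b1)
print(a1.mean(), a1.std())                      # should come out close to 0 and 1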
See the Wikipedia article on the half-normal distribution, https://en.wikipedia.org/wiki/Half-normal_distribution, and of course you have to remember that half of the time the value is zero.
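For completeness, here is the mixture calculation spelled out from the half-normal moments on that page; it is just the algebra behind expected_mean and expected_std above, not anything taken from the lesson.

import numpy as np

sigma = 1.0
half_mean = sigma * np.sqrt(2 / np.pi)     # mean of |Normal(0, sigma)|
half_var = sigma**2 * (1 - 2 / np.pi)      # variance of |Normal(0, sigma)|

# the relu output is this half-normal with probability 0.5 and exactly 0 otherwise
mix_mean = 0.5 * half_mean                            # = 1 / sqrt(2 * pi) when sigma = 1
mix_second_moment = 0.5 * (half_var + half_mean**2)   # = 0.5 * sigma**2
mix_std = np.sqrt(mix_second_moment - mix_mean**2)    # = sqrt(0.5 - 1 / (2 * pi)) when sigma = 1

print(mix_mean, mix_std)                              # ~0.3989 and ~0.5838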