Why not treat the activations after relu as a half-normal distribution?

In lesson 8, Jeremy makes some adjustments so that after the relu has been applied, the activations still have mean 0 and standard deviation 1.
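For reference, the adjustment in the lesson is roughly along these lines (a sketch with made-up layer sizes, not copied verbatim from the notebook): scale the weights kaiming-style by sqrt(2/m) and shift the relu output down by about 0.5 to pull the mean back toward zero.

import math
import torch

m, nh = 784, 50                              # made-up layer sizes
w1 = torch.randn(m, nh) * math.sqrt(2. / m)  # kaiming-style scaling
def relu(x): return x.clamp_min(0.) - 0.5    # shift the mean back toward 0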

Since the weights are drawn from a normal distribution, the weighted sums going into the relu are also Gaussian, and so after the relu we get the following distribution:

|Normal(mu, sigma)| with probability 0.5, and 0 with probability 0.5
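As a quick sanity check (a numpy sketch; the sample size and seed are arbitrary), applying relu to standard normal draws and sampling that mixture directly give matching statistics:

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

z = rng.standard_normal(n)
after_relu = np.maximum(z, 0.)                                # relu applied to N(0, 1) draws

coin = rng.random(n) < 0.5                                    # with probability 0.5 ...
mixture = np.where(coin, np.abs(rng.standard_normal(n)), 0.)  # ... a half-normal draw, else 0

for name, x in [("relu", after_relu), ("mixture", mixture)]:
    print(name, x.mean(), x.std(), (x == 0).mean())           # same mean, std, and share of zeros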

For mu = 0 and sigma = 1, this would have a mean and standard deviation of:
expected_std = np.sqrt(0.5 - 1 / (2 * np.pi))
expected_mean = 1 / np.sqrt(2 * np.pi)

Admittedly these come out in the same ballpark as the values Jeremy was using: about 0.584 for the standard deviation and 0.3989 for the mean.
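To spell out where those numbers come from (a sketch using the half-normal moments from the Wikipedia article linked below): the half-normal part has mean sigma * sqrt(2/pi) and second moment sigma^2, and the mixture puts weight 0.5 on it and 0.5 on zero.

import numpy as np

sigma = 1.0
p = 0.5                                                      # probability the pre-activation is positive

half_mean = sigma * np.sqrt(2 / np.pi)                       # mean of |Normal(0, sigma)|
half_second_moment = sigma ** 2                              # E[|Normal(0, sigma)|^2]

expected_mean = p * half_mean                                # 1 / sqrt(2*pi), ~0.3989
expected_var = p * half_second_moment - expected_mean ** 2   # 0.5 - 1/(2*pi)
expected_std = np.sqrt(expected_var)                         # ~0.5838

print(expected_mean, expected_std)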

I would make both of these adjustments after the relu has been applied and make no adjustment to the weights beyond the 1/sqrt(m) scaling. That way the expected mean of the result is 0 and the expected standard deviation is 1:

def relu(x): return (x.clamp_min(0.) - expected_mean) / expected_std
w1 = torch.randn(n_input, n_hidden) / np.sqrt(n_input)
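Here is a minimal end-to-end sketch of that idea (the layer sizes, batch size, and standardized input are all assumptions). With only the 1/sqrt(m) scaling on the weights and the normalized relu, the activation statistics should stay close to mean 0 and standard deviation 1 from layer to layer:

import numpy as np
import torch

n_input, n_hidden, batch = 784, 50, 10_000              # made-up sizes

expected_mean = 1 / np.sqrt(2 * np.pi)
expected_std = np.sqrt(0.5 - 1 / (2 * np.pi))

def relu(x): return (x.clamp_min(0.) - expected_mean) / expected_std

x = torch.randn(batch, n_input)                          # standardized input
w1 = torch.randn(n_input, n_hidden) / np.sqrt(n_input)
w2 = torch.randn(n_hidden, n_hidden) / np.sqrt(n_hidden)

a1 = relu(x @ w1)
a2 = relu(a1 @ w2)
print(a1.mean().item(), a1.std().item())                 # ~0, ~1
print(a2.mean().item(), a2.std().item())                 # still ~0, ~1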

See the Wikipedia article on the half-normal distribution: https://en.wikipedia.org/wiki/Half-normal_distribution, and of course you have to remember that half of the time the value is zero.

Thanks! At first I thought: why not just re-normalize the data after the relu? But after looking at this wiki page and fiddling in Excel, I see that just generates another set of negative numbers.
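For example (a toy sketch with made-up numbers), plain standardization after the relu just shifts all of the zeros to the same negative constant:

import torch

acts = torch.randn(100_000).clamp_min(0.)          # post-relu activations, all >= 0
standardized = (acts - acts.mean()) / acts.std()   # plain re-normalization
print((standardized < 0).float().mean())           # ~0.65: plenty of negative values again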

I think the point is to show that you can come up with ideas like this once you know the lower-level material. Testing still has to be done to make sure those ideas actually have an effect, and from my understanding this particular change didn't really bear fruit. Still, it is important to introduce an experimental mindset for people who want to contribute in this way.