Loving your reasoning, @t-v. Thank you!
I think there might be a couple of reasons to use mean 0 and variance 1.
- Central limit theorem (taken from Wikipedia):
The theorem states that the average of many independent and identically distributed random variables with finite variance tends towards a normal distribution, irrespective of the distribution followed by the original random variables. Formally, let $X_1, X_2, \dots$ be independent random variables with mean $\mu$ and variance $\sigma^2 > 0$. Then the sequence of random variables

$$Z_n = \frac{\sum_{i=1}^{n} (X_i - \mu)}{\sigma \sqrt{n}}$$

converges in distribution to a standard normal random variable.
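To see this numerically, here is a minimal NumPy sketch (my own illustration, not part of the quoted text) that builds $Z_n$ from uniform random variables, which are far from normal, and checks that it ends up with mean ≈ 0 and std ≈ 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 10_000

# Uniform(0, 1) has mean 1/2 and variance 1/12.
mu, sigma = 0.5, np.sqrt(1 / 12)

# Each row is one realization of X_1, ..., X_n.
x = rng.uniform(0, 1, size=(trials, n))

# Z_n = (sum_i X_i - n*mu) / (sigma * sqrt(n))
z = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

print(z.mean(), z.std())  # close to 0 and 1: approximately standard normal
```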
- Continuous probability distributions
Usually DL uses continuous functions, and for a function to be used as a continuous distribution function it should satisfy the following conditions:
If the outcome space of a random variable $X$ is the set of real numbers ($\mathbb{R}$) or a subset thereof, then a function called the cumulative distribution function (or cdf) $F$ exists, defined by $F(x) = P(X \leq x)$. That is, $F(x)$ returns the probability that $X$ will be less than or equal to $x$.
The cdf necessarily satisfies the following properties (taken from Wikipedia):
- $F$ is a monotonically non-decreasing, right-continuous function;
- $\lim_{x \to -\infty} F(x) = 0$;
- $\lim_{x \to \infty} F(x) = 1$.
If $F$ is absolutely continuous, i.e. its derivative exists and integrating the derivative gives us the cdf back again, then the random variable $X$ is said to have a probability density function, or pdf, or simply density $f(x) = \frac{dF(x)}{dx}$.
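As a quick sanity check (again my own sketch, using SciPy's standard normal as the example distribution), all three cdf properties and the pdf/cdf relation can be verified numerically:

```python
import numpy as np
from scipy.stats import norm

xs = np.linspace(-10, 10, 1001)
cdf = norm.cdf(xs)

assert np.all(np.diff(cdf) >= 0)        # F is monotonically non-decreasing
assert np.isclose(norm.cdf(-1e6), 0.0)  # F(x) -> 0 as x -> -inf
assert np.isclose(norm.cdf(1e6), 1.0)   # F(x) -> 1 as x -> +inf

# The pdf is the derivative of the cdf: compare norm.pdf
# with a finite-difference derivative of norm.cdf.
f_numeric = np.gradient(cdf, xs)
assert np.allclose(f_numeric, norm.pdf(xs), atol=1e-3)
print("all cdf/pdf properties hold")
```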
Related to SELU, a nice explanation is given here. Mainly, it is self-normalizing …
Really? Is this so? Because I've actually been dwelling on this for a while now, thinking there is some reason I'm missing. Could well be, why not - an arbitrary choice.
I believe the short version is that the typical eigenvalue magnitude of a suitably scaled random matrix is about 1, which is what keeps the activation std at 1. You don't want to be multiplying repeatedly by a matrix whose eigenvalues are consistently larger or smaller than 1 in magnitude, since otherwise you get exponentially increasing or decreasing activations. For deep nets, that's a big problem! (e.g. floating-point accuracy decreases for very large or very small numbers; IIRC lesson 1 of the computational linear algebra course shows the details of this.)
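Here is a quick sketch of that effect (my own toy demo in plain PyTorch, not from the lesson): push a unit-std input through a stack of random linear maps and watch the activation scale under three weight scalings. Only the $1/\sqrt{n}$ scaling, which keeps the matrix's typical eigenvalue magnitude near 1, keeps the std near 1:

```python
import torch

torch.manual_seed(0)
n, depth = 512, 50

def activation_std(scale):
    x = torch.randn(n)  # input with mean 0, std 1
    for _ in range(depth):
        w = torch.randn(n, n) * scale  # random weight matrix, entries with std = scale
        x = w @ x
    return x.std().item()

print(activation_std(1.0))        # std-1 entries: activations explode to inf/nan
print(activation_std(1 / n))      # too small: activations vanish towards 0
print(activation_std(n ** -0.5))  # 1/sqrt(n) (Xavier-style): std stays near 1
```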
The StatQuest YouTube videos are quite helpful. Hope this helps.
Thank you for sharing. Simple and useful.
Nice article.