Any brilliant approaches to wrapping one's head around foundations of statistics?

Loving your reasoning, @t-v. Thank you!

I think there might be a couple of reasons to use mean 0 and variance 1.

  1. Central limit theorem (taken from Wikipedia):
    The theorem states that the average of many independent and identically distributed random variables with finite variance tends towards a normal distribution, irrespective of the distribution followed by the original random variables. Formally, let $X_1, X_2, \dots$ be independent random variables with mean $\mu$ and variance $\sigma^2 > 0$. Then the sequence of random variables

$$Z_n = \frac{\sum_{i=1}^{n} (X_i - \mu)}{\sigma \sqrt{n}}$$

converges in distribution to a standard normal random variable.
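As a quick numerical sanity check (a sketch in NumPy; the Exponential(1) distribution, sample size, and trial count here are arbitrary choices of mine, not from the theorem), the standardized sum $Z_n$ does indeed look standard normal even when the underlying distribution is very non-normal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from a decidedly non-normal distribution: Exponential(1),
# which has mean 1 and variance 1.
n = 1000          # samples per Z_n
trials = 10000    # number of Z_n draws
mu, sigma = 1.0, 1.0

x = rng.exponential(scale=1.0, size=(trials, n))
z = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

# Z_n should be approximately standard normal: mean ~ 0, std ~ 1.
print(round(z.mean(), 2), round(z.std(), 2))
```

Even though each $X_i$ is heavily skewed, the histogram of `z` is already close to the standard normal bell curve at $n = 1000$.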

  2. Continuous probability distributions
    Usually DL uses continuous functions, and for a function to serve as a continuous distribution function it should satisfy the following conditions:

If the outcome space of a random variable $X$ is the set of real numbers $\mathbb{R}$ or a subset thereof, then a function called the cumulative distribution function (or cdf) $F$ exists, defined by $F(x) = P(X \leq x)$. That is, $F(x)$ returns the probability that $X$ will be less than or equal to $x$.

The cdf necessarily satisfies the following properties (taken from Wikipedia):

  1. $F$ is a monotonically non-decreasing, right-continuous function;
  2. $\lim_{x \to -\infty} F(x) = 0$;
  3. $\lim_{x \to \infty} F(x) = 1$.

If $F$ is absolutely continuous, i.e., its derivative exists and integrating the derivative gives us the cdf back again, then the random variable $X$ is said to have a probability density function (or pdf, or simply density) $f(x) = \frac{dF(x)}{dx}$.
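These cdf properties are easy to check empirically. A minimal NumPy sketch, using the empirical cdf of standard-normal samples (the sample size and grid are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)

def ecdf(x):
    # Empirical cdf: fraction of samples <= x, estimating P(X <= x).
    return np.mean(samples <= x)

xs = np.linspace(-5, 5, 201)
F = np.array([ecdf(x) for x in xs])

# Property 1: monotonically non-decreasing.
assert np.all(np.diff(F) >= 0)
# Properties 2 and 3: F approaches 0 on the left and 1 on the right.
print(F[0], F[-1])

# The density is the derivative of the cdf: f(x) = dF(x)/dx.
f = np.gradient(F, xs)
# At x = 0 the standard normal density is 1/sqrt(2*pi) ≈ 0.3989,
# and the numerical derivative lands close to that.
print(round(f[100], 2))
```

The empirical cdf is a step function (hence only right-continuous), which matches property 1 exactly.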

Related to SELU, a nice explanation is given here. Mainly, it is self-normalizing …
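To illustrate the self-normalizing behaviour, here is a small NumPy sketch (the SELU constants are the published ones; the width, depth, and variance-$1/n$ weight init are my own arbitrary choices for the demo): pushing activations through many SELU layers keeps the mean near 0 and the std near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# SELU constants (lambda and alpha) from the original SELU formulation.
alpha = 1.6732632423543772
scale = 1.0507009873554805

def selu(x):
    # scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise.
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))

# Push standard-normal activations through many SELU layers with
# variance-1/n random weights; mean/std should stay near 0/1.
n, depth = 1000, 30
x = rng.standard_normal(n)
for _ in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    x = selu(W @ x)

print(round(x.mean(), 1), round(x.std(), 1))
```

Mean 0 / variance 1 is a stable fixed point of the SELU layer map, so the statistics don't drift even at depth 30.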

Really? Is this so? I've actually been dwelling on this for a while, thinking there's some reason I'm missing. Could well be an arbitrary choice, why not.

I believe the short version is that a random matrix scaled so that multiplication preserves std 1 has eigenvalues with magnitude around 1. You don't want to be multiplying by a matrix whose eigenvalues are consistently larger or smaller than 1 in magnitude, since otherwise you get exponentially increasing or decreasing activations. For deep nets, that's a big problem! (e.g. floating-point accuracy decreases for very large or very small numbers. IIRC lesson 1 of the computational linear algebra course shows the details of this.)
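A minimal NumPy sketch of that effect (the width, depth, and the 1.1/0.9 scale factors are arbitrary choices of mine): repeatedly multiplying by random matrices whose scale is consistently above or below the "keeps std 1" scale makes the activations explode or vanish exponentially, while the properly scaled version stays stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50
x = rng.standard_normal(n)

# Entries with std 1/sqrt(n) keep the std of W @ x near the std of x
# (cf. Xavier/He-style initialization).
good, big, small = x.copy(), x.copy(), x.copy()
for _ in range(depth):
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    good = W @ good
    big = (1.1 * W) @ big      # consistently above 1 -> explodes
    small = (0.9 * W) @ small  # consistently below 1 -> vanishes

print(f"std after {depth} layers: good={good.std():.2f}, "
      f"big={big.std():.2e}, small={small.std():.2e}")
```

Even a 10% scale mismatch per layer compounds to roughly $1.1^{50} \approx 117\times$ (or $0.9^{50} \approx 0.005\times$) after 50 layers, which is exactly the exploding/vanishing problem described above.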


The StatQuest YouTube videos are quite helpful. Hope that helps.


Indeed, short and to the point. Thank you, @kabir


Thank you for sharing. Simple and useful.


Nice article.