Loving your reasoning, @t-v. Thank you!
I think there might be a couple of reasons to use mean 0 and variance 1.
- Central limit theorem (taken from Wikipedia):
The theorem states that the average of many independent and identically distributed random variables with finite variance tends towards a normal distribution, irrespective of the distribution followed by the original random variables. Formally, let $X_1, X_2, \dots$ be independent random variables with mean $\mu$ and variance $\sigma^2 > 0$. Then the sequence of random variables

$$Z_n = \frac{\sum_{i=1}^{n} (X_i - \mu)}{\sigma \sqrt{n}}$$

converges in distribution to a standard normal random variable.
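To see this numerically, here is a minimal NumPy sketch (my own illustration, not part of the quoted text) that builds $Z_n$ from uniform random variables, which are far from normal, and checks that it ends up with mean ≈ 0 and std ≈ 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1_000, 10_000

# Uniform(0, 1) has mean 1/2 and variance 1/12.
mu, sigma = 0.5, np.sqrt(1 / 12)

# Each row is one realization of X_1, ..., X_n.
x = rng.uniform(0, 1, size=(trials, n))

# Z_n = (sum_i X_i - n*mu) / (sigma * sqrt(n))
z = (x.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

print(z.mean(), z.std())  # close to 0 and 1: approximately standard normal
```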
- Continuous probability distributions
Usually DL uses continuous functions, and for a function to be used as a continuous distribution function it should satisfy the following conditions:
If the outcome space of a random variable $X$ is the set of real numbers ($\mathbb{R}$) or a subset thereof, then a function called the cumulative distribution function (or cdf) $F$ exists, defined by $F(x) = P(X \leq x)$. That is, $F(x)$ returns the probability that $X$ will be less than or equal to $x$.
The cdf necessarily satisfies the following properties (taken from Wikipedia):
- $F$ is a monotonically non-decreasing, right-continuous function;
- $\lim_{x \to -\infty} F(x) = 0$;
- $\lim_{x \to \infty} F(x) = 1$.
If $F$ is absolutely continuous, i.e. its derivative exists and integrating the derivative gives us the cdf back again, then the random variable $X$ is said to have a probability density function, or pdf, or simply density $f(x) = \frac{dF(x)}{dx}$.
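As a quick sanity check (again my own sketch, using SciPy's standard normal as the example distribution), all three cdf properties and the pdf/cdf relation can be verified numerically:

```python
import numpy as np
from scipy.stats import norm

xs = np.linspace(-10, 10, 1001)
cdf = norm.cdf(xs)

assert np.all(np.diff(cdf) >= 0)        # F is monotonically non-decreasing
assert np.isclose(norm.cdf(-1e6), 0.0)  # F(x) -> 0 as x -> -inf
assert np.isclose(norm.cdf(1e6), 1.0)   # F(x) -> 1 as x -> +inf

# The pdf is the derivative of the cdf: compare norm.pdf
# with a finite-difference derivative of norm.cdf.
f_numeric = np.gradient(cdf, xs)
assert np.allclose(f_numeric, norm.pdf(xs), atol=1e-3)
print("all cdf/pdf properties hold")
```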
Related to SELU, a nice explanation is given here. Mainly, it is self-normalizing …
Really? Is this so? Because I've actually been dwelling on this for a while now, thinking there is some reason I'm missing. Could well be, why not - an arbitrary choice.
I believe the short version is that the typical eigenvalue magnitude of a suitably scaled random matrix is about 1, which is what keeps the activation std at 1. You don't want to be multiplying repeatedly by a matrix whose eigenvalues are consistently larger or smaller than 1 in magnitude, since otherwise you get exponentially increasing or decreasing activations. For deep nets, that's a big problem! (e.g. floating-point accuracy decreases for very large or very small numbers; IIRC lesson 1 of the computational linear algebra course shows the details of this.)
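Here is a quick sketch of that effect (my own toy demo in plain PyTorch, not from the lesson): push a unit-std input through a stack of random linear maps and watch the activation scale under three weight scalings. Only the $1/\sqrt{n}$ scaling, which keeps the matrix's typical eigenvalue magnitude near 1, keeps the std near 1:

```python
import torch

torch.manual_seed(0)
n, depth = 512, 50

def activation_std(scale):
    x = torch.randn(n)  # input with mean 0, std 1
    for _ in range(depth):
        w = torch.randn(n, n) * scale  # random weight matrix, entries with std = scale
        x = w @ x
    return x.std().item()

print(activation_std(1.0))        # std-1 entries: activations explode to inf/nan
print(activation_std(1 / n))      # too small: activations vanish towards 0
print(activation_std(n ** -0.5))  # 1/sqrt(n) (Xavier-style): std stays near 1
```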
The StatQuest YouTube videos are quite helpful. Hope this helps.
Thank you for sharing. Simple and useful.
Nice article.