Any brilliant approaches to wrapping one's head around foundations of statistics?

My 2 cents:
Perhaps it's not only practice, but studying stats in another, "more exciting" context that motivates you to get it.
For example, curiosity about reinforcement learning helped me get into Bayesian inference. Perhaps a top-down approach worked in this case.

By the way, this series of courses on Coursera from the University of Amsterdam was surprisingly intuitive for me:
Basic Statistics
Inferential statistics
Quantitative Methods


Thank you, I think this book is super useful.

From the book :

Think Stats is based on the idea that Bayesian methods are too important to postpone. By taking advantage of the PMF and CDF libraries, it is possible for beginners to learn the concepts and solve challenging problems.


As a beginner, what works best for me is taking DataCamp courses on probability and statistics (more stats courses in Python to come in April-May), because it is easier to understand the concepts when you can play around with code rather than just math and formulas.


I like the idea of some underlying principles that "rule them all" here: https://lindeloev.github.io/tests-as-linear. I just saw this retweeted by Jeremy, so you are probably aware of the link. But I'm sharing it more for the idea that maybe some things follow an underlying, common principle, and learning could be simplified along those lines.


Maybe a stats course itself isn't the best way to develop the intuition around it. I've been recommending David MacKay's Information Theory, Inference, and Learning Algorithms, and most people liked it.
The presentation of its topics and the style of the book fascinate me every time I read it. To me it's one of the most brilliant books on applying probability/statistics.

Best regards

Thomas


Encyclopedia of Statistics in Behavioral Science, Brian S. Everitt, David C. Howell… I think it's a must-have book…


For me, a visual and/or code-first approach is key to understanding concepts in math.
This is a great tool to have: https://seeing-theory.brown.edu/index.html

Here are some more materials:


There's also a MOOC version of the MIT course available on edX:

https://courses.edx.org/courses/course-v1:MITx+6.431x+3T2018/course/

The archived version should be available for browsing, but I can't double-check, as I'm already enrolled in the archived version.

It's basically the same as the YouTube videos above, but comes with some quizzes and assignments.

I found it really helpful, and the instructors are really good!


Would love to know which resource works for you, Stas. :slight_smile: Please share your experience afterward.

Edited: I think I may start reading the Think Stats book soon too.


Talking about books, have you tried this one? https://www.amazon.com/Introductory-Statistics-Analytics-Resampling-Perspective/dp/1118881354

On the other hand, maybe (I'm just thinking aloud) a course in econometrics would help. I'm not an econometrician, but econometrics is basically probability and statistics (and economics) applied to real-world problems, so maybe seeing how these people apply statistics in real life would help you solidify some concepts and gain a more intuitive understanding.


Coursera has recently released the Statistics with Python Specialization, which goes through both the theoretical foundations and coding in Python.


This thread continues to accumulate suggestions, so first of all I turned the first post into a wiki - please add your resources there, as there are just too many for me even to look at, let alone utilize/try all of them. In addition to the link, please add a brief sentence on why you're recommending the resource.

I have already received what I asked for in the first few replies in this thread, and I have summarized it in the first post. Now it's just a matter of doing it one concept at a time.

But I'm sure others will find benefit in other resources, so feel free to share more of your inspirations.

And thank you all for all of your thoughtful and caring contributions.

I'm benefiting from listening to this book on audio: https://www.audible.com/pd/Naked-Statistics-Audiobook/B00CH3UI28

In an audiobook there's no chance of arcane notation :wink: Just intuition and examples…

Once I finish, if I think it's good, I'll add it to the wiki post.


Stas, I am not sure what you mean when you say you can’t wrap your head around statistics. I suspect that you understand stats much better than you think you do. You have a great math background and you have taken the MIT course https://www.edx.org/course/introduction-probability-science-mitx-6-041x-2, which provides a fantastic introduction to the subject. If you have taken this course and gotten a good grade, then you definitely understand statistics!!!

Statistics is all about understanding the properties of distributions. Important basic distributions to know are: Gaussian, Student's t, uniform, Poisson, Bernoulli, and binomial. You can gain an understanding of these distributions by studying their properties, such as the mean, median, mode, and standard deviation; these properties are accessible through differential and integral calculus. The mean measures the average value; the standard deviation measures the spread or dispersion of the distribution; the median measures the midpoint of the distribution, so that there is an equal number of samples (or equal probability mass) below and above it; and the mode is the most frequent value of the distribution. Many other distributions arise naturally, such as the beta, gamma, chi-squared, F, etc. But once you understand how to study distributions, these are easy to add to your repertoire as needed.
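If it helps to make this concrete, here is a minimal sketch (my own illustration, not from any of the courses above) that samples from a few of the distributions mentioned and estimates their mean, median, mode, and standard deviation with NumPy; the mode is estimated from a histogram, which is only a rough empirical stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)

samples = {
    "gaussian": rng.normal(loc=0.0, scale=1.0, size=100_000),
    "uniform": rng.uniform(low=-1.0, high=1.0, size=100_000),
    "poisson": rng.poisson(lam=3.0, size=100_000),
}

for name, x in samples.items():
    # rough empirical mode: center of the fullest histogram bin
    counts, edges = np.histogram(x, bins=50)
    i = np.argmax(counts)
    mode = 0.5 * (edges[i] + edges[i + 1])
    print(f"{name:9s} mean={x.mean():6.3f}  median={np.median(x):6.3f}  "
          f"mode~{mode:6.3f}  std={x.std():6.3f}")
```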

Thank you for your prompt, @jcatanza.

I sat with it for a while - I think the problem is quite simple. The first time I studied stats, 25 years ago, it was taught out of context with no application, and I have never used it in my work since. I had no interest in understanding it beyond passing the exam - we knew we had almost no use for it in the following courses, and there was no luxury of free time. The second time around, a year ago, I re-attempted to study it after hearing that it's important for ML, but again I had no context for it, so I think I did the same thing - understood enough to solve problems, but didn't quite get the application of it.

So, third time's the charm. Now I'm approaching it in a very different way: as I encounter concepts I either don't understand or have no intuition for, first of all they must be in the real context of a problem I'm trying to solve, and then, as others have kindly shared in this thread, I code it and look at the numbers and how they impact the outcome of the problem I'm trying to solve. That's why I'm hesitant to take any more courses, other than studying some specific sections that I need, since I risk encountering other things I don't understand and needing to step back even further. It's probably a very inefficient way to study something, but without being excited about studying stats on its own, I think it's currently the best approach for me.

Getting intuition is the difficult part. For example, right now I'm trying to get an intuition for why we want a variance of 1 for stable NN behavior, and not, say, 0.9 or 1.5, since all of those would still be well-behaved variations of a normal distribution and each layer gets scaled by the same proportions. So other than the neatness of var == std == 1, I'm still unsure why 1 is the number we are after. E.g., it's quite clear why we center the distribution around a mean of 0, but why std = 1 is unclear to me.

Another example: currently I also don't understand why we consider a uniform distribution to init the weights instead of a normal distribution. I know the foundations of both of those distributions, but I don't understand why I'd choose uniform over normal in the context of layer init.

Hi Stas,

As for why we consider a uniform distribution rather than a normal distribution to initialize weights, my intuition is that a uniform distribution is bounded, whereas a normal distribution is not; we’d prefer the weights to be bounded.
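To put a number on that intuition (a hypothetical illustration, not anything from the posts above): compare a uniform init with a normal init of the same variance; the uniform weights never leave their range, while a few of the normal weights land well outside it.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
bound = 0.1                               # uniform weights drawn from [-bound, bound]
std = bound / np.sqrt(3)                  # normal with the same variance (bound**2 / 3)

w_uniform = rng.uniform(-bound, bound, size=n)
w_normal = rng.normal(0.0, std, size=n)

print("max |w|, uniform:", np.abs(w_uniform).max())       # never exceeds 0.1
print("max |w|, normal: ", np.abs(w_normal).max())         # several stds out, i.e. > 0.1
print("fraction of normal weights outside [-0.1, 0.1]:",
      np.mean(np.abs(w_normal) > bound))                   # roughly 8%
```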

By the way, if you want to convince yourself that you really do understand statistics, read and work through the excellent, well-written post on Kaiming initialization by @PierreO (Pierre Ouannes).

By examining equation (4) in that post, we can answer your question about why the variance of the weights in layer $l$ is set equal to $\frac{2}{n_l}$. We see that if each term in the product $\prod_{l=2}^{L} \frac{1}{2} n_l \mathrm{Var}[w_l]$ is equal to one, then the variance of layer $L$ is well behaved. If $\mathrm{Var}[w_l]$ is smaller than $\frac{2}{n_l}$ and the number of layers $L$ is large, the variance of the last layer $\mathrm{Var}[y_L]$ will shrink towards zero (the "vanishing gradients" problem). On the other hand, if $\mathrm{Var}[w_l]$ exceeds $\frac{2}{n_l}$, the variance of the last layer $\mathrm{Var}[y_L]$ will grow large (the "exploding gradients" problem). You will see that the only way the variance of the activations in each layer remains well behaved, even when there are a large number of layers, is to set $\mathrm{Var}[w_l]$ equal to $\frac{2}{n_l}$!
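Here is a small simulation of those three regimes (my own sketch, not from Pierre's post): a deep stack of random ReLU layers with Var[w] below, at, and above 2/n. The layer width, depth, and batch size are arbitrary choices just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50                          # layer width and number of layers


def final_std(scale):
    """Push a batch through `depth` random ReLU layers with
    Var[w] = scale * 2/n and return the std of the last activation."""
    x = rng.normal(size=(1024, n))
    for _ in range(depth):
        w = rng.normal(0.0, np.sqrt(scale * 2.0 / n), size=(n, n))
        x = np.maximum(x @ w, 0.0)          # linear layer followed by ReLU
    return x.std()


print("Var[w] = 0.5 * 2/n ->", final_std(0.5))   # shrinks towards zero   ("vanishing")
print("Var[w] = 1.0 * 2/n ->", final_std(1.0))   # stays well behaved
print("Var[w] = 1.5 * 2/n ->", final_std(1.5))   # grows very large       ("exploding")
```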


Well recently I posted the exact same question in another thread, which I will copy-paste here:

I kept thinking about this and came to the following conclusions, which are purely speculative:

  • There is some probability that the "mean=0, std=1" thing was inherited from another area of statistics, where there was some theoretical/practical justification for it (maybe some linear models need it?). There is also a chance that it's done this way because it's the easiest thing to do. I mean, when you normalize, the minimum you are required to do for it to be a proper normalization is subtract the mean and divide by the std, which gives mean=0, std=1. You can't do less than that. If, for example, you wanted mean=1, you would have to subtract the mean and then add 1, which means you have to write more and justify why you arbitrarily added that 1.

  • Despite what I said in my original question, now I'm not convinced that you really need to normalize to mean 0, std 1, but rather that it's just a convenient starting point. Look at BatchNorm, for example. Yes, it starts by normalizing to mean=0, std=1, but it gives the model the flexibility to learn to normalize to whatever mean and std it finds convenient. You could leave your data unnormalized, put a BatchNorm layer directly before your first conv layer, and let the network itself handle the normalization of your data; maybe it stays with mean=0 and std=1, or maybe, for that particular combination of dataset and network architecture, it figures out that the best thing is mean=1.3, std=2.75 (see the sketch below).
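Here is a rough PyTorch sketch of that second idea (my own illustration): a BatchNorm layer with learnable affine parameters in front of the first conv, fed unnormalized images, so that training can pick whatever input mean/std it prefers. The architecture and sizes are made up just to show the wiring.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.BatchNorm2d(3),                  # normalizes the raw input; gamma/beta are learned
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

x = torch.rand(8, 3, 32, 32) * 255.0    # deliberately unnormalized input
print(model(x).shape)                   # torch.Size([8, 10])

# After training, the input BatchNorm's weight (gamma) and bias (beta) show what
# scale and shift the network settled on for its "normalized" input.
print(model[0].weight.data, model[0].bias.data)
```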

Yes, thank you, @jcatanza - and @PierreO! I did read through that great article - I got lost halfway through and will need to revisit it as my understanding improves. But the bottom line is that the math for that particular problem of weight init shows what we need.

But when I was writing that note about trying to understand why var == 1, I was actually thinking about BatchNorm, which I'm trying to sort out right now. And @axelstram's comment addresses that head on! So because we use the variance in our normalization function and divide by the std, it's most convenient to keep it close to 1 so as not to change the numbers by much. And I love your suggestion to try a learnable BN layer first, see what it does, and let the NN tell us what it thinks those parameters should be. Exciting!

Thank you @jcatanza and @axelstram!

That makes sense in theory; I will try to run the numbers side by side with a small subset and get a better feel for it, one group of numbers at a time.

Thank you, @jcatanza!

I thought I heard Jeremy (or someone else) say that it's because the uniform distribution does not have a peak at 0, so it's more about the relation of how much mass is, say, in [-0.5, 0.5] vs. [-1, -0.5] ∪ [0.5, 1], rather than what's going on outside [-1, 1] (if you take the uniform distribution on that latter interval and compare it to a normal distribution with the same variance).
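A quick check of that mass comparison (my own numbers, assuming a uniform on [-1, 1] versus a normal with the same variance, i.e. std = 1/sqrt(3)):

```python
import numpy as np
from scipy import stats

uniform = stats.uniform(loc=-1, scale=2)            # uniform on [-1, 1]
normal = stats.norm(loc=0, scale=1 / np.sqrt(3))    # same variance, 1/3

for name, dist in [("uniform", uniform), ("normal ", normal)]:
    near_zero = dist.cdf(0.5) - dist.cdf(-0.5)      # mass in [-0.5, 0.5]
    outside = 1 - (dist.cdf(1.0) - dist.cdf(-1.0))  # mass outside [-1, 1]
    print(f"{name}: P(|x| < 0.5) = {near_zero:.3f}, P(|x| > 1) = {outside:.3f}")
```

So the uniform keeps all of its mass inside [-1, 1] but puts relatively less of it right around 0 than the equal-variance normal does, which matches the "no peak at 0" point.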

I'd think that mean=0, std=1 is arbitrary; it's just that we want something stable across layers without creating too many "dead" activations (e.g. the SELU doesn't have 1 as the target std - I think I have to retract that after re-reading the paper). So we want std(l[i]) = std(l[i-1]), and because we have this degree of freedom at the beginning, we just arbitrarily standardize on 1. That might be the hardest part: to actually conclude that something doesn't have a reason when you're trying to understand what the reason is.

Best regards

Thomas
