In Lesson 2, the two consecutive sections “Interpreting Models - P value” and “Null Hypothesis Significance Testing” from 45:50-1:02:48 are, in my opinion, very important not only for machine learning, but for statistics, science, and metascience (the science of how best to do science) as well. I want to use this post to start a conversation about how the practices within machine learning should and should not impact general statistical and scientific practice. I would love to start a study/discussion group about this if there are enough people interested.
I’ll begin with a paper that @rachel and @jeremy have both mentioned, in the Computational Linear Algebra and the new Deep Learning for Coders courses respectively: Leo Breiman’s famous paper “Statistical Modeling: The Two Cultures.” In it, Breiman contrasts two goals of statistics: Explanation and Prediction. Explanation represents how most science is currently conducted, using simple, interpretable models (e.g., linear and logistic regression, Cox models) that are meant to capture the exact relationship by which Nature turns inputs into outputs. The quality of these models is measured by how well they fit a dataset (e.g., how well they minimize residuals). In contrast, Prediction deals with the kinds of models machine learners know and love (e.g., neural nets, random forests), which are not meant to be a stand-in for Nature but only to accurately predict outputs from inputs, without any claim to explain Nature’s inner workings.
It is important to note that the models in the Explanation camp above, while interpretable, are often not very predictive; this tradeoff between predictive performance and interpretability is explored by Bzdok and Ioannidis in this paper.
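A toy illustration of this tradeoff (my own sketch, not from either paper): on deliberately nonlinear data, a linear fit hands you an interpretable slope but predicts poorly, while a black-box nearest-neighbour predictor does the reverse.

```python
import math
import random

random.seed(0)

# Toy data with a nonlinear ground truth: y = sin(3x) + noise.
# (Hypothetical data, chosen purely for illustration.)
def make_data(n):
    xs = [random.uniform(0.0, 2.0) for _ in range(n)]
    return [(x, math.sin(3 * x) + random.gauss(0, 0.1)) for x in xs]

train, test = make_data(200), make_data(50)

# "Explanation"-style model: ordinary least squares, y ≈ a + b*x.
# The fitted slope b is directly readable, but the model is misspecified here.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
a = my - b * mx

# "Prediction"-style model: 1-nearest-neighbour. No coefficients to
# interpret, but free to track the nonlinearity.
def knn_predict(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict):
    return sum((predict(x) - y) ** 2 for x, y in test) / len(test)

mse_linear = mse(lambda x: a + b * x)
mse_knn = mse(knn_predict)
```

On this data the black box wins on held-out error while the linear model offers an interpretable (but misleading) slope; Breiman’s point is that much of science defaults to the latter.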
Another thing Jeremy noted in this lecture and in a previous run of the course is that the reproducibility problems in science are analogous to the overfitting problem in machine learning. The basic idea: many scientific studies are underpowered (“power,” as in “statistical power,” is the probability that a test rejects the null hypothesis when a specific alternative hypothesis is true, i.e., the probability of avoiding a type II error), and this lack of statistical power largely comes from low sample sizes. Hence, when a linear/logistic regression model is run, the residuals are small, and the p-value falls below 0.05, a false positive result gets published that likely will not replicate. There are many, many more issues that have been well known in the field of metascience for decades; a good starting point for readers is Ioannidis’s famous 2005 paper, “Why Most Published Research Findings Are False.”
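To make the power point concrete, here is a quick simulation (my own toy example, not from the lecture, with an assumed true effect of 0.3 standard deviations): small studies rarely clear the significance bar even when the effect is real, which is exactly the low-power regime where published significant results are disproportionately flukes.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def estimated_power(n_per_group, effect_sd, trials=2000, crit=1.96):
    """Fraction of simulated studies with |t| > crit when the true group
    difference is effect_sd standard deviations.
    (crit=1.96 is a normal approximation to the t threshold; fine for
    illustration, slightly optimistic at tiny n.)"""
    random.seed(0)
    hits = 0
    for _ in range(trials):
        treated = [random.gauss(effect_sd, 1.0) for _ in range(n_per_group)]
        control = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(welch_t(treated, control)) > crit:
            hits += 1
    return hits / trials

# Same true effect (0.3 SD), very different power:
low = estimated_power(10, 0.3)    # badly underpowered
high = estimated_power(200, 0.3)  # respectable power
```

With 10 subjects per group the simulated power is far below the conventional 80% target, while 200 per group gets close to it; underpowered studies mostly miss true effects, and the "significant" results they do produce skew toward noise.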
There are many potential positive changes that can be made to science in the near term, and insights from machine learning may help facilitate them to a substantial degree:
- An emphasis on prediction may lead to scientific findings that are replicable and reproducible, and to scientific hypotheses that are better supported by data
- Explainable AI (XAI) tools such as Gini importance, SHAP, and Terence Parr et al.'s recently updated Stratified Impact method may help bridge the gap between Explanation and Prediction by making the highly predictive models we all know and love more interpretable
- Not strictly machine learning-related, but a recent preprint by the Many Labs group shows that better scientific transparency practices (high statistical power, preregistration, and complete methodological transparency) substantially increase reproducibility, which is sorely needed across many fields of science.
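As a small taste of the XAI idea, here is a hand-rolled permutation importance sketch (a simpler cousin of the SHAP and Gini tools mentioned above; toy data and model are my own assumptions, not from any cited work): shuffle one feature at a time and watch how much the model's held-out error rises.

```python
import random

random.seed(0)

# Toy regression data: y depends strongly on feature 0 and not at all
# on feature 1. (Hypothetical data for illustration.)
def make_data(n):
    rows = [[random.uniform(0, 1), random.uniform(0, 1)] for _ in range(n)]
    ys = [3.0 * x0 + random.gauss(0, 0.1) for x0, _ in rows]
    return rows, ys

train_X, train_y = make_data(200)
test_X, test_y = make_data(100)

# Black-box model: 1-nearest-neighbour in feature space.
def predict(x):
    i = min(range(len(train_X)),
            key=lambda j: sum((train_X[j][k] - x[k]) ** 2 for k in range(2)))
    return train_y[i]

def mse(X, y):
    return sum((predict(xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

baseline = mse(test_X, test_y)

def permutation_importance(col):
    """Rise in test error after shuffling one feature column:
    a big rise means the model relied on that feature."""
    shuffled = [row[:] for row in test_X]
    values = [row[col] for row in shuffled]
    random.shuffle(values)
    for row, v in zip(shuffled, values):
        row[col] = v
    return mse(shuffled, test_y) - baseline

importances = [permutation_importance(0), permutation_importance(1)]
```

Shuffling feature 0 wrecks the model's error while shuffling feature 1 barely matters, recovering the true structure from a black box; the tools above do this kind of attribution in far more principled ways.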
Other papers relevant to the discussion:
American Statistical Association’s statement on p-values (mentioned during the 45:50-1:02:48 section of the lecture linked at the beginning of this post)
A Dirty Dozen: Twelve P-Value Misconceptions
The practical alternative to the p-value is the correctly used p-value
I hope this post generates some discussion!