In Lesson 2, the two consecutive sections “Interpreting Models - P value” and “Null Hypothesis Significance Testing” from 45:50-1:02:48 are, in my opinion, very important not only for machine learning, but for statistics, science, and metascience (the science of how best to do science) as well. I want to use this post to start a conversation about how the practices within machine learning should and should not impact general statistical and scientific practice. I would love to start a study/discussion group about this if there are enough people interested.
I’ll begin with a paper that @rachel and @jeremy have both mentioned, in the Computational Linear Algebra and the new Deep Learning for Coders courses respectively: Leo Breiman’s famous paper “Statistical Modeling: The Two Cultures.” In it, Breiman contrasts two goals of statistics: Explanation and Prediction. Explanation represents how most science is currently conducted, using simple, interpretable models (e.g., linear and logistic regression, Cox models) that are meant to capture the exact relationship by which Nature turns inputs into outputs. The quality of these models is measured by how well they fit a dataset (e.g., how well they minimize residuals). In contrast, Prediction deals with the kinds of models machine learners know and love (e.g., neural nets, random forests), which are not meant to be a stand-in for Nature but only to accurately predict outputs from inputs, without any claim to explain Nature’s inner workings.
It is important to note that the models in the Explanation camp above, while interpretable, are often not very predictive; this tradeoff between predictive performance and interpretability is explored by Bzdok and Ioannidis in this paper.
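A toy illustration of this tradeoff (my own sketch, not from either paper): on deliberately nonlinear data, a linear fit hands you an interpretable slope but predicts poorly, while a black-box nearest-neighbour predictor does the reverse.

```python
import math
import random

random.seed(0)

# Toy data with a nonlinear ground truth: y = sin(3x) + noise.
# (Hypothetical data, chosen purely for illustration.)
def make_data(n):
    xs = [random.uniform(0.0, 2.0) for _ in range(n)]
    return [(x, math.sin(3 * x) + random.gauss(0, 0.1)) for x in xs]

train, test = make_data(200), make_data(50)

# "Explanation"-style model: ordinary least squares, y ≈ a + b*x.
# The fitted slope b is directly readable, but the model is misspecified here.
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
a = my - b * mx

# "Prediction"-style model: 1-nearest-neighbour. No coefficients to
# interpret, but free to track the nonlinearity.
def knn_predict(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict):
    return sum((predict(x) - y) ** 2 for x, y in test) / len(test)

mse_linear = mse(lambda x: a + b * x)
mse_knn = mse(knn_predict)
```

On this data the black box wins on held-out error while the linear model offers an interpretable (but misleading) slope; Breiman’s point is that much of science defaults to the latter.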
Another thing Jeremy noted in this lecture and in a previous run of the course is that the reproducibility problems in science are analogous to the overfitting problem in machine learning. The basic idea: many scientific studies are underpowered (“power,” as in “statistical power,” is the probability that a test rejects the null hypothesis when a specific alternative hypothesis is true, i.e., the probability of avoiding a type II error), and this lack of statistical power largely comes from low sample sizes. Hence, when a linear/logistic regression model is run, the residuals are small, and the p-value falls below 0.05, a false positive result gets published that likely will not replicate. There are many, many more issues that have been well known in the field of metascience for decades; a good starting point for readers is Ioannidis’s famous 2005 paper, “Why Most Published Research Findings Are False.”
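To make the power point concrete, here is a quick simulation (my own toy example, not from the lecture, with an assumed true effect of 0.3 standard deviations): small studies rarely clear the significance bar even when the effect is real, which is exactly the low-power regime where published significant results are disproportionately flukes.

```python
import math
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def estimated_power(n_per_group, effect_sd, trials=2000, crit=1.96):
    """Fraction of simulated studies with |t| > crit when the true group
    difference is effect_sd standard deviations.
    (crit=1.96 is a normal approximation to the t threshold; fine for
    illustration, slightly optimistic at tiny n.)"""
    random.seed(0)
    hits = 0
    for _ in range(trials):
        treated = [random.gauss(effect_sd, 1.0) for _ in range(n_per_group)]
        control = [random.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if abs(welch_t(treated, control)) > crit:
            hits += 1
    return hits / trials

# Same true effect (0.3 SD), very different power:
low = estimated_power(10, 0.3)    # badly underpowered
high = estimated_power(200, 0.3)  # respectable power
```

With 10 subjects per group the simulated power is far below the conventional 80% target, while 200 per group gets close to it; underpowered studies mostly miss true effects, and the "significant" results they do produce skew toward noise.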
There are many potential positive changes that can be made to science in the near term, and insights from machine learning may help facilitate them to a substantial degree:
- An emphasis on prediction may lead to scientific findings that are replicable and reproducible, and to scientific hypotheses that are better supported by data
- Explainable AI (XAI) tools such as Gini importance, SHAP, and Terence Parr et al.'s recently updated Stratified Impact method may help bridge the gap between Explanation and Prediction by making the highly predictive models we all know and love more interpretable
- Not strictly machine learning-related, but a recent preprint by the Many Labs group shows that better scientific transparency practices (high statistical power, preregistration, and complete methodological transparency) substantially increase reproducibility, which is sorely needed across many fields of science.
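As a small taste of the XAI idea, here is a hand-rolled permutation importance sketch (a simpler cousin of the SHAP and Gini tools mentioned above; toy data and model are my own assumptions, not from any cited work): shuffle one feature at a time and watch how much the model's held-out error rises.

```python
import random

random.seed(0)

# Toy regression data: y depends strongly on feature 0 and not at all
# on feature 1. (Hypothetical data for illustration.)
def make_data(n):
    rows = [[random.uniform(0, 1), random.uniform(0, 1)] for _ in range(n)]
    ys = [3.0 * x0 + random.gauss(0, 0.1) for x0, _ in rows]
    return rows, ys

train_X, train_y = make_data(200)
test_X, test_y = make_data(100)

# Black-box model: 1-nearest-neighbour in feature space.
def predict(x):
    i = min(range(len(train_X)),
            key=lambda j: sum((train_X[j][k] - x[k]) ** 2 for k in range(2)))
    return train_y[i]

def mse(X, y):
    return sum((predict(xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

baseline = mse(test_X, test_y)

def permutation_importance(col):
    """Rise in test error after shuffling one feature column:
    a big rise means the model relied on that feature."""
    shuffled = [row[:] for row in test_X]
    values = [row[col] for row in shuffled]
    random.shuffle(values)
    for row, v in zip(shuffled, values):
        row[col] = v
    return mse(shuffled, test_y) - baseline

importances = [permutation_importance(0), permutation_importance(1)]
```

Shuffling feature 0 wrecks the model's error while shuffling feature 1 barely matters, recovering the true structure from a black box; the tools above do this kind of attribution in far more principled ways.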
Other papers relevant to the discussion:
American Statistical Association’s statement on p-values (mentioned during the 45:50-1:02:48 section of the lecture linked at the beginning of this post)
A Dirty Dozen: Twelve P-Value Misconceptions
The practical alternative to the p-value is the correctly used p-value
I hope this post generates some discussion!