Controversy: Causal Inference in DL (& ML)

Hello, everyone:

The Controversy

Judea Pearl's critiques of current ML have been making the rounds lately due to his new book, The Book of Why. From an interview he gave recently:

There are circles of research that continue to work on diagnosis without worrying about the causal aspects of the problem. All they want is to predict well and to diagnose well… I felt like an apostate when I developed powerful tools for prediction and diagnosis knowing already that this is merely the tip of human intelligence. If we want machines to reason about interventions (“What if we ban cigarettes?”) and introspection (“What if I had finished high school?”), we must invoke causal models. Associations are not enough—and this is a mathematical fact, not opinion.

As much as I look into what’s being done with deep learning, I see they’re all stuck there on the level of associations. Curve fitting. That sounds like sacrilege, to say that all the impressive achievements of deep learning amount to just fitting a curve to data. From the point of view of the mathematical hierarchy, no matter how skillfully you manipulate the data and what you read into the data when you manipulate it, it’s still a curve-fitting exercise, albeit complex and nontrivial.

This is a very controversial accusation, especially coming from someone whom Wikipedia describes as:

2011 winner of the ACM Turing Award, the highest distinction in computer science, “for fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning”

Resources expanding on the Controversy

The best articles I’ve gathered explaining his life’s work and how it relates to DL & ML are the following: this from Michael Nielsen (the same guy with the cool interactive Universal Approximation Theorem webpage) and this from Ferenc Huszár.

Yann LeCun responded to Mr. Pearl’s comments in a Bloomberg interview.

Other resources

Stanford’s Susan Athey has published a couple of papers using Machine Learning to aid her in understanding causality. In this paper, she outlines the way she thinks ML can help.

Resources in

The one time I saw Jeremy discuss this was in the ML course, specifically when he talked about partial dependence plots using Random Forests. (If you remember any other instance, please tell me.)

What I think

I am an economist, and thus my perspective comes from econometrics. There, people are obsessed with causality, to the point of studying unimportant subjects to the detriment of the most pressing issues, just because in the former we can say something about causality whereas in the latter it is much more difficult.

The fundamental problem of causal inference is the impossibility of observing two different states for a given system. For example, when examining the effect of education on income, comparing people who went to college with people who didn’t won’t yield a causal answer, because those two groups probably differ along many other dimensions besides their level of education. The ideal situation would be to study the life of each of us, alter our education level, and see how our income changes. Obviously, this is impossible (you cannot see what my life would have been had I not gone to high school). This non-observable situation is called a counterfactual.

Thus, the fundamental problem of Causal Inference is to estimate the counterfactual. For groups, the gold standard is a randomized trial (A/B testing, in techie lingo). However, randomized trials are not always available, and thus counterfactuals have to be estimated like any other quantity. If ML is the best tool we have to predict almost anything, is it the best tool we have to predict counterfactuals and do causal inference?

I do not know. The problem of confounders seems unassailable. And yet…
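
To make the education and income example above concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration: a single hypothetical confounder called `ability`, an assumed true effect of 10 income units, and the specific coefficients. It compares the naive college vs. no-college difference with the difference obtained when college is assigned by coin flip.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
TRUE_EFFECT = 10.0  # assumed causal effect of college on income (made-up units)

# Hypothetical confounder: "ability" raises both college attendance and income.
ability = rng.normal(size=n)
noise = rng.normal(size=n)

def income(college, ability, noise):
    """Toy structural equation for income (all coefficients are invented)."""
    return 30.0 + TRUE_EFFECT * college + 5.0 * ability + noise

# Observational world: higher ability makes college more likely.
p_college = 1.0 / (1.0 + np.exp(-2.0 * ability))
college_obs = (rng.random(n) < p_college).astype(float)
income_obs = income(college_obs, ability, noise)
naive = income_obs[college_obs == 1].mean() - income_obs[college_obs == 0].mean()

# Randomized world: a coin flip decides college, independent of ability.
college_rct = (rng.random(n) < 0.5).astype(float)
income_rct = income(college_rct, ability, noise)
randomized = income_rct[college_rct == 1].mean() - income_rct[college_rct == 0].mean()

print(f"naive difference:      {naive:.2f}")
print(f"randomized difference: {randomized:.2f}")
```

In this toy setup the naive gap overstates the assumed effect because the college group also has higher ability, while the randomized comparison lands close to it. With real data we never get to see `ability` or the structural equation, which is exactly the difficulty.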

What do you think?

I am very interested in trying to understand the topic, and there’s no better way to do so than discussing it with the community. What’s your take on all this?


This whole thing is new to me, so I do not know what I’m talking about. :wink: But the above reminds me of backpropagation: we use calculus to see how much influence each of the learnable parameters has on the final outcome. It’s possible to slightly tweak each parameter while keeping the others constant, and observe the effect on the outcome (does it go up or down, and by how much?). But this is a slow process, since you need to do this for each parameter individually. Using derivatives makes this much easier, as we can now do it for all parameters at once. Maybe there is a sort of calculus for doing this with causal relations too, so that we don’t actually have to examine each possible dimension of variation.
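
The comparison between nudging one parameter at a time and using derivatives can be sketched on a tiny linear model. This is only an illustration with made-up data and a mean squared error loss, not how any deep learning library actually implements backpropagation; the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # made-up inputs
y = rng.normal(size=50)        # made-up targets
w = rng.normal(size=4)         # the "learnable parameters"

def loss(w):
    """Mean squared error of a linear model."""
    residual = X @ w - y
    return np.mean(residual ** 2)

def finite_difference_grad(w, eps=1e-5):
    """Tweak one parameter at a time, holding the others constant."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (loss(w + step) - loss(w - step)) / (2 * eps)
    return grad

def analytic_grad(w):
    """Derivative of the loss with respect to all parameters at once."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

grad_fd = finite_difference_grad(w)
grad_exact = analytic_grad(w)
print("one-at-a-time:", grad_fd)
print("all at once:  ", grad_exact)
```

The one-at-a-time version needs two loss evaluations per parameter, while the derivative gives every entry in a single pass. Both only describe how the model's output responds to its own parameters, though, which is a statement about the fitted function rather than about the world.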

Just wanted to add that I’m about 3/4 of the way through The Book of Why, and I’ve gotten a lot out of it. The chapter on looking at statistical paradoxes (Monty Hall, Simpson’s paradox, etc.) through the lens of causality is fabulous, definitely worth buying the book for that chapter alone. Collider bias is one of the most interesting things I’ve learned about in a long time, and it’s so simple!
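
For anyone who hasn't met collider bias yet, here is a small simulation sketch of the idea. The variable names (`talent`, `looks`, `famous`) and the selection threshold are invented for illustration: two traits are generated independently, and then we look only at the cases where their sum clears a bar.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two hypothetical traits, generated independently of each other.
talent = rng.normal(size=n)
looks = rng.normal(size=n)

# The collider: both traits feed into whether someone ends up "famous".
famous = (talent + looks) > 1.0

corr_all = np.corrcoef(talent, looks)[0, 1]
corr_famous = np.corrcoef(talent[famous], looks[famous])[0, 1]

print(f"correlation in everyone:        {corr_all:.3f}")
print(f"correlation among the 'famous': {corr_famous:.3f}")
```

In the full sample the correlation is close to zero, but within the selected group it turns clearly negative, even though neither trait influences the other. Conditioning on the common effect is what creates the association.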

I’ve also had fun with this course on causal diagrams: The extra practice working with causal DAGs really complements the book.

About the controversy itself: I don’t really see it as being all that controversial (seems like what LeCun is saying). Deep learning is obviously great, but Pearl’s point is that causality is fundamentally an extra-statistical concept: causality uses probability and statistics (and I’m putting deep learning under that umbrella), but it isn’t reducible to probability and statistics. I think that’s a really exhilarating idea, and not something I had ever thought about before! But thinking that doesn’t mean you should turn off your GPU and stick to drawing causal DAGs on a whiteboard :slight_smile:


Pearl developed such a calculus, which he called the “do calculus”. If you are interested, Pearl’s book summarizing much of this work is called Causality. I think Pearl is a genius, but I find his writing style a bit convoluted. A closely related resource that I found quite good is Daphne Koller’s MOOC on Probabilistic Graphical Models. Unfortunately, it seems that it is no longer free of charge.
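
To give a flavor of what this framework buys you, here is a sketch of backdoor adjustment, one of its simplest results: if a variable Z blocks all backdoor paths between X and Y, then P(y | do(x)) = Σ_z P(y | x, z) P(z). The causal graph (Z → X, Z → Y, X → Y), the probabilities, and the true effect of +0.2 are all assumptions invented for this toy example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Assumed causal graph (hypothetical): Z -> X, Z -> Y, X -> Y.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.8, 0.2)   # Z makes X more likely
p_y = 0.1 + 0.2 * x + 0.5 * z               # assumed true effect of X: +0.2
y = rng.random(n) < p_y

# Purely associational contrast: P(Y=1 | X=1) - P(Y=1 | X=0).
naive = y[x].mean() - y[~x].mean()

# Backdoor adjustment: average the within-stratum contrast over P(z).
adjusted = 0.0
for z_val in (False, True):
    stratum = z == z_val
    effect_in_stratum = y[stratum & x].mean() - y[stratum & ~x].mean()
    adjusted += effect_in_stratum * stratum.mean()

print(f"naive contrast:    {naive:.3f}")
print(f"adjusted contrast: {adjusted:.3f}")
```

The adjusted number recovers the assumed effect only because the code was told which variable to adjust for. That knowledge comes from the causal diagram, not from the data.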

I was really talking about the regular, capital C calculus. :slight_smile: But I’ve bought the Book of Why and am interested to learn more.

The key point is that data & calculation are insufficient for causal inference. You can find associations (for example by looking at partial derivatives, as you point out) but that is not the same as causal relationships.

Thanks for the link to the Harvard EdX MOOC, it’s really good!
