As with most things, Leo Breiman said it much better than I can, so I’ll start by referring you all to his paper “Statistical Modeling: The Two Cultures” (http://projecteuclid.org/euclid.ss/1009213726), which I think is perhaps the most important paper in data science written in the last 20 years.
Basically, the contention is this: given the choice between a simple “interpretable” model and a complex but extremely accurate model, I’d take the latter every time. Because if it’s really accurate, we can ask it questions like:
- What would you predict if I gave you this data (to which I already know the answer)?
- Which groups would you give similar predictions to?
- If I vary just one data item up and down, how do your predictions change?
- If I randomly shuffle one data item, how much less accurate do you become? (There’s a sketch of this just after the list.)
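To make the first and last of these questions concrete, here’s a minimal sketch of the shuffle test in Python, with a random forest standing in for the “complex but extremely accurate” model. The synthetic dataset and the model here are placeholders I’ve picked purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data and model: any accurate model with a predict method would do.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# "What would you predict if I gave you this data (to which I already know the answer)?"
baseline = accuracy_score(y_valid, model.predict(X_valid))
print(f"baseline accuracy: {baseline:.3f}")

# "If I randomly shuffle one data item, how much less accurate do you become?"
# Shuffling a column breaks its relationship with the target, so the drop in
# accuracy is a rough measure of how much the model relies on that column.
rng = np.random.default_rng(0)
for col in range(X_valid.shape[1]):
    X_shuffled = X_valid.copy()
    rng.shuffle(X_shuffled[:, col])
    drop = baseline - accuracy_score(y_valid, model.predict(X_shuffled))
    print(f"column {col}: accuracy drop {drop:+.3f}")
```

The columns whose shuffling costs the most accuracy are the ones the model leans on most heavily.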
These kinds of questions allow us to get comfortable that a model is accurate under a range of conditions, understand what clusters it sees in the data, see how each input variable impacts the outcome of the model (using a partial plot), and see which variables are most important (using a variable importance plot). These are just a few examples of the ways we can interrogate an accurate model.
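For those two plots, scikit-learn’s inspection module already does the work: `permutation_importance` automates the shuffle test from the sketch above, and `PartialDependenceDisplay` draws the partial plot. As before, the data and model below are just illustrative placeholders:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data and model, as in the earlier sketch.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Variable importance plot: shuffle each feature several times on the
# validation set and plot the average drop in accuracy.
imp = permutation_importance(model, X_valid, y_valid, n_repeats=10, random_state=0)
order = imp.importances_mean.argsort()
plt.barh([f"feature {i}" for i in order], imp.importances_mean[order])
plt.xlabel("mean drop in accuracy when shuffled")

# Partial plot: vary one feature across its range, average the model's
# predictions over the rest, and plot the result (features 0 and 1 here).
PartialDependenceDisplay.from_estimator(model, X_valid, features=[0, 1])
plt.show()
```

Between these two plots you can see both which variables matter and how each one moves the model’s prediction.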
I agree with Rachel that we still have a lot to learn about how best to understand (and debug) these kinds of models. But I think that Breiman’s basic insights hold, and that gives me a lot of confidence that the answers exist - we just have to find them.