Wiki / Lesson Thread: Lesson 6

This is a forum wiki thread, so you can edit this post to add, change, or organize info to help make it better! To edit, click the little pencil icon at the bottom of this post.


Lesson resources

Course Notes: under construction

Random forest interpretation techniques & review

  • Confidence based on tree variance
  • Feature importance
  • Removing redundant features
  • Partial dependence
  • Tree interpreter
  • Extrapolation

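The first two techniques above can be sketched in a few lines. This is a minimal illustration using scikit-learn on toy data, not the lesson's actual notebook code (which uses the Bulldozers dataset); the dataset and model sizes here are made up:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy tabular data standing in for a real dataset (hypothetical)
X, y = make_regression(n_samples=500, n_features=6, random_state=0)

rf = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)

# Confidence based on tree variance: predict with every tree separately,
# then treat the std dev across trees as a per-row uncertainty estimate.
preds = np.stack([t.predict(X) for t in rf.estimators_])  # (n_trees, n_rows)
pred_mean, pred_std = preds.mean(axis=0), preds.std(axis=0)

# Feature importance: impurity-based importances come for free in sklearn
print(rf.feature_importances_)
```

Rows where `pred_std` is large are ones the trees disagree on, so the model is less confident there.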
I’ve just added the lesson video to the top post (currently uploading - will be available in ~30 mins).

@jeremy Could you please share the slides you showed us in this lecture, the ones on ML applications in different industries?

Thanks for the reminder. I’ve added it to git in the ‘ppt’ folder.

There is something in this lesson that I would like to clarify.

In the paper written by Jeremy back in 2012 (Designing great data products), at the end there is a link to a YouTube video: Jeremy Howard - From Predictive Modelling to Optimization: The Next Frontier. Around minute 12:03, Jeremy says:

“One of the big insights I want you to take away from this is … really what you want is data that tells you about causality not correlation. … Generally speaking, you do not have data about causality, you’ve got data about business as usual.”

Then he goes on to explain how he convinced his client to conduct randomized experiments in order to collect data about causality.

But in lesson 6 I believe Jeremy doesn’t talk about conducting randomized experiments in order to collect data about causality.

So my question is: is it the case that the invention of partial dependence plots has replaced the need for conducting randomized experiments?

Unfortunately not.


@jeremy I have a question about your experience balancing theoretical rigor with being practical when it comes to causal relationships.
In the lesson around 15:45 you outlined how one could use feature importance to identify actionable features, and PDPs to generate simulations based on those features.
While this approach is appealing in its simplicity, it seems it can only be used if we are sure that the feature-target relationship estimated by the PDP reflects the true causal relationship in the problem.
Since modeling causal relationships in complex business settings is notoriously difficult, my question is: how can we make sure that our simulator (built via PDP) rests on the right causal assumptions and thus leads to realistic scenarios for our data product?
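For readers following along, the PDP under discussion can be computed by hand, which makes the causality caveat concrete: it averages the model's predictions while forcing one feature to each grid value, so it reflects the model's learned association, not an intervention in the real world. A minimal sketch with scikit-learn and toy data (not the lesson's code; names here are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=4, random_state=0)
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

def partial_dependence(model, X, feature_idx, grid):
    """For each grid value, set the chosen feature to that value for ALL
    rows, then average the model's predictions over the dataset."""
    pdp = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature_idx] = v
        pdp.append(model.predict(Xv).mean())
    return np.array(pdp)

grid = np.linspace(X[:, 0].min(), X[:, 0].max(), 10)
pdp = partial_dependence(rf, X, feature_idx=0, grid=grid)
```

Because `Xv` holds all other features fixed at their observed values, the resulting curve answers "what does the model predict if this feature were different?", which only matches "what would happen if we changed it?" when the model has captured the true causal relationship.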

There isn’t really any way to be sure, I’m afraid. You might be interested in this book about causality analysis:

“There isn’t really any way to be sure, I’m afraid”

Thanks a lot for the link! A follow-up question on the above: If there is no theoretical guarantee about the causal link, what’s the approach you would use to build your simulator on PDP (to be reasonably confident in the simulation)? How did you approach this in your previous business projects?

Randomized controlled trials! 🙂
