Lesson 5 wiki

jeremy · November 10, 2017, 6:23am

This is a forum wiki thread, so you all can edit this post to add/change/organize info to help make it better! To edit, click on the little pencil icon at the bottom of this post.


Lesson resources

Lesson video
Lesson notes from @hiromi
Lesson notes thanks to @timlee
Notebooks are lesson2-rf_interpretation.ipynb and lesson3-rf_foundations.ipynb

Notes from @melissa.fabros

Resources:

waterfall chart python library by @cpcsiszar

Questions

Q: What should I be able to do at this point in the class?
That’s a great question! You should be able to replicate everything you’ve seen so far on a different structured dataset. We’ve used Blue Book for Bulldozers as our canonical example of a structured dataset. A structured dataset has rows and columns, where every column represents a different feature. Unstructured data, by contrast, is something like images, where a single pixel doesn’t map to a specific feature or label.

You should be able to import data, apply a random forest regressor (RFR) to the dataset, and get a reasonable score.
You should be able to identify the important features in the dataset after the RFR analysis, as well as how confident you are about each prediction.
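One common way to judge confidence in a single prediction is to look at how much the individual trees disagree. The sketch below is an illustration under assumptions, not the notebook's code: the per-tree predictions are made-up numbers, and `prediction_with_spread` is a hypothetical helper. With scikit-learn, the per-tree values would typically come from calling `predict` on each tree in the fitted forest's `estimators_` list.

```python
from statistics import mean, pstdev

def prediction_with_spread(tree_predictions):
    """Return (mean, population std dev) of the per-tree predictions for one row."""
    return mean(tree_predictions), pstdev(tree_predictions)

# Hypothetical per-tree predictions for two rows (illustrative numbers only).
agreeing_trees = [10.1, 10.0, 10.2, 10.1, 9.9]     # trees mostly agree
disagreeing_trees = [9.2, 10.9, 10.1, 11.4, 8.8]   # trees disagree a lot

for name, preds in [("common row", agreeing_trees), ("unusual row", disagreeing_trees)]:
    avg, spread = prediction_with_spread(preds)
    print(f"{name}: prediction={avg:.2f}, spread across trees={spread:.2f}")
```

A large spread suggests the forest has seen few rows like this one, so the averaged prediction deserves less trust.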

You should be able to identify the relationship between an independent variable and the dependent variable, and not in a messy univariate way. You’ll have to explain what’s driving the outcome and how it’s driving it; most likely you’ll show this by creating a partial dependence plot.
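The idea behind a partial dependence plot can be sketched in a few lines: hold every other column as it is, force one column to a fixed value in every row, predict, and average. The code below is a minimal sketch under assumptions: `partial_dependence` and `toy_predict` are hypothetical names, the column names are made up, and the toy function stands in for a fitted model's prediction.

```python
def partial_dependence(predict, rows, col, grid):
    """For each value in grid, set column `col` to that value in every row,
    predict each modified row, and return the average prediction per value."""
    curve = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)   # copy so the original data is untouched
            modified[col] = value
            preds.append(predict(modified))
        curve.append(sum(preds) / len(preds))
    return curve

# Stand-in for a fitted model (assumption): any row -> prediction function works.
def toy_predict(row):
    year_made, hours_used = row    # hypothetical column names
    return 2.0 * year_made - 0.5 * hours_used

rows = [[1990, 10], [2000, 30], [2010, 20]]
curve = partial_dependence(toy_predict, rows, col=0, grid=[1990, 2000, 2010])
print(curve)  # [3970.0, 3990.0, 4010.0]
```

Because the other columns keep their real values, the curve isolates the effect of the chosen column instead of mixing it with whatever happens to be correlated with it, which is what a plain univariate plot would do.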

More tools to help interpret a random forest model

Waterfall Plots

Waterfall plots are very useful. While they’re native to Excel and second nature to MBA students, Python doesn’t yet have a good library for them. We hope that you’ll pick up creating this library and become famous for doing so. If you’d like to contribute to this project, this forum might be the place for you.
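For anyone curious what such a library has to compute, here is a minimal sketch of the layout step only. `waterfall_bars` is a hypothetical helper and the numbers are illustrative. Each bar starts where the running total left off, so the result could be handed to a plotting call that accepts a per-bar bottom, such as matplotlib's `bar(x, height, bottom=...)`.

```python
def waterfall_bars(start, contributions):
    """Return a list of (label, bottom, height) tuples for a waterfall chart.

    contributions is a list of (label, delta) pairs. Each delta bar is drawn
    from the running total; a negative height means the bar extends downward.
    """
    bars = [("start", 0.0, start)]
    running = start
    for label, delta in contributions:
        bars.append((label, running, delta))
        running += delta
    bars.append(("total", 0.0, running))
    return bars

# Illustrative values: a starting value and two changes (assumed numbers).
for label, bottom, height in waterfall_bars(10.0, [("x", 1.0), ("y", -0.5)]):
    print(f"{label:>5}: bottom={bottom:.1f}, height={height:+.1f}")
```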

Tree Interpreter

Creating a Validation Set is the Most Important Thing You Can Do When Building a Model

Introduction to Object Oriented Programming in Python

(Hat tip to @parrt and @timlee for sharing notes! Thanks! )

melissa.fabros · November 15, 2017, 10:58pm

Hi @jeremy,

Is it still possible to edit the lesson 6 wiki? Or should I just update notes in the reply?

Thanks!
Melissa

jeremy · November 15, 2017, 11:07pm

Looks like I’d forgotten to make it a wiki @melissa.fabros! Fixed now.

afiodorov · April 20, 2019, 3:25pm

@jeremy I have been involved in fitting and interpreting RFs for 2 years and I absolutely love this course: it talks about problems I have encountered myself, and I love learning your approaches. One small thing: the treeinterpreter method has been shown to be inconsistent in the following paper: https://arxiv.org/abs/1802.03888. This means that features that are actually unimportant can end up looking important, and vice versa. I would recommend teaching the https://github.com/slundberg/shap library instead, which doesn’t have this problem.

jeremy · April 20, 2019, 4:12pm

Thanks @afiodorov. Yeah we have a lengthy diatribe about that issue:

However, it’s a bit more nuanced than the level I was looking to go to in this course, and in practice treeinterpreter works OK (and there are approaches other than SHAP that I prefer for fixing it).

prashanttz · October 7, 2019, 7:22pm

Hi @jeremy, in the explanation of the contributions we sorted the scores and talked about the contribution of each variable in general terms, e.g. variable y has an impact of -0.4. But shouldn’t one of the following two things happen instead?

1. Since these are relative scores, we should not have sorted them, and instead of saying in general that a variable has an impact of -0.4, we should say "variable y contributes -0.4 OVER AND ABOVE the parent/previous variable, i.e. x". Because of this, the scores shouldn’t be sorted.
2. If we want a general sense, the difference should be calculated from the original value (the value of the root node) rather than from the previous/parent node; then we could sort without a problem.

example:
mean of root: 10
mean of level 1 : 11
mean of level 2 : 10.6

Here, level 2 in the current approach would be -0.4, but compared to the general population it is actually positive: +0.6
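The two readings can be put side by side in a small sketch. It assumes, as described above, that each contribution is the change in node mean relative to the parent and is credited to the variable split on at that step. `path_contributions` is a hypothetical helper, not the treeinterpreter API.

```python
def path_contributions(path_means, split_features):
    """Return (bias, contributions) for one decision path.

    bias is the root mean; each contribution is (feature, child mean - parent mean).
    """
    bias = path_means[0]
    contributions = [
        (feature, child - parent)
        for feature, parent, child in zip(split_features, path_means, path_means[1:])
    ]
    return bias, contributions

path_means = [10.0, 11.0, 10.6]   # root, level 1, level 2 (from the example)
split_features = ["x", "y"]       # variable split on at each step

bias, contributions = path_contributions(path_means, split_features)
prediction = bias + sum(delta for _, delta in contributions)
vs_root = [m - bias for m in path_means[1:]]   # the alternative: compare to the root

for feature, delta in contributions:
    print(f"{feature}: {delta:+.1f} relative to parent")
print(f"bias {bias:.1f} + contributions = {prediction:.1f}")
print("relative to root:", [f"{d:+.1f}" for d in vs_root])
```

The parent-relative values (+1.0 and -0.4) add up with the bias to the final prediction of 10.6, while the root-relative values (+1.0 and +0.6) do not, since the second already includes the first.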

Thank You
Prashant
https://www.linkedin.com/in/prashanttz/

randy912 · February 18, 2020, 4:19pm

Are there videos or blog posts explaining these methods?