Two questions about chapter 9 of Fastbook

enr · May 4, 2020, 5:12am

First, in chapter 9 of Fastbook, the section on partial dependence says:

Instead, what we do is replace every single value in the YearMade column with 1950, and then calculate the predicted sale price for every auction, and take the average over all auctions. Then we do the same for 1951, 1952, and so forth until our final year of 2011. This isolates the effect of only YearMade (even if it does so by averaging over some imagined records where we assign a YearMade value that might never actually exist alongside some other values).

It is not clear to me which values are replaced for each year. Can somebody explain it in other words?

Second, chapter 9 also says:

The fact that there are two variables pertaining to the “model” of the equipment, both with similar very high cardinalities, suggests that they may contain similar, redundant information. Note that we would not necessarily see this in the dendrogram, since that relies on similar variables being sorted in the same order (that is, they need to have similarly named levels). Having a column with 5000 levels means needing 5000 columns in our embedding matrix, so this would be nice to avoid if possible. Let’s see what the impact of removing one of these model columns has on the random forest:

Why does it remove fiModelDescriptor and not fiModelDesc? Maybe that is a mistake in the book?

FraPochetti · May 4, 2020, 6:09am

Say YearMade has 5 unique values: 1950, 1960, 1970, 1980, 1990.
Here is what we do, in pseudocode.

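# random_forest is already trained; df_original holds the feature columns
avg_preds = {}
for year in sorted(df_original.YearMade.unique()):
    df = df_original.copy()
    df["YearMade"] = year               # overwrite the whole column with one year
    preds = random_forest.predict(df)   # predict only, no retraining
    avg_preds[year] = preds.mean()      # one average prediction per year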

Basically, we replace the entire YearMade column in a loop with all its possible values and run predictions on top of this made-up dataset.
This allows answering the following, very powerful question: how much would that piece of equipment have cost, had it been produced in 1950, all other things being equal? And in 1960? And in 1970?
I.e. we can isolate the effect of YearMade and figure out its impact on the dependent variable, controlling for all the other variables.
Beware of the limitations of Partial Dependence Plots though! They are pretty much worthless when features are highly correlated. More details here.
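
For anyone who wants to run the loop above end to end, here is a minimal sketch. It assumes scikit-learn and pandas are installed and uses a small synthetic dataset instead of the real auction data; the MachineHours column, the price formula and all the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the auction data (all names and numbers are made up).
rng = np.random.default_rng(0)
n = 500
df_original = pd.DataFrame({
    "YearMade": rng.choice([1950, 1960, 1970, 1980, 1990], size=n),
    "MachineHours": rng.uniform(0, 10_000, size=n),
})
# Made-up rule: price rises with YearMade and falls with usage, plus noise.
price = (
    0.1 * (df_original["YearMade"] - 1950)
    - 0.0002 * df_original["MachineHours"]
    + rng.normal(0, 0.1, size=n)
)

# Train once. The model is NOT retrained inside the loop below.
random_forest = RandomForestRegressor(n_estimators=50, random_state=0)
random_forest.fit(df_original, price)

partial_dependence = {}
for year in sorted(df_original["YearMade"].unique()):
    df = df_original.copy()
    df["YearMade"] = year                  # every row pretends to be from this year
    preds = random_forest.predict(df)      # the other columns are untouched
    partial_dependence[int(year)] = float(preds.mean())

for year, avg in partial_dependence.items():
    print(year, round(avg, 2))
```

Each printed row is one point of the partial dependence plot: the average predicted price if every record had been made in that year.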

I agree this looks like a mistake. Had noticed the same. Maybe we should file a GitHub issue?

enr · May 4, 2020, 7:45am

For the first case, if we have a table with some values like this:
year, value
1950, 4
1950, 3
1960, 7
1980, 9
1990, 2

Does it mean that we run the random forest once per year, excluding the rows that belong to the other years?
Or how do you replace those numbers?

FraPochetti · May 4, 2020, 7:50am

Nope, it means you replace the entire column with one year at a time, and run the trained random forest's predictions on the resulting dataset. No rows are excluded.
E.g.

replace column year with 1950
year, value
1950, 4
1950, 3
1950, 7
1950, 9
1950, 2
replace column year with 1960
year, value
1960, 4
1960, 3
1960, 7
1960, 9
1960, 2
replace column year with 1970
year, value
1970, 4
1970, 3
1970, 7
1970, 9
1970, 2
keep going for all unique values of year
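
The same replacement on the toy table can be sketched in pandas. This assumes pandas is installed; the names `table` and `replaced` are just placeholders for illustration.

```python
import pandas as pd

# The toy table from the question above.
table = pd.DataFrame({
    "year":  [1950, 1950, 1960, 1980, 1990],
    "value": [4, 3, 7, 9, 2],
})

# One modified copy per candidate year; no rows are dropped.
replaced = {}
for year in [1950, 1960, 1970]:
    modified = table.copy()
    modified["year"] = year      # overwrite the whole column
    replaced[year] = modified

print(replaced[1960])
```

The original `table` is left untouched, and every copy still has all five rows; only the year column differs between copies.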

Makes sense?

enr · May 4, 2020, 8:02am

I understand how the replacement is done, but what I do not understand yet is how the random forest is able, with that procedure, to tell us which years have a higher influence.

FraPochetti · May 4, 2020, 8:36am

At each iteration of the loop you:

replace the entire column with a specific year
rerun the predictions of the already-trained Random Forest (this means RF will see all the other variables unaltered, except for year)
take the average of the predicted dependent variable

Let’s assume we are doing regression (again on price_of_equipment).
The end result might look like the following.

year | average_prediction_price_of_equipment
1950 | 2.1
1960 | 3.0
1970 | 4.3
1980 | 5.1
1990 | 6.4

This table lets you state that price_of_equipment increases roughly linearly with year_of_production.
If you think about it, this is quite a powerful statement.
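
If you want to check the "roughly linear" claim on the made-up table above, a quick sketch (assuming NumPy is installed) is to fit a straight line and look at how far each point is from it:

```python
import numpy as np

# The illustrative table from above (made-up numbers).
years = np.array([1950, 1960, 1970, 1980, 1990])
avg_pred = np.array([2.1, 3.0, 4.3, 5.1, 6.4])

# Fit avg_pred = slope * (year - 1950) + intercept.
x = years - 1950
slope, intercept = np.polyfit(x, avg_pred, 1)
residuals = avg_pred - (slope * x + intercept)

print("slope per year:", slope)
print("largest deviation from the line:", np.abs(residuals).max())
```

The largest deviation is small compared to the overall change from 2.1 to 6.4, which is what "roughly linear" means here.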

You might ask: wait, I already had price_of_equipment and year_of_production in my original dataset. Why can’t I just produce a scatter plot of one VS the other? Why do I need RF at all?

The reason you CAN’T do the above and that you NEED a model is that plotting price_of_equipment VS year_of_production would show a univariate relationship between these variables. You don’t want that.
A univariate relationship does not take ALL the other variables into account, which is what a model (especially good non-linear ones) does by default.
A univariate plot would answer the question: what is the relationship between price_of_equipment and year_of_production?
The real question to ask is, instead: what is the relationship between price_of_equipment and year_of_production, all other things being equal?
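
To make the difference concrete, here is a small sketch on synthetic data, assuming scikit-learn and pandas are installed. Everything in it is invented for illustration: a hypothetical `size` column is set up so that newer machines tend to be smaller, while the true rule says price goes up with year when size is held fixed. The correlation is kept moderate on purpose, so that old and new machines still overlap in size.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 2000
years = [1950, 1960, 1970, 1980, 1990]
year = rng.choice(years, size=n)
years_since_1950 = year - 1950

# Hypothetical confounder: newer machines in this fake data tend to be smaller.
size = 10 - 0.1 * years_since_1950 + rng.normal(0, 3, size=n)
# Made-up true rule: price goes UP with year (all else equal) and up with size.
price = 0.05 * years_since_1950 + 1.0 * size + rng.normal(0, 0.2, size=n)

X = pd.DataFrame({"year_of_production": year, "size": size})

# Univariate view: average observed price per year, ignoring everything else.
univariate = X.assign(price=price).groupby("year_of_production")["price"].mean()

# Partial dependence: train once, then overwrite the year column and predict.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, price)
pdp = {}
for y in years:
    X_mod = X.copy()
    X_mod["year_of_production"] = y
    pdp[y] = float(model.predict(X_mod).mean())

print("univariate change 1950 -> 1990:", univariate.loc[1990] - univariate.loc[1950])
print("partial dependence change 1950 -> 1990:", pdp[1990] - pdp[1950])
```

In this setup the univariate averages fall with year, because newer machines are smaller, while the partial dependence rises with year, which matches the rule the data was generated from.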

Again, below is a true piece of gold. Go through it