In chapter 9 of Fastbook, in the part about partial dependence, it says:
Instead, what we do is replace every single value in the YearMade column with 1950, and then calculate the predicted sale price for every auction, and take the average over all auctions. Then we do the same for 1951, 1952, and so forth until our final year of 2011. This isolates the effect of only YearMade (even if it does so by averaging over some imagined records where we assign a YearMade value that might never actually exist alongside some other values).
It is not clear to me which values are replaced for each year. Can somebody explain it in other words?
Also in chapter 9, it says:
The fact that there are two variables pertaining to the “model” of the equipment, both with similar very high cardinalities, suggests that they may contain similar, redundant information. Note that we would not necessarily see this in the dendrogram, since that relies on similar variables being sorted in the same order (that is, they need to have similarly named levels). Having a column with 5000 levels means needing 5000 columns in our embedding matrix, so this would be nice to avoid if possible. Let’s see what the impact of removing one of these model columns has on the random forest:
Why does it remove fiModelDescriptor and not fiModelDesc? Maybe that is a mistake in the book?
Say YearMade has 5 unique values: 1950, 1960, 1970, 1980, 1990.
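Here is what we do, in pseudocode:
pdp = {}
for year in df_original.YearMade.unique():
    df = df_original.copy()
    df.YearMade = year                 # every row now pretends to be from `year`
    preds = random_forest.predict(df)
    pdp[year] = preds.mean()           # one averaged prediction per year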
Basically, we replace the entire YearMade column, in a loop, with each of its possible values in turn, and run predictions on top of this made-up dataset.
This allows answering the following, very powerful question: how much would that piece of equipment have cost, had it been produced in 1950, all other things being equal? And in 1960? And in 1970?
I.e. we can isolate the effect of YearMade and figure out its impact on the dependent variable, controlling for all the other variables.
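To make the loop concrete, here is a minimal runnable sketch on toy data. It assumes scikit-learn's RandomForestRegressor; the data is synthetic, and using ProductSize as a numeric code plus the price formula are made up for illustration, not taken from the book's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption: NOT the book's auction dataset).
rng = np.random.default_rng(0)
n = 1000
df_original = pd.DataFrame({
    "YearMade": rng.choice([1950, 1960, 1970, 1980, 1990], size=n),
    "ProductSize": rng.integers(0, 3, size=n),  # hypothetical numeric size code
})
# Made-up rule: price grows with year and with size, plus some noise.
price = (
    1_000 * (df_original["YearMade"] - 1950)
    + 20_000 * df_original["ProductSize"]
    + rng.normal(0, 2_000, size=n)
)

random_forest = RandomForestRegressor(n_estimators=50, random_state=0)
random_forest.fit(df_original, price)

# Partial dependence by hand: overwrite YearMade, predict, average.
pdp = {}
for year in sorted(df_original["YearMade"].unique()):
    df = df_original.copy()
    df["YearMade"] = year                  # every row pretends to be from `year`
    preds = random_forest.predict(df)      # one prediction per row
    pdp[int(year)] = float(preds.mean())   # average over all rows

for year, avg_price in pdp.items():
    print(year, round(avg_price))
```

Each entry of `pdp` is one point of the partial dependence plot. scikit-learn also ships `sklearn.inspection.partial_dependence`, which does essentially this averaging for you.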
Beware of the limitations of Partial Dependence Plots though! They are pretty much worthless when features are highly correlated, because the made-up rows then describe combinations that never occur in the real data. More details here.
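A rough way to see the problem with correlated features is to count how many rows become unrealistic once YearMade is overwritten. The sketch below uses synthetic data, and the MachineHours column and its link to YearMade are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
year = rng.choice([1950, 1960, 1970, 1980, 1990], size=n)
# Hypothetical feature strongly tied to YearMade: older machines have more hours.
hours = (2000 - year) * 100 + rng.normal(0, 200, size=n)
df_original = pd.DataFrame({"YearMade": year, "MachineHours": hours})

# Range of MachineHours actually observed for each real YearMade.
observed = df_original.groupby("YearMade")["MachineHours"].agg(["min", "max"])

def share_unrealistic(target_year):
    """Fraction of rows whose MachineHours lies outside the range observed
    for target_year, i.e. rows that turn into never-seen combinations
    once YearMade is overwritten with target_year."""
    lo = observed.loc[target_year, "min"]
    hi = observed.loc[target_year, "max"]
    inside = df_original["MachineHours"].between(lo, hi)
    return float(1.0 - inside.mean())

for y in observed.index:
    print(y, round(share_unrealistic(y), 2))
```

When most substituted rows are combinations the model never saw, the averaged prediction mostly reflects how the model extrapolates, not how the data behaves.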
I agree this looks like a mistake. Had noticed the same. Maybe we should file a GitHub issue?
I understand how the replacement is done, but what I do not understand yet is how the random forest is able, with that procedure, to tell us which years have a higher influence.
This table lets you state that price_of_equipment increases linearly with year_of_production.
If you think about it, this is quite a powerful statement.
You might ask: wait, I already had price_of_equipment and year_of_production in my original dataset. Why can’t I just produce a scatter plot of one VS the other? Why do I need RF at all?
The reason you CAN’T do the above and that you NEED a model is that plotting price_of_equipment VS year_of_production would show a univariate relationship between these variables. You don’t want that.
A univariate relationship does not take ALL the other variables into account, which is what a model (especially good non-linear ones) does by default.
A univariate plot would answer the question: what is the relationship between price_of_equipment VS year_of_production?
The real question to ask is, instead: what is the relationship between price_of_equipment VS year_of_production all other things being equal?
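The difference between the two questions shows up clearly on toy data with a confounder. This is a sketch under stated assumptions: the data is synthetic, the is_large column and the price rule are invented for illustration, and the model is scikit-learn's RandomForestRegressor.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 4000
years = np.array([1950, 1960, 1970, 1980, 1990])
decade = rng.integers(0, 5, size=n)            # 0 -> 1950, ..., 4 -> 1990
# Hypothetical confounder: newer machines in this toy data are less often large.
p_large = 0.9 - 0.2 * decade
is_large = (rng.random(n) < p_large).astype(int)

X = pd.DataFrame({"year_of_production": years[decade], "is_large": is_large})
# Made-up rule: +10k per decade, +60k for a large machine, plus noise.
price = 10_000 * decade + 60_000 * is_large + rng.normal(0, 2_000, size=n)

# Univariate view: plain average price per year, ignoring everything else.
univariate = X.assign(price=price).groupby("year_of_production")["price"].mean()

# "All other things being equal" view: partial dependence from a model.
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, price)
partial = {}
for year in years:
    X_mod = X.copy()
    X_mod["year_of_production"] = year         # overwrite the year for every row
    partial[int(year)] = float(model.predict(X_mod).mean())

for year in years:
    print(int(year), round(univariate.loc[year]), round(partial[int(year)]))
```

In this toy setup the raw averages fall with year, because newer machines are mostly small, while the partial dependence rises with year, which matches the rule used to generate the prices.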
Again, below is a true piece of gold. Go through it