Lesson 7 - Official topic

Similar in the sense that both are small samples of the data, but different in that the mini-batches within an epoch are all disjoint, whereas the bootstrap samples used to train each tree can share overlapping records/data points. I also think a random forest samples a subset of the features for each tree rather than using all of them, while a mini-batch uses all of the features (please correct me if I’m wrong).
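The contrast can be sketched with just the standard library (the ten-row "dataset" below is a toy, and the batch size is arbitrary):

```python
import random

random.seed(0)
dataset = list(range(10))  # ten toy row indices

# Random-forest style: each tree gets a bootstrap sample drawn WITH
# replacement, so rows can repeat within a sample and overlap across trees.
bootstrap_samples = [[random.choice(dataset) for _ in dataset] for _ in range(3)]

# SGD style: an epoch shuffles the rows once and partitions them into
# disjoint mini-batches, so every row appears exactly once per epoch.
shuffled = dataset[:]
random.shuffle(shuffled)
batch_size = 5
mini_batches = [shuffled[i:i + batch_size]
                for i in range(0, len(shuffled), batch_size)]

print(bootstrap_samples[0])  # usually contains duplicates
print(mini_batches)          # always a clean partition of the dataset
```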

At the end of the day it is almost the same, even though the two aren’t fully comparable.
The data points you calculate the out-of-fold error on pass through EVERY tree in the ensemble, whereas with OOB this is not guaranteed.
In that sense OOB is more prone to overfitting, but it still gives a good perspective.
And it comes for free.
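The difference is easy to see if you track which trees are allowed to score each row. A minimal pure-Python sketch (the forest here is just a list of bootstrap index samples, no actual trees):

```python
import random

random.seed(42)
n_rows, n_trees = 8, 5

# Each "tree" is represented only by its bootstrap sample of row indices.
bootstraps = [[random.randrange(n_rows) for _ in range(n_rows)]
              for _ in range(n_trees)]

# For the OOB score, row i is predicted only by trees whose bootstrap
# sample happened to leave row i out (often only ~1/3 of the forest).
# With out-of-fold CV, by contrast, the held-out fold is scored by a
# forest trained on all remaining rows, so every tree contributes.
oob_trees = {i: [t for t, sample in enumerate(bootstraps) if i not in sample]
             for i in range(n_rows)}

for i, trees in oob_trees.items():
    print(f"row {i}: OOB-eligible trees {trees} ({len(trees)}/{n_trees})")
```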

@jeremy, does that method of measuring feature importance end up being roughly the same as shuffling the values in a column and measuring how much the model’s performance degrades?

Yes. This is correct. It gives more or less the same results.
I actually like your approach better.
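The shuffling approach described above is usually called permutation importance, and the idea fits in a few lines of plain Python. The "model" and data below are made-up toys: the model leans heavily on feature 0, lightly on feature 1, and ignores feature 2 entirely.

```python
import random

random.seed(0)

# Hypothetical stand-in for a trained model.
def model(row):
    return 3.0 * row[0] + 0.1 * row[1]

rows = [[random.random() for _ in range(3)] for _ in range(200)]
targets = [model(r) for r in rows]  # noise-free toy targets, so baseline error is 0

def mse(candidate_rows):
    return sum((model(r) - t) ** 2
               for r, t in zip(candidate_rows, targets)) / len(candidate_rows)

baseline = mse(rows)

# Permutation importance: shuffle one column at a time and measure how much
# the error grows. Columns the model relies on hurt more when scrambled.
importance = {}
for j in range(3):
    col = [r[j] for r in rows]
    random.shuffle(col)
    shuffled_rows = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
    importance[j] = mse(shuffled_rows) - baseline

print(importance)  # feature 0 should dominate; feature 2 is unused
```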

How is this true? I don’t think it is. Is there something I am missing?

To clarify: I guess bagging with 5 decision trees is equivalent to 5-fold CV of a decision tree.

Side note: when we are doing a time-based train/valid split (e.g. future sales prediction), the OOB score is less useful because the bootstrap sampling shuffles the data.

Is cluster_columns doing something like hierarchical clustering? The plot looks very similar…
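The essence of a cluster_columns-style plot can be sketched in plain Python: rank-correlate every pair of columns, turn similarity into a distance, then merge the closest pair first, which is exactly what agglomerative clustering does. The column names and values below are made up, and ties in the rank helper are broken arbitrarily for simplicity:

```python
def ranks(xs):
    # Map each value to its position in sorted order (ties broken arbitrarily).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    # Pearson correlation of the ranks = Spearman rank correlation.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra) ** 0.5
    vb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (va * vb)

cols = {
    "YearMade": [1990, 1995, 2000, 2005, 2010],
    "age":      [30, 25, 20, 15, 10],   # perfectly anti-monotone in YearMade
    "noise":    [3, 1, 4, 1, 5],
}
names = list(cols)

# Strongly (anti-)correlated columns get distance near 0, so a dendrogram
# would join them lowest in the tree.
dist = {(a, b): 1 - abs(spearman(cols[a], cols[b]))
        for a in names for b in names if a < b}
closest = min(dist, key=dist.get)
print(closest)  # the pair a dendrogram would merge first
```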

Yep, agreed.

I think we are going to discuss the concept behind boosting briefly in this chapter (so this lesson or the next).

The results and technical aspects of creating the model are interesting, but I am wondering about the utility of the interpretation. If the goal was to predict the sale price of the bulldozer, the tree seems to indicate that newer, larger vehicles fetch higher prices. This seems like a lot of work just to confirm what is already intuitive when selling used vehicles.

It’s great that your intuition aligns with what the model found in this simple case! Unfortunately, often it’s quite difficult to discern which variables are most important, so you can imagine that in other cases this would be quite valuable.

What I meant is that you generally calculate your out-of-fold error using the entire Random Forest, trained on the rest of the dataset.

Yes, I agree. I guess I replied too quickly and didn’t think through the depth of the point you raised.

Ah OK, makes sense. Glad we agree 🙂

Sure, but we’re talking about this case. Does this mean we need to frame a better question about the predictions?

I think that the ability to understand and compare dependence on other, subdominant variables (such as ProductSize) is illuminating.

It may mean that there are better (or additional) questions to ask!

I consider interpretable ML (together with fairness and bias in AI) one of the most fascinating and crucial aspects of being an effective data scientist.
A while ago I put together this tutorial with a review of the latest techniques/algorithms/approaches for “looking” into a model.

Side note: this is a true gem. Cannot recommend it enough. And it covers NN interpretation too!

On a related note, PyTorch has a great package for interpretable neural networks here:

Isn’t partial dependence a better way of understanding variable importance (better than scikit-learn’s automated way of doing it based on tree splits)?
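Partial dependence answers a slightly different question than importance: holding the other features as observed, how does the average prediction move as one feature is swept across a grid? A minimal sketch, where the model, feature names, and numbers are all made up for illustration:

```python
# Hypothetical trained model: price rises with YearMade, falls with size code.
def model(year_made, product_size):
    return 20000 + 500 * (year_made - 1990) - 1000 * product_size

# Toy "observed" rows: (YearMade, ProductSize).
rows = [(1992, 1), (1998, 0), (2004, 2), (2010, 1)]

def partial_dependence(grid):
    pd = []
    for v in grid:
        # Force YearMade to v in every row, keep ProductSize as observed,
        # then average the predictions over the dataset.
        preds = [model(v, size) for _, size in rows]
        pd.append(sum(preds) / len(preds))
    return pd

grid = [1990, 2000, 2010]
print(partial_dependence(grid))  # rises linearly with YearMade, as the toy model dictates
```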

Hi MJB, hope all is well!

The amazing part is that we are using software to model intuition; even if you have no prior intuition or domain experience, you can apply these techniques and probably compete with others who have many years of experience.

Cheers, mrfabulous1 😀😀

This is the original blog post from Ando Saabas who, back in 2014, invented the treeinterpreter approach.
True genius.
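The core idea of that approach can be sketched on a hand-built toy tree (the node values, features, and thresholds below are invented): every prediction decomposes exactly into the root's mean plus one contribution per split, attributed to the feature that split used.

```python
# Toy regression tree as nested dicts; leaves have feature=None.
tree = {
    "value": 10.0, "feature": "YearMade", "threshold": 2000,
    "left":  {"value": 7.0, "feature": None},
    "right": {"value": 14.0, "feature": "ProductSize", "threshold": 1,
              "left":  {"value": 12.0, "feature": None},
              "right": {"value": 16.0, "feature": None}},
}

def predict_with_contributions(node, row):
    bias = node["value"]  # mean target at the root
    contributions = {}
    while node["feature"] is not None:
        child = (node["left"] if row[node["feature"]] <= node["threshold"]
                 else node["right"])
        # Each split shifts the running mean; credit that shift to the
        # feature the split tested.
        delta = child["value"] - node["value"]
        contributions[node["feature"]] = (
            contributions.get(node["feature"], 0.0) + delta)
        node = child
    return node["value"], bias, contributions

pred, bias, contribs = predict_with_contributions(
    tree, {"YearMade": 2005, "ProductSize": 2})
print(pred, bias, contribs)
# The decomposition is exact: prediction == bias + sum of contributions.
assert abs(pred - (bias + sum(contribs.values()))) < 1e-9
```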

His entire blog is a goldmine btw.
