Another treat! Early access to Intro To Machine Learning videos

Mainly so that our feature importance plots are easier to interpret. Otherwise the importance measures can be split over multiple related features. Also, lots of highly related variables can mean that they are over-represented in the random sets of features we pick at each level.
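For example, a quick made-up illustration of the importance getting split (not from the course notebooks, just a sketch):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
x = rng.rand(1000)
# Columns 0 and 1 are nearly identical; column 2 is pure noise
X = np.c_[x, x + 0.01 * rng.rand(1000), rng.rand(1000)]
y = 3 * x + rng.rand(1000)

m = RandomForestRegressor(n_estimators=100).fit(X, y)
print(m.feature_importances_)   # the signal's importance gets shared between columns 0 and 1
```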

1 Like

There’s a horrible, horrible Python ‘feature’: if you use a dict or list as a default parameter, the exact same object gets reused on every call. It leads to incredibly confusing bugs. So never put a mutable object as a default param!
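For example, a minimal sketch of the trap (function names are just for illustration):

```python
def append_item(item, items=[]):    # the default list is created once, when the function is defined
    items.append(item)
    return items

print(append_item(1))   # [1]
print(append_item(2))   # [1, 2]  <- the same list object was silently reused!

# The usual workaround: default to None and build a fresh object inside the function
def append_item_fixed(item, items=None):
    if items is None:
        items = []
    items.append(item)
    return items

print(append_item_fixed(1))   # [1]
print(append_item_fixed(2))   # [2]
```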

5 Likes

Merging at the start, using pandas, is normally the best approach. Although this particular competition seems a little different - it’s essentially a collaborative filtering problem. So for my initial analysis I haven’t merged any tables!
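When merging is the way to go, it’s normally just a left join in pandas - a minimal sketch with tiny made-up frames:

```python
import pandas as pd

train = pd.DataFrame({'store_nbr': [1, 1, 2], 'unit_sales': [3.0, 5.0, 2.0]})
stores = pd.DataFrame({'store_nbr': [1, 2], 'city': ['Quito', 'Guayaquil']})

# Left-join so every training row is kept and picks up the store-level columns
merged = train.merge(stores, how='left', on='store_nbr')
print(merged)
```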

2 Likes

Just added lesson 5 video to the top post.

8 Likes

If I am reading this right, having correlated columns can lead to our trees repeatedly choosing to split on what those columns represent (even with the parameter that only shows each split a subset of columns to choose from), and we might not get a chance to explore other, potentially useful splits?

Would this be an argument for additional preprocessing - doing PCA or something like that?

Slightly, yes, but really it’s the interpretation issue that’s important.

No, that totally kills interpretability, and also using a linear preprocessing approach can destroy the signal that a nonlinear model like RF can find.

Instead, simply change max_features to a higher number :slight_smile:
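For example (the parameter values here are just for illustration):

```python
from sklearn.ensemble import RandomForestRegressor

# As a float, max_features is the fraction of columns each split may consider -
# e.g. 0.8 means 80% of the columns per split, so other useful columns still get
# a look even when a group of correlated columns is over-represented.
m = RandomForestRegressor(n_estimators=40, max_features=0.8, n_jobs=-1)
# m.fit(X_train, y_train)
```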

1 Like

@radek, maybe I’ve found a way to end the head banging over this doubt. Please consider this example, just slightly more complex than your A A’ B B’ example (which is too simple in this case and is probably the root of the problem - notice the joke with “root”? OK…):

1. Don’t think of many trees, think about only one decision tree.
2. Don’t think of logs, think about a dataset of 4 persons.
3. The target is “life expectancy”. The 3 features are “age is > 70”, “smoker”, “likes pink color”.
4. Tell me if order matters. :grinning:
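If it helps to see it concretely, here’s a minimal sketch of that toy setup (the feature values and life expectancies are completely made up):

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

people = pd.DataFrame({
    'age_over_70': [1, 1, 0, 0],
    'smoker':      [1, 0, 1, 0],
    'likes_pink':  [0, 1, 1, 0],
})
life_expectancy = [72, 85, 78, 90]   # made-up target values

# With 4 rows and max_depth=2 (log2(4) = 2 levels if the splits are balanced),
# only two of the three features can appear on any path from the root.
tree = DecisionTreeRegressor(max_depth=2).fit(people, life_expectancy)
print(export_text(tree, feature_names=list(people.columns)))
```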

2 Likes

Here is a very good explanation of what Jeremy is talking about. Thanks a lot for this very important information, @jeremy - I didn’t know that!

2 Likes

thank you @miguel_perez, this is awesome :slight_smile:

Ahhh I see what you did there :slight_smile: We can’t construct a full tree!!! log2(4) = 2! So it makes a difference what we split on first - we only get two splits. Potentially the order of “age is > 70” and “smoker” doesn’t matter? Both trees, regardless of which split we start with, should have equivalent MSE?

1 Like

Hahaha! Amazing you were able to find a subset of my example in which order doesn’t matter!!! :joy:
So true, you are right about the subset… but (I hope) the example proves that order does matter in trees. (Hope so, because that example is my best shot) :slight_smile:

1 Like

Could you tell me more about this? I’m not sure I understand this collaborative filtering thing. Do we not merge at the start because:

  • We want to avoid ending up with a very big file? In that case we find the features that we are most interested in before merging?

  • Or is it because these particular tables are a bit tricky to merge?
    For instance, in my kernel, when I had to merge holidays_events I got rid of the holidays which were not in the Ecuador local_name column, first to remove duplicated holidays (multiple rows with the same date) and also to allow the merge to happen (as I’m merging on date, which needs to have unique values) - roughly the kind of thing sketched below. But this decision should be taken at the “feature engineering” level (after merging), as I got rid of features which could hold important information.
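Roughly what I mean, as a sketch with tiny made-up frames (the real column names in holidays_events.csv may differ a bit):

```python
import pandas as pd

# Tiny stand-ins for train.csv and holidays_events.csv
train = pd.DataFrame({'date': pd.to_datetime(['2016-12-24', '2016-12-25', '2016-12-26']),
                      'unit_sales': [5.0, 2.0, 3.0]})
holidays = pd.DataFrame({'date': pd.to_datetime(['2016-12-25', '2016-12-25']),
                         'locale_name': ['Ecuador', 'Quito'],
                         'description': ['Navidad', 'Navidad (local)']})

# Keep only the Ecuador-wide rows, then drop duplicate dates so the merge on
# 'date' cannot multiply the training rows
holidays = holidays[holidays['locale_name'] == 'Ecuador'].drop_duplicates(subset='date')

train = train.merge(holidays, how='left', on='date')
print(train)
```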

Thanks :slight_smile:

Yes, I think the example works - thank you very much :slight_smile:

I am curious - but if we had more data, say 8 people :wink:, it wouldn’t hurt us to split on all 3 levels? I mean, here it probably would, since with that little data liking pink might be noise that we would be fitting…

But with many trees and much more data, and by giving the trees the ability to only look at a subset of features to split on (so that we explore all potential splits), etc., we never have to worry about filtering out features manually? Basically it is as if the random forest was specifically designed to separate signal from noise, and by doing things manually we would probably not add much value?

Oh I just found this here:

Another nice feature of decision trees built through CART is that they automatically put aside the non-important variables (only the best splitters are selected at each split). In the seminal book by Hastie et al. (2009), the authors showed that with 100 pure noise predictors, and 6 relevant predictors, the relevant variables were still selected 50% of the time at each split. So you really don’t need to worry about variable selection in RF.

Thank you very much again @miguel_perez for the example, was really helpful :smiley:

3 Likes

In the lectures, I think @jeremy says something along the lines of:

oh we need to take the log of the price as they only care about the ratio

Indeed, logarithms seem to have this funny property: log(a) - log(b) = log(a/b), so a difference in log space corresponds to a ratio in the original space.

Is this related to taking the log of the price? What does taking the log of the price help with?

The trees are massive! Somehow when you watch @jeremy run everything so quickly on the computer you don’t realize this…

I started playing around a bit more with the lesson 1 notebook. Each estimator uses a Tree instance behind the scenes, and it has some interesting instance variables.

For example, with just 20k training examples, m.estimators_[0].tree_.node_count evaluates to 21877! The max_depth is 30, while log2(20_000) is only 14.29. If we always split the data into two equally sized groups, a depth of around 15 is the most any leaf would ever reach. So the tree we build can end up having some leaves that are very close to the root and some that are very far from it.

This came as a surprise to me so I thought I would share :slight_smile: I sort of had this image of a nice little graph in my head, and here a tree gets to completely lack symmetry and have more nodes than there are samples in the data! (No surprise there now that I think of it: every split node has two children, so 21877 total nodes means 10939 leaves plus 10938 nodes doing the splitting - that is a lot!)
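Here’s roughly how to poke at those attributes yourself (random made-up data standing in for the real notebook’s, just so the snippet runs on its own):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.random.rand(20_000, 10)
y = np.random.rand(20_000)
m = RandomForestRegressor(n_estimators=1).fit(X, y)

t = m.estimators_[0].tree_
print(t.node_count, t.max_depth)               # total nodes and depth of the deepest leaf
n_leaves = (t.children_left == -1).sum()       # leaf nodes have no children
print('leaves:', n_leaves, 'split nodes:', t.node_count - n_leaves)
```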

2 Likes

After reading the articles you shared on Kaggle, this “one hot encoding stuff for DT/RF” became much clearer, and it confirms the doubts I had about encoding features for these kinds of algorithms. Thanks a lot :slight_smile:

1 Like

Yes, that’s generally why people are most interested in log() for dependent variables where predicting the ratio accurately is more important than predicting the difference.
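For example (made-up prices, just to show the idea):

```python
import numpy as np

actual    = np.array([10_000, 100_000])
predicted = np.array([11_000, 110_000])    # both predictions are off by 10%

print(predicted - actual)                  # [ 1000 10000] -> very different absolute errors
print(np.log(predicted) - np.log(actual))  # [0.0953 0.0953] -> identical errors in log space
```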

Yeah it’s just that it’s such a big file in this case. Other than that, there’s no harm.

We’ll be studying collaborative filtering soon!

1 Like

What do you mean by predicting the difference vs predicting the ratio?

Sorry, this has to be something super basic - I tried googling for it, but all I find is people saying that we take the log of the dependent variable so that it better meets the assumptions of linear regression.

1 Like

Glad to hear that @Ekami!

In addition to the reasons not to use OHE, also notice a reason to actually use it, which I realized thanks to Jeremy’s explanation in last week’s lesson: assessing the importance of individual levels (as long as there are not many), as I commented earlier in this same thread.
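A minimal sketch of what I mean, using pd.get_dummies on a tiny made-up frame (UsageBand is just an example of a low-cardinality column):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    'UsageBand': ['High', 'Low', 'Medium', 'Low', 'High', 'Medium'],
    'YearMade':  [2004, 1998, 2001, 1995, 2006, 2000],
})
y = [60_000, 20_000, 35_000, 15_000, 70_000, 30_000]

# One 0/1 column per level, so the importance of each level can be read off directly
dummies = pd.get_dummies(df['UsageBand'], prefix='UsageBand')
X = pd.concat([df.drop(columns='UsageBand'), dummies], axis=1)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(X, y)
print(sorted(zip(X.columns, m.feature_importances_), key=lambda p: p[1], reverse=True))
```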

And thanks for your feedback, happy to be useful! :grinning:

I think this is an interesting enough question that I’ll cover it in the next class - if I remember!

1 Like