Another treat! Early access to Intro To Machine Learning videos

Yeah I think that’s a reasonable approach - you could even create a subclass which has the RF capabilities plus this extra one.

Thanks – I think the following code may combine the Random Forest capabilities with the extra max_samples capability:

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

class RandomPatchesRegressor(BaggingRegressor):
    def __init__(self, 
                 n_estimators=10, 
                 max_samples=1.0, 
                 max_features=1.0, 
                 bootstrap=True, 
                 bootstrap_features=False, 
                 oob_score=False, 
                 warm_start=False, 
                 n_jobs=1, 
                 random_state=None, 
                 verbose=0, 
                 criterion='mse', 
                 splitter='best', 
                 max_depth=None, 
                 min_samples_split=2, 
                 min_samples_leaf=1, 
                 min_weight_fraction_leaf=0.0, 
                 max_leaf_nodes=None, 
                 min_impurity_split=1e-07, 
                 presort=False):
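        # Configure the base decision tree with the tree-level hyper-parameters,
        # then hand it to BaggingRegressor, which supplies the ensemble-level
        # ones (including the max_samples capability discussed above).
        # Note: these parameter names follow an older sklearn API -
        # min_impurity_split and presort no longer exist in recent releases.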
        
        base_estimator = DecisionTreeRegressor(criterion=criterion, 
                splitter=splitter, 
                max_depth=max_depth, 
                min_samples_split=min_samples_split, 
                min_samples_leaf=min_samples_leaf, 
                min_weight_fraction_leaf=min_weight_fraction_leaf, 
                random_state=random_state, 
                max_leaf_nodes=max_leaf_nodes, 
                min_impurity_split=min_impurity_split, 
                presort=presort)
        
        BaggingRegressor.__init__(self, 
                base_estimator=base_estimator, 
                n_estimators=n_estimators, 
                max_samples=max_samples, 
                max_features=max_features, 
                bootstrap=bootstrap, 
                bootstrap_features=bootstrap_features, 
                oob_score=oob_score, 
                warm_start=warm_start, 
                n_jobs=n_jobs, 
                random_state=random_state, 
                verbose=verbose)
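
For example, each tree can then be trained on a random subsample of the rows (a usage sketch - X_train, y_train and X_valid are placeholders for your own data):

# Train 40 trees, each on a random 20% of the rows - the max_samples
# capability that motivated this subclass.
model = RandomPatchesRegressor(n_estimators=40, max_samples=0.2, n_jobs=-1)
model.fit(X_train, y_train)
preds = model.predict(X_valid)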

Nice! Have you tried it? Does it give the same results as random forest, if you use the whole dataset? What’s missing compared to the sklearn RF?

Presumably this has the same problem with OOB using too much data, for a large dataset?

I’m embarrassed, I should have tested before posting. Unfortunately it doesn’t give identical results to random forest even if you use the whole dataset. Here’s a small Jupyter Notebook to demonstrate:
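
The gist of the comparison looks something like this (a minimal sketch on a toy sklearn dataset, not the original notebook):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy data, standing in for the real dataset.
X, y = load_diabetes(return_X_y=True)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

rf = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_train, y_train)
rp = RandomPatchesRegressor(n_estimators=10, random_state=42).fit(X_train, y_train)

print('RF R^2:', r2_score(y_valid, rf.predict(X_valid)))
print('RP R^2:', r2_score(y_valid, rp.predict(X_valid)))
# The scores come out close but not identical: even with equivalent settings
# the two classes draw their bootstrap samples differently, and any
# max_features subsampling is applied per tree here rather than per split
# as in RandomForestRegressor.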

I think we can call that “close enough”! :slight_smile: The biggest issue is that this approach misses out on key functionality (e.g. I noticed that feature importance isn’t supported).

Good catch about feature importance, and thanks for your continued feedback! So I just added feature importance - but other key functionality may still be missing. Maybe this is a dead end, but it’s been an interesting experiment to learn from!
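
A sketch of one way it can be exposed (illustrative only, not necessarily the exact implementation - it relies on the estimators_, estimators_features_ and n_features_ attributes that the older BaggingRegressor sets after fitting):

import numpy as np

class RandomPatchesRegressorWithImportance(RandomPatchesRegressor):
    # Hypothetical subclass name, just for illustration.
    @property
    def feature_importances_(self):
        # Average each tree's importances, mapping its (possibly subsampled)
        # feature indices back to the original columns.
        importances = np.zeros(self.n_features_)
        for tree, features in zip(self.estimators_, self.estimators_features_):
            importances[features] += tree.feature_importances_
        return importances / len(self.estimators_)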

I just added lesson 4’s video to the top post (note that it’s still uploading as I type this - so if you don’t see it, try again in 30 mins).

I did not quite understand why feature importance changes a lot when we one-hot encode a categorical variable. The Partial Dependence plot and Tree Interpreter sections were really informative. After finding out that the sale price dip between 1990 and 1995 is due to other factors, suppose I suspect it is due to some variable X. If I fix X, run the random forest again with the real year-made values, and the resulting plot looks linear, can I establish a cause-effect relationship between X and the price dip?

Just a little question here regarding RF. I saw that encoding categorical features as continuous variables does not make sense for linear models, as the model would “think” that some categories are higher or lower than others, so we one-hot encode them instead.
Example with a color feature of values red, green, blue:
Continuous encoding gives:

| color |
|   1   |
|   2   |
|   3   |

One hot encoding gives:

| color_red | color_blue | color_green|
|     1     |     0      |     0      |
|     0     |     1      |     0      |
|     0     |     0      |     1      |
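
For concreteness, an illustrative pandas sketch that produces both encodings (the values are just for this example):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue']})

# "Continuous" (ordinal) encoding: a single column of integer codes
# (which integer each level gets depends on the category ordering).
df['color_code'] = df['color'].astype('category').cat.codes

# One-hot encoding: one 0/1 column per level.
one_hot = pd.get_dummies(df['color'], prefix='color')

print(df)
print(one_hot)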

But when it comes to RF, from intuition (and from @jeremy 's videos) I feel like encoding variables as continuous is not a problem, as the trees will eventually split on values which are greater/lower than some threshold for this feature. Could you confirm that? Thanks :slight_smile:

Yes, RF will eventually split out the values, but Jeremy mentioned that this approach is computationally more expensive (it can require many splits, versus a single split on a one-hot column), and also each of those splits is trained on less data, as you are fragmenting your data across many splits at different levels.

Not exactly - it really depends on the variable. I’ve found in practice generally avoiding one-hot encoding for all variables gives better accuracy.

@jeremy One thought I have had is that it would be nice to have a forum to discuss the ml1 course - something similar to the beginner forum just created, but for ml1 instead?

Yeah there is one, but it’s just for the masters students. So let’s use this thread to discuss it, if that’s OK…

Sure that works too.

I’m just glad to have the content at least. :slight_smile:

@Ekami, the answer to that would be a bit long. I wrote some quite complete posts on a -long- Kaggle thread two months ago; here is the link: https://www.kaggle.com/c/zillow-prize-1/discussion/38793#217608

It’s quite a long thread, but I think it can give you quite a complete idea about categorical variable encoding in tree-based models. There are also links inside to some very good articles.

One caveat: the thread doesn’t include the “embedding” approach. The reason is that I haven’t used the embedding approach so far, and to say that I understand it 100% would be… well, not true, so I couldn’t say much about it. Anyway, that thread has become quite well visited and I think it’s very informative on the subject.

And… I’m going to watch lesson 4 right now! I can’t keep pace with so much great stuff!!! :sweat:

And… just finished!

These lessons are fantastic. No matter whether you’re a beginner or already a “heavy” random forest user… there are many things to learn!

It’s funny, I had never considered one-hot encoding categoricals myself (for the reasons commented on in the cited thread)… but I had overlooked the big reason to consider OHE for low-cardinality variables: detecting the importance of individual levels! How could that have slipped my mind? :man_facepalming:

Once again, thank you Jeremy for sharing all this…! :grinning:

When growing a tree, why does it make a difference in what order we split things (using information gain and so on)?

If I have two features, one with two possible values A, A* and the other with B, B*, does it matter whether I first split on A and then split on B, or vice versa? Shouldn’t I still get the same result?

OR is what we are after simply to have leaves that are as pure as possible? Meaning, finding the truly optimal splits is a computationally hard problem, so we use information gain as a greedy proxy for making decisions, and the outcome we are after is maximally pure leaves? So the ideal situation would be that if I take a path like Left-Right-Right-Left, I end up at a leaf that is 100% one class?

@radek, I think at least part of the answer to that is that very frequently trees will not be grown to the max depth, and you are not using all the possible features at each level, so you will not have A and A* and B and B* in the same tree from beginning to end. There is a hierarchy of importance, and if you split on “the most important first” your trees are going to be as strong predictors as randomness and depth allow, even if not completely grown to max depth, or not seeing all the features.
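
As a toy sketch of how the greedy criterion decides which split comes first (made-up data, with variance reduction as the impurity measure):

import numpy as np

# Target and two made-up binary features.
y  = np.array([1., 1., 1., 0., 0., 0., 0., 1.])
fA = np.array([1, 1, 1, 1, 0, 0, 0, 0])
fB = np.array([1, 0, 1, 0, 1, 0, 1, 0])

def weighted_mse_after_split(feature, y):
    # Weighted average of the variance (MSE against the mean) in each branch.
    left, right = y[feature == 0], y[feature == 1]
    return (len(left) * left.var() + len(right) * right.var()) / len(y)

print('parent MSE :', y.var())
print('split on A :', weighted_mse_after_split(fA, y))
print('split on B :', weighted_mse_after_split(fB, y))
# The split giving the bigger impurity reduction (A here) is taken first;
# the later splits then only see the rows that fall into each branch.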

I wonder why, if I am building a single tree, I might not want to split it on everything that I can - I think I remember watching something about not doing this so as not to overfit, but I can’t recall.

Order of splits can matter: if, say, we always split on some particular variable first, the rest of the tree below that split now has only half as much data to do further splits with. Remember we’re limited to about log2(n) binary splits on any path from root to leaf (e.g. only ~20 splits for a million rows), so we don’t want to waste any!