(Asutosh) #633

Hello @fumpen,

As Jeremy says in one of his lectures, we can't use any of the test data for calibration. Think of it as data you don't have until you've completely finished training your model; otherwise you can't get a true measure of performance.

(Vishal ) #634

Hello Everyone,

In Lecture 2, @jeremy explains how a decision tree is formed by selecting, at each step, the variable and split point that yield the lowest MSE (as per the naive model). Can someone please explain why exactly this is the splitting methodology? According to another source, decision tree splitting is done using 'Information Gain'. How are these two (MSE and Information Gain) connected?

(Asutosh) #635

Hello @vahuja4, think of it this way:

Information Gain = MSE at the root node - average MSE of the children after splitting

So IG is highest when the average MSE drops the most. Both are basically indicating the same thing.
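Concretely, here is a small NumPy sketch with toy numbers (not from the lecture) showing that "information gain" is just the drop in MSE from the parent node to the weighted average of its children:

```python
import numpy as np

def mse(y):
    """MSE of predicting the mean for every point in y."""
    return np.mean((y - y.mean()) ** 2)

# Toy target values at a node, and a candidate split into two children.
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
left, right = y[:3], y[3:]

# Weighted average MSE of the children after the split.
avg_child_mse = (len(left) * mse(left) + len(right) * mse(right)) / len(y)

info_gain = mse(y) - avg_child_mse  # big drop => big gain
```

Here the split cleanly separates the low values from the high ones, so the children's MSE is tiny and the gain is almost the whole root MSE.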

(Vishal ) #636

I see. But why is this termed 'Information Gain'? Also, do you know why this is the chosen methodology for splitting?

(Asutosh) #637

@vahuja4

1. I think it is termed that way by convention: the closer your predictions come to the actual values, the more information you seem to have gained. MSE denotes the gap between the actual values and the model's predictions, so the smaller that gap becomes (i.e. the more the MSE drops), the more information can be thought of as gained.

2. As you know, in DecisionTreeRegressor the prediction at a node is the average of all the data points belonging to it. Our ultimate goal is to make this average as close as possible to the actual values.
So we basically do a brute-force search over all possible splits and check which one gives averages closest to the actual values, using MSE as the metric to measure that closeness.
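That brute-force search can be sketched in a few lines (a simplified single-feature version, not how sklearn implements it internally):

```python
import numpy as np

def best_split(x, y):
    """Try every possible split point on one feature x; return the split
    whose children have the lowest weighted-average MSE."""
    def mse(v):
        return np.mean((v - v.mean()) ** 2) if len(v) else 0.0

    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)  # (split value, weighted child MSE)
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # can't split between identical feature values
        split = (x[i] + x[i - 1]) / 2
        lhs, rhs = y[:i], y[i:]
        score = (len(lhs) * mse(lhs) + len(rhs) * mse(rhs)) / len(y)
        if score < best[1]:
            best = (split, score)
    return best

# The split lands between the two clusters, where child MSE is lowest.
split, score = best_split(np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0]),
                          np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0]))
```

A real tree repeats this search over every feature and recurses into each child.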

(Vishal ) #638

@asutosh97, thank you! Makes sense.

Hey guys, if you want an even deeper understanding of your tree-based models/xgb/sklearn etc., check these cool repos out.

What are your thoughts on this, Jeremy?

(It's really nice to be able to interpret the black boxes properly…)

Both look promising.

(Asutosh) #640

Looks interesting, where can I find documentation on using this?

Probably the notebooks are there…

The plots look amazing…

https://nbviewer.jupyter.org/github/slundberg/shap/tree/master/notebooks/

(Kofi Asiedu Brempong) #642

I'm trying out the techniques learnt in lessons 1 and 2 on the house prices Kaggle competition.

The training set has 1460 rows; should I still split it in two to get a separate validation set, or should I just rely on oob_score?

(Kofi Asiedu Brempong) #643

I submitted my predictions to Kaggle.
On my validation set I had a root mean squared error of 0.0486755, but on Kaggle my error was 0.14651, placing me around 2407 on the leaderboard.
Model is at

would be glad if you could have a look at it and help me improve my score.

(Vineeth Kanaparthi) #644

I just wish you would start an international fellowship program for this course too, making it possible for international students to make the most of it, and also for part 2 of the ML course. The first seven videos of ML1 are a gold mine for tree-based models.

(Will) #645

You are over-fitting. It also looks like your validation set or oob_score isn't representative of the Kaggle test set. The way Jeremy recommends fixing this is to try out 5 or so different validation sets on models of different 'goodnesses' and submit the results to Kaggle. Plot the score on your different validation sets against your score on the test set to compare the relationship. What you're looking for is a roughly straight line, indicating improved performance on the test set as your score improves on your validation set.

Another way to reduce over-fitting is to increase your min_samples_leaf parameter. You can also reduce the max_features parameter to increase the diversity of the trees you're creating.
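Those two knobs can be compared side by side on synthetic data (a minimal sketch; the parameter values below are illustrative, not from the thread):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0,
                       random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# Baseline: fully grown trees tend to memorise the training set.
base = RandomForestRegressor(n_jobs=-1, random_state=42)
base.fit(X_train, y_train)

# Regularised: bigger leaves + fewer candidate features per split,
# which makes the individual trees more diverse and less overfit.
reg = RandomForestRegressor(min_samples_leaf=5, max_features=0.5,
                            n_jobs=-1, random_state=42)
reg.fit(X_train, y_train)

print("base: train", base.score(X_train, y_train),
      "valid", base.score(X_valid, y_valid))
print("reg:  train", reg.score(X_train, y_train),
      "valid", reg.score(X_valid, y_valid))
```

The regularised model's train score drops (it can no longer memorise), and you watch whether the validation score holds up or improves.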

Hope this helps!

#646

Regarding the discussion here https://youtu.be/3jl2h9hSRvc?t=635 on why the OOB score in a random forest would be lower than the validation score: I understand Jeremy's point, but I wonder if this is only true under the assumption that there is little or no overfitting on the training data? Overfitting on the training data seems possible to me, since each out-of-bag row has been seen by at least some trees, while the validation set is entirely unseen by any tree.
Thanks for any help.

(Edit: I now see here https://youtu.be/3jl2h9hSRvc?t=1208 that Jeremy does mention that OOB could be better than validation score. But not sure if that is the same scenario as I mentioned above.)

(Utkarsh Mishra) #647

After doing the feature engineering on my training set, how do I apply the same changes to the test set, like one-hot encoding the columns or parsing the dates from "SaleDate" into "Is_month_end", "sale_month" and the other changes made to the training set? Should I merge the two sets at the start, perform the feature engineering, and then split them again for training?
Or is there a better way of doing it?

#648

Is this the machine learning course referred to in Lecture 1, Part 1?

(Sanyam Bhutani) #649

The split should happen after feature engineering. Otherwise you won’t be able to run your RF on it.
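One common pattern for keeping the two sets consistent (a sketch with made-up columns, not from the thread): write the feature-engineering steps as a function, apply it to both sets, then align the one-hot columns so the test set has exactly the training columns:

```python
import pandas as pd

train = pd.DataFrame({"saledate": pd.to_datetime(["2021-01-31", "2021-06-15"]),
                      "color": ["red", "blue"]})
test = pd.DataFrame({"saledate": pd.to_datetime(["2022-03-31"]),
                     "color": ["green"]})

def add_date_parts(df):
    """Same date feature engineering, applied identically to both sets."""
    df = df.copy()
    df["sale_month"] = df.saledate.dt.month
    df["Is_month_end"] = df.saledate.dt.is_month_end
    return df.drop(columns="saledate")

train_fe = pd.get_dummies(add_date_parts(train))
test_fe = pd.get_dummies(add_date_parts(test))

# Align dummy columns to the training set: categories unseen at train
# time are dropped, missing ones are filled with 0.
test_fe = test_fe.reindex(columns=train_fe.columns, fill_value=0)
```

This avoids the leakage risk of merging train and test before engineering, while still guaranteeing the model sees identical columns at predict time.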

Yes, this is the same.

(sid) #650

What was the fix for this, do you remember?

(sid) #651

So I watched lectures 1 and 2 and tried to enter https://www.kaggle.com/c/house-prices-advanced-regression-techniques
using similar code to the tutorial, but got less than stellar results. Has anyone tried that dataset, and what scores did you get with the same kind of code?

(Vineeth Kanaparthi) #653

In the first lesson, in the bagging section of the notebook:

```python
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)
```

```python
preds = np.stack([t.predict(X_valid) for t in m.estimators_])
preds[:,0], np.mean(preds[:,0]), y_valid[0]
```

I'm getting an error at `t.predict(X_valid) for t in m.estimators_`:

```
AttributeError                            Traceback (most recent call last)
<ipython-input-52-098c212805dc> in <module>()
----> 1 preds = np.stack([t.predict(X_valid) for t in clf.estimators_])
      2 preds[:,0], np.mean(preds[:,0]), y_valid[0]

<ipython-input-52-098c212805dc> in <listcomp>(.0)
----> 1 preds = np.stack([t.predict(X_valid) for t in clf.estimators_])
      2 preds[:,0], np.mean(preds[:,0]), y_valid[0]

AttributeError: 'numpy.ndarray' object has no attribute 'predict'
```

The classifier I am using is sklearn's GradientBoostingClassifier; in the notebook it is RandomForestRegressor. The notebook code is not working with the gradient boosting classifier.

Did anyone else encounter this problem? Please provide your views on this.

Also, is there any workaround for accessing the individual estimators of a gradient boosting classifier?
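A likely cause of that AttributeError, assuming standard scikit-learn behaviour: `GradientBoostingClassifier.estimators_` is a 2-D NumPy array of shape `(n_estimators, K)` holding one `DecisionTreeRegressor` per boosting stage and class, so iterating over it yields rows (ndarrays) rather than tree objects. `RandomForestRegressor.estimators_` is a flat list of trees, which is why the notebook code works there. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = GradientBoostingClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# estimators_ is a 2-D ndarray of DecisionTreeRegressor objects with
# shape (n_estimators, K); for binary classification K == 1.
print(clf.estimators_.shape)

# Index into each row to reach the underlying trees. Note these are
# regression trees predicting gradient contributions, not class labels.
stage_preds = np.stack([row[0].predict(X) for row in clf.estimators_])
```

For per-stage class predictions, `clf.staged_predict(X)` is the supported API and is usually more useful than poking at the raw trees.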