Another treat! Early access to Intro To Machine Learning videos

The split should happen after feature engineering. Otherwise you won’t be able to run your RF on it.

Yes, this is the same.

what was the fix for this do you remember ?

So I watched lecture 1 and 2 and tried to enter this https://www.kaggle.com/c/house-prices-advanced-regression-techniques
using the similar code as mentioned in the tutorial but got less than stellar results. Has anyone tried with that dataset and what scores did you get with the same kind of code ?

In the first lesson

In the bagging section of the notebook,

m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)
preds = np.stack([t.predict(X_valid) for t in m.estimators_])
preds[:,0], np.mean(preds[:,0]), y_valid[0]

I'm getting an error on t.predict for t in m.estimators_:

AttributeError                            Traceback (most recent call last)
<ipython-input-52-098c212805dc> in <module>()
----> 1 preds = np.stack([t.predict(X_valid) for t in clf.estimators_])
      2 preds[:,0], np.mean(preds[:,0]), y_valid[0]

<ipython-input-52-098c212805dc> in <listcomp>(.0)
----> 1 preds = np.stack([t.predict(X_valid) for t in clf.estimators_])
      2 preds[:,0], np.mean(preds[:,0]), y_valid[0]

AttributeError: 'numpy.ndarray' object has no attribute 'predict'

The classifier I am using is sklearn's GradientBoostingClassifier; in the notebook it is a RandomForestRegressor.
The same code works with the RandomForestRegressor but not here.

Did anyone else encounter this problem? Please provide your views on this.

Also is there any workaround for accessing individual estimators for gradient boosting classifier?

The estimators_ attribute for random forests is a list of decision trees. However, for gradient boosting, estimators_ is a numpy array with an additional axis; think of it as a list of lists.

To access each underlying tree, index into the inner list, e.g. t[0].predict instead of t.predict in the case of a binary classifier.

Disclaimer: I have not tested this, but it should work
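To make the difference concrete, here is a small sketch (the toy dataset and n_estimators values are my own, not the lesson notebook's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Toy binary-classification data as a stand-in for the lesson's dataset
X, y = make_classification(n_samples=200, random_state=0)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
gb = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)

# Random forest: estimators_ is a flat Python list of decision trees
assert isinstance(rf.estimators_, list)

# Gradient boosting: estimators_ is a 2D numpy array of shape
# (n_estimators, K), where K is 1 for a binary classifier
print(gb.estimators_.shape)  # (10, 1)

# So index into each row to reach the underlying regression tree
preds = np.stack([t[0].predict(X) for t in gb.estimators_])
print(preds.shape)  # (10, 200)
```

Note that for a binary GradientBoostingClassifier the inner trees are regression trees, so their predict returns raw scores, not class labels.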

I got a score of 0.141

I get you, but my problem is how to get the different validation sets; the data is not a time series, so I am randomly selecting 20% of it for the validation set.
Can you think of any other ways I could construct the validation set?
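If you want several different validation sets rather than one random 20% holdout, k-fold splitting is one option. A sketch with sklearn's KFold (the array shapes here are just illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative data: 50 rows, 2 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 5 folds -> each fold holds out a different 20% of rows for validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, valid_idx in kf.split(X):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
    # fit and score a model on each split here
```

Averaging the validation score over the five folds gives a more stable estimate than a single random split.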

Mind sharing how you achieved it? I got 0.144.

Sure…
I did not do anything substantially different; the only new thing, I think, was that I removed some outliers, as was suggested in some of the kernels.
You can find it on GitHub: Link to the model

Hi @jeremy, that was an informative tip about using permutation importance over the default sklearn one. I have a couple of queries related to it.

  1. After plotting the permutation importances, do I have to remove the features that fall below the ‘random’ variable? I get a bunch of them below the ‘random’ feature.

  2. For default importance, it is advisable to look only at features with an importance value greater than 0.005. Does the same rule apply for permutation importance? Is there a threshold value I can use?
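For reference, here is how I'd sketch the ‘random’ baseline idea with sklearn's permutation_importance. The toy dataset and the keep-if-above-noise rule are my own assumptions, not something from the lecture:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Toy data plus one pure-noise 'random' column to use as a baseline
X, y = make_regression(n_samples=300, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X = np.hstack([X, rng.normal(size=(300, 1))])

m = RandomForestRegressor(n_jobs=-1, random_state=0).fit(X, y)
result = permutation_importance(m, X, y, n_repeats=10, random_state=0)

# One possible rule: keep only features whose mean importance beats
# the noise column's importance, rather than a fixed 0.005 cutoff
noise_importance = result.importances_mean[-1]
keep = [i for i in range(5) if result.importances_mean[i] > noise_importance]
```

The appeal of this rule is that the threshold adapts to the data instead of being a fixed magic number.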

In the function proc_df, what is the purpose of making new columns (feature_na) for the features that have missing values and filling them with 1 or 0?
How does it help our model make better predictions?

@utksh Remember, the model only works on numbers.

Yes, I know that.
When we are fixing missing values in numeric columns, we create a new boolean column df[column + '_na'], with 1 if the value was missing and 0 if it was not.
So why do we make this new column, since we already fixed the missing values by replacing them with the median?

@utksh hey, good question.

I guess Jeremy discusses in one of his later lectures that in one of his Kaggle contests, he had to predict whether a student got admission to a college or not based on some data (I don't remember whether this was exactly the problem or something similar). It turned out that the college administration was lazy about filling in data for students who didn't get admission, so the absence of data was the most relevant information there. We'll miss this if we simply impute the values. That's why these extra columns store this information, which can then be used by the model.

Hope this helps.
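A tiny illustration of the idea, on a toy frame of my own (not the fastai implementation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25.0, np.nan, 40.0, np.nan]})

# Record which rows were missing BEFORE imputing, then fill with the
# median; the model can still see which rows originally had no value
df['age_na'] = df['age'].isna().astype(int)
df['age'] = df['age'].fillna(df['age'].median())

print(df['age'].tolist())     # [25.0, 32.5, 40.0, 32.5]
print(df['age_na'].tolist())  # [0, 1, 0, 1]
```

If missingness correlates with the target (as in the admissions example), the age_na column carries signal the imputed values alone would hide.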

hey @yang-zhang, I have written an answer with pseudo-code for oob_score and set_rf_samples.
Have a look: Another treat! Early access to Intro To Machine Learning videos

It may help.

hey @utksh,

I think Jeremy also explained this at the beginning of the 2nd lesson.
To make sure that the imputations in train and test are the same, proc_df returns a dictionary named nas, which contains column names as keys and the imputed median values as values, so that those same values are reused when imputing the test set, rather than computing the test set's own medians separately.

You might also wonder about the case where a column, say age, has missing values only in the test set. In that case, the imputation is done using the median of the test set, as we don't have any other option. It is also made sure that an age_na feature is not added there, because that would cause an inconsistency in the number of features between the train and test sets (see the proc_df source code).
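A simplified sketch of the reuse behaviour. fix_missing here is my own toy helper, loosely modelled on proc_df, not the actual fastai code, and it skips the train/test feature-mismatch handling described above:

```python
import numpy as np
import pandas as pd

def fix_missing(df, col, nas=None):
    # Hypothetical helper: reuse a previously recorded median if one
    # was passed in, otherwise compute and record this frame's median
    if nas is None:
        nas = {}
    if df[col].isna().any() or col in nas:
        filler = nas.get(col, df[col].median())
        df[col + '_na'] = df[col].isna().astype(int)
        df[col] = df[col].fillna(filler)
        nas[col] = filler
    return nas

train = pd.DataFrame({'age': [20.0, np.nan, 30.0]})
test = pd.DataFrame({'age': [np.nan, 50.0]})

nas = fix_missing(train, 'age')  # records the train median, 25.0
fix_missing(test, 'age', nas)    # test reuses 25.0, not its own median

print(test['age'].tolist())      # [25.0, 50.0]
```

Without passing nas through, the test row would have been filled with the test set's own median (50.0), silently shifting the feature distribution between train and test.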

Hope this helps.

Hi,

I have followed the Normal Install Instructions from here.

After that, I don't know what to do. Can anyone help?

Thanks,
Sumit

Thanks man :slight_smile:
That helped.

@asutosh97 thank you for replying to my question. I could not find the pseudo-code in the link you provided. Maybe you meant a different link?

@yang-zhang
I’ve updated the link, try now.