Another treat! Early access to Intro To Machine Learning videos


(Sanyam Bhutani) #649

The split should happen after feature engineering. Otherwise you won’t be able to run your RF on it.

Yes, this is the same.


(sid) #650

What was the fix for this, do you remember?


(sid) #651

So I watched lectures 1 and 2 and tried to enter this competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
using code similar to that in the tutorial, but got less than stellar results. Has anyone tried that dataset, and what scores did you get with the same kind of code?


(Vineeth Kanaparthi) #653

In the first lesson

In the bagging section of the notebook,

m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)
preds = np.stack([t.predict(X_valid) for t in m.estimators_])
preds[:,0], np.mean(preds[:,0]), y_valid[0]

I'm getting an error at t.predict for t in m.estimators_:

AttributeError                            Traceback (most recent call last)
<ipython-input-52-098c212805dc> in <module>()
----> 1 preds = np.stack([t.predict(X_valid) for t in clf.estimators_])
      2 preds[:,0], np.mean(preds[:,0]), y_valid[0]

<ipython-input-52-098c212805dc> in <listcomp>(.0)
----> 1 preds = np.stack([t.predict(X_valid) for t in clf.estimators_])
      2 preds[:,0], np.mean(preds[:,0]), y_valid[0]

AttributeError: 'numpy.ndarray' object has no attribute 'predict'

The classifier I am using is sklearn's GradientBoostingClassifier; in the notebook it is a RandomForestRegressor.
It is not working for RandomForest regressor

Did anyone else encounter this problem? Please provide your views on this.

Also is there any workaround for accessing individual estimators for gradient boosting classifier?


#654

The estimators_ attribute for random forests is a list of decision trees. For gradient boosting, however, estimators_ is a NumPy array with an additional axis. Think of it as a list of lists.

To access the corresponding estimator, you just have to access the corresponding element of the list, e.g. t[0].predict instead of t.predict in the case of a binary classifier.

Disclaimer: I have not tested this, but it should work
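A minimal sketch of the difference described above, on a synthetic binary classification dataset (the data, the variable names and n_estimators=10 are assumptions for illustration, not from the notebook). Note that the individual trees inside a gradient boosting model are regression trees, so their outputs are raw stage values rather than class labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestRegressor

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Random forest: estimators_ is a flat list of fitted trees.
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
rf_preds = np.stack([t.predict(X) for t in rf.estimators_])

# Gradient boosting: estimators_ is a 2-D array, one row per boosting stage.
# For a binary classifier each row holds a single regression tree.
gb = GradientBoostingClassifier(n_estimators=10, random_state=0).fit(X, y)
gb_shape = gb.estimators_.shape

# Index into each row with t[0] before calling predict.
gb_preds = np.stack([t[0].predict(X) for t in gb.estimators_])
```

Here rf_preds and gb_preds both have one row per tree and one column per sample, so the rest of the bagging cell can index them the same way.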


(Kofi Asiedu Brempong) #655

I got a score of 0.141


(Kofi Asiedu Brempong) #656

I get you, but my problem is how to get the different validation sets. The data is not a time series, so I am randomly selecting 20% of it for the validation set.
Can you think of any other ways I could construct the validation set?
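For reference, the random 20% hold-out described above can be sketched with scikit-learn's train_test_split. The toy arrays and the helper name random_holdout are made up for illustration; changing the seed draws a different random 20%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the processed features and target (illustrative only).
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

def random_holdout(X, y, seed):
    """Hypothetical helper: hold out a random 20% of the rows for validation."""
    return train_test_split(X, y, test_size=0.2, random_state=seed)

X_train, X_valid, y_train, y_valid = random_holdout(X, y, seed=42)
```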


(sid) #657

Mind sharing how you achieved it? I got 0.144.


(Kofi Asiedu Brempong) #658

Sure…
I did not do anything substantially different. The only new thing, I think, was that I removed some outliers, as was suggested in some of the kernels.
You can find it on GitHub. Link to the model


(Arul Bharathi) #659

Hi @jeremy, the tip to use permutation importance over the default sklearn one was informative. I have a couple of queries related to it.

  1. After plotting the permutation importances, do I have to remove the features that rank below the ‘random’ variable? I get a bunch of them below the ‘random’ feature.

  2. With the default importance, it is advisable to look only at features with an importance value greater than 0.005. Does the same rule apply to permutation importance? Can I use a threshold on the value?
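A rough sketch of the setup behind these questions: add a column of pure noise named 'random', fit a random forest, and compare each feature's permutation importance with that column. This uses scikit-learn's permutation_importance on synthetic data, which is an assumption here; it may not be the tool used in the lesson, and the feature names are made up.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression data with an extra pure-noise column appended.
X, y = make_regression(n_samples=300, n_features=4, n_informative=2, random_state=0)
X = np.column_stack([X, rng.normal(size=len(X))])
names = ["f0", "f1", "f2", "f3", "random"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)
m = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
m.fit(X_train, y_train)

# Shuffle each column of the validation set in turn and record the score drop.
result = permutation_importance(m, X_valid, y_valid, n_repeats=5, random_state=0)

# Features sorted from most to least important; compare each to 'random'.
ranked = sorted(zip(names, result.importances_mean), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name:>8}: {score:.4f}")
```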


(Utkarsh Mishra) #660

In the function proc_df, what's the purpose of making new columns (feature_na) for the features that have missing values and filling them with either 1 or 0?
How does that help our model predict better?


(Sanyam Bhutani) #661

@utksh Remember, the model only works on numbers.


(Utkarsh Mishra) #662

Yes, I know that.
When we fix missing values in a numeric column, we create a new column df[column+'_na'] and fill it with booleans: 1 if the value was missing and 0 if it was not.
So why do we make this new column, since we have already fixed the missing values by replacing them with the median?


(Asutosh) #663

@utksh hey, good question.

I think Jeremy discusses this in one of his later lectures. In one of his Kaggle contests he had to predict whether a student got admission to college or not, based on some data. (I don’t remember whether this was exactly the problem or something similar.) It turned out that the college administration had not bothered to fill in some of the data for students who didn’t get admission. So the absence of data was the most relevant information there. We would lose it if we simply imputed the values. That’s why these extra columns store that information, which the model can then use.
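A tiny illustration of the idea, using made-up data and a hypothetical score column (this is a sketch in plain pandas, not the actual proc_df code). After median imputation the filled rows look like ordinary values, but the _na indicator still records that they were missing.

```python
import pandas as pd

# Made-up data: 'score' happens to be missing exactly for non-admitted students.
df = pd.DataFrame({
    "score": [70.0, None, 90.0, None],
    "admitted": [1, 0, 1, 0],
})

# Record which rows were missing BEFORE imputing, then fill with the median.
df["score_na"] = df["score"].isnull().astype(int)
df["score"] = df["score"].fillna(df["score"].median())

print(df)
```

In this toy frame the imputed score of 80 says nothing about admission, while score_na lines up with the target exactly.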

Hope this helps.


(Asutosh) #664

Hey @yang-zhang, I have written an answer with pseudocode for oob_score and set_rf_samples.
Have a look: Another treat! Early access to Intro To Machine Learning videos

It may help.


(Asutosh) #665

hey @utksh,

I think Jeremy also explained this at the beginning of the 2nd lesson.
To make sure the imputations in train and test are the same, proc_df returns a dictionary named nas, with column names as keys and the imputed median values as values. These values are then used when imputing the test set, instead of computing the test set's own medians.

You might also consider the case where a column, say age, has missing values only in the test set. In that case the imputation uses the median of the test set itself, as there is no other option. proc_df also makes sure that no age_na feature is added, because that would make the number of features in the train and test sets inconsistent (see the proc_df source code).
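The behaviour described above can be sketched in plain pandas. The impute helper and the toy frames are hypothetical and only mimic what is described here; they are not the fastai implementation.

```python
import pandas as pd

train = pd.DataFrame({"age": [20.0, None, 40.0], "height": [150.0, 160.0, 170.0]})
test = pd.DataFrame({"age": [None, 30.0], "height": [None, 165.0]})

# Medians of the training columns that have missing values (plays the role of nas).
nas = {c: train[c].median() for c in train.columns if train[c].isnull().any()}

def impute(df, nas):
    """Hypothetical helper: impute with the stored training medians."""
    df = df.copy()
    for col, median in nas.items():
        df[col + "_na"] = df[col].isnull()   # indicator only for columns in nas
        df[col] = df[col].fillna(median)     # reuse the training median
    # Columns missing only in this frame: fall back to their own median and
    # add no indicator, so train and test keep the same set of columns.
    for col in df.columns:
        if df[col].isnull().any():
            df[col] = df[col].fillna(df[col].median())
    return df

train_imp = impute(train, nas)
test_imp = impute(test, nas)
```

Here age in the test set is filled with the training median, while height, which is missing only in the test set, is filled with the test median and gets no height_na column.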

Hope this helps.


(Sumit) #666

Hi,

I have followed the Normal Install Instructions from here.

After that, I don’t know what to do. Can anyone help?

Thanks,
Sumit


(Utkarsh Mishra) #667

Thanks man :)
That helped.


#668

@asutosh97 Thank you for replying to my question. I could not find the pseudocode at the link you provided. Maybe you meant a different link?


(Asutosh) #669

@yang-zhang
I’ve updated the link, try now.