As Jeremy mentioned, it’s a good idea to keep a lab notebook about the things you try when training models. Here is an example of me playing around with California housing data using the same random forest regressor we used today in class. For comparison, I also include the results of gradient boosting trees. Please note how easy it is to try out different models on the same data. Also note that I show how to split a data set into training and testing samples.
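The workflow above can be sketched in a few lines. This is a minimal sketch, not the original notebook: I use a synthetic regression dataset as a stand-in, since `fetch_california_housing` requires a download. The point is the same, though — swapping models on one train/test split is just two `fit`/`predict` calls.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic stand-in for the housing data (fetch_california_housing needs a download)
X, y = make_regression(n_samples=2000, n_features=8, noise=10.0, random_state=42)

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Trying a different model on the same split is just another fit/predict
scores = {}
for model in (RandomForestRegressor(n_estimators=100, random_state=42),
              GradientBoostingRegressor(random_state=42)):
    model.fit(X_train, y_train)
    scores[type(model).__name__] = r2_score(y_test, model.predict(X_test))
print(scores)
```

Both models report an R² score on the held-out 20%, so the comparison is apples to apples.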
Please note that the random forest did a pretty good job out of the box without me thinking about the model or doing any massaging of the data.
This is the standard scikit-learn version. My version is in Java, but I’ll be porting it to Python (which will mean I’m slower than scikit-learn, as theirs is C underneath).
BTW, @jeremy, my earlier experiment was all screwed up because I mistyped max_leaf_nodes instead of min leaf node size or whatever it is. doh! I think I also screwed up my interpretation of oob_score_, thinking it was the error and not the score. haha.
Wow, all the steps look super handy and intuitive… I thought the process would be much harder, but now it looks pretty easy to try out different models.
Yep, as @jeremy says, machine learning doesn’t have to be hard and scikit-learn makes it much simpler than earlier days. You just have to learn what to care about.
One thing to note @parrt is that gradient boosting trees are much more sensitive to hyperparameters than RFs. Generally, a grid search over hyperparameters (or careful manual tuning) is necessary to get good performance. Here are tips from former top-ranked Kaggler Owen Zhang:
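A grid search for gradient boosting can look like this minimal sketch. The grid values here are illustrative defaults I picked for the example, not Owen Zhang's recommendations; real searches usually cover wider ranges and more parameters (e.g. `subsample`).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search runs quickly
X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=0)

# A deliberately tiny grid: 2 x 2 x 2 = 8 candidates, each cross-validated
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [100, 200],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

`best_params_` then gives you the winning combination, and you refit on the full training set with those values.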
One thing to note is that this dataset only contains continuous variables, and no missing values, so @parrt was able to skip the steps where we handled these two issues.
They’re not hard either, mind you - just 3 lines of code!
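For anyone curious what those few lines amount to: here is a hedged plain-pandas sketch of the two steps (it is not the course library's train_cats/proc_df code, just the same idea — the column names and fill strategy are my own for illustration).

```python
import numpy as np
import pandas as pd

# Toy frame with one string column and one numeric column with a missing value
df = pd.DataFrame({"zone": ["a", "b", None, "a"],
                   "sqft": [800.0, np.nan, 1200.0, 950.0]})

# Step 1: strings -> pandas category dtype (this is what train_cats does)
df["zone"] = df["zone"].astype("category")
# Step 2: categories -> integer codes; missing values become -1
df["zone"] = df["zone"].cat.codes
# Step 3: fill numeric NaNs, e.g. with the column median
df["sqft"] = df["sqft"].fillna(df["sqft"].median())
print(df)
```

After this every column is numeric, which is all a tree-based model needs.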
Could anyone explain how to use train_cats()? I keep getting the following error when trying to use this function.
AttributeError: ‘NoneType’ object has no attribute ‘items’
Thoughts on pd.get_dummies()? My Jupyter notebook crashes whenever I try to create dummy variables with this function on features that have many, many categories.
Generally speaking you shouldn’t need to use get_dummies() with tree-based methods - and almost certainly not if there are lots of categories. (As Terence mentioned, there aren’t any categorical vars in this dataset anyway).
Sorry, I should have clarified that my question pertains to a different data set that does have categorical variables. Regardless, train_cats() wasn’t working for me and neither was get_dummies() because there were too many “categories” (such as songID) within a couple features – get_dummies() would crash on those. I’m a little stuck on how to move forward. Any suggestions?
You manually set the fields to category type, so no need for you to call train_cats. train_cats changes string types to category types, and assumes that you don’t already have category types.
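To make the get_dummies alternative concrete, here is a minimal sketch (my own toy column, not the original dataset): a high-cardinality field like songID gets one dummy column per unique value, which is what blows up memory, whereas integer category codes stay a single column, and tree-based models handle them fine.

```python
import pandas as pd

# A high-cardinality ID column; get_dummies would create one column
# per unique value here, which is what crashes the notebook
df = pd.DataFrame({"songID": ["s9", "s2", "s9", "s5"],
                   "plays": [3, 1, 7, 2]})

# One integer column instead, no matter how many unique IDs there are
df["songID"] = df["songID"].astype("category").cat.codes
print(df)
```

Note that codes are assigned by the sorted category order, so the integers are arbitrary labels — fine for trees, but not something a linear model should consume directly.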