As Jeremy mentioned, it’s a good idea to keep a lab notebook of the things you try when training models on data. Here is an example of me playing around with California housing data using the same random forest regressor we used today in class. For comparison, I also show the results of gradient boosting trees. Please note how easy it is to try out different models on the same data. Also note that I show how to split a data set into training and testing samples.
Please note that the random forest did a pretty good job out of the box without me thinking about the model or doing any massaging of the data.
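For anyone reading without the notebook open, here is a minimal sketch of that workflow: split into training and testing samples, then fit both model types on the same split. It is not the notebook itself. The data is a synthetic stand-in from make_regression (an assumption, so that it runs without downloading anything), and the settings are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); swap in your own X, y.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Hold out 20% of the rows as the testing sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Same data, two different models: only the constructor changes.
models = {
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out rows
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

Because every scikit-learn regressor exposes the same fit/score interface, trying another model is just one more entry in the dictionary.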
Are you using your own faster implementation of RF that you mentioned in class before? If so, can we use it?
This is the standard scikit-learn version. My version is in Java but I’ll be porting it to Python (which will mean mine is slower than scikit’s, as theirs is C underneath).
BTW, @jeremy, my earlier experiment was all screwed up because I mistyped max_leaf_nodes instead of min leaf node size or whatever it is. doh! I think I also screwed up the interpretation of oob_score_, thinking that was the error and not the score. haha.
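To make the two mix-ups concrete, here is a small sketch on synthetic stand-in data (an assumption, not the original experiment). In scikit-learn, min_samples_leaf sets the minimum number of samples per leaf, while max_leaf_nodes caps the number of leaves in each tree. For a regressor, oob_score_ is the R^2 of the out-of-bag predictions, so higher is better.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=1)

rf = RandomForestRegressor(
    n_estimators=100,
    min_samples_leaf=3,  # minimum samples per leaf; NOT the same as max_leaf_nodes
    oob_score=True,      # compute the out-of-bag score during fit
    random_state=1,
)
rf.fit(X, y)

# oob_score_ is an R^2 score (1.0 is perfect), not an error.
oob_r2 = rf.oob_score_
print(f"OOB R^2: {oob_r2:.3f}")

# max_leaf_nodes was left at its default, so the number of leaves is uncapped.
print("max_leaf_nodes:", rf.get_params()["max_leaf_nodes"])
```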
Wow all the steps look super handy and intuitive… I thought the process should be much harder but now it looks pretty easy to implement different models.
Yep, as @jeremy says, machine learning doesn’t have to be hard and scikit-learn makes it much simpler than earlier days. You just have to learn what to care about.
One thing to note @parrt is that gradient boosting trees are much more sensitive to hyperparameters than RFs. Generally, a grid search over hyperparams, or careful manual tuning, is necessary to get good performance. Here are tips from former top-ranked Kaggler Owen Zhang:
And here’s some more great tuning tips: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
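As a rough sketch of what a grid search over gradient boosting hyperparameters can look like with scikit-learn's GridSearchCV: the grid values and the synthetic data are assumptions chosen to keep the run short. They are not recommended settings and are not taken from the tips above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

# Illustrative grid only (assumption): 2 x 2 x 2 = 8 combinations.
param_grid = {
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

# 3-fold cross-validation for each combination; the default scorer is R^2.
search = GridSearchCV(GradientBoostingRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best CV R^2: {search.best_score_:.3f}")
```

The cost grows multiplicatively with each hyperparameter added to the grid, which is part of why boosting takes more effort to tune than a random forest.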
One thing to note is that this dataset only contains continuous variables, and no missing values, so @parrt was able to skip the steps where we handled these two issues.
They’re not hard either, mind you - just 3 lines of code!
Actually this is probably an even better slide from Owen:
And here are 2 examples of grid search:
BTW I believe that @yinterian will be covering quite a bit of gradient boosting in part 2 of the course.
Could anyone explain how to use train_cats()? I keep getting the following error when trying to use this function.
AttributeError: ‘NoneType’ object has no attribute ‘items’
Thoughts on pd.get_dummies()? My Jupyter notebook crashes whenever I try to create dummy variables using this function on features that have a very large number of categories.
For those who want to learn more about XGBoost and its capabilities, see https://arxiv.org/abs/1603.02754. @jeremy What are your thoughts on CatBoost vs XGBoost?
train_cats() is only needed with categorical values and the California data has none, which also means you don’t need get_dummies().
Generally speaking you shouldn’t need to use get_dummies() with tree-based methods - and almost certainly not if there are lots of categories. (As Terence mentioned, there aren’t any categorical vars in this dataset anyway).
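To illustrate why get_dummies() gets expensive with lots of categories, here is a small sketch comparing it with plain integer codes. The songID values are made up for the example (an assumption).

```python
import pandas as pd

# Hypothetical high-cardinality feature: 1000 distinct IDs (assumption).
song_ids = pd.Series([f"song_{i}" for i in range(1000)], name="songID")

# One-hot encoding creates one column per distinct category.
dummies = pd.get_dummies(song_ids)
print(dummies.shape)  # (1000, 1000)

# Integer codes keep it to a single column, which trees can split on directly.
codes = song_ids.astype("category").cat.codes
print(codes.shape)  # (1000,)
```

With millions of rows and many thousands of distinct IDs, the one-hot frame can easily outgrow memory, which may be what is crashing the notebook.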
Sorry, I should have clarified that my question pertains to a different data set that does have categorical variables. Regardless, train_cats() wasn’t working for me and neither was get_dummies() because there were too many “categories” (such as songID) within a couple features – get_dummies() would crash on those. I’m a little stuck on how to move forward. Any suggestions?
@kkibrahim if you could share your notebook, we can take a look. Not really possible to debug any problem without seeing all the steps that lead to it. FYI there’s a great “gist it” extension that makes it easy to share your work as a public gist: http://jupyter-contrib-nbextensions.readthedocs.io/en/latest/nbextensions/gist_it/readme.html
Thanks for taking a look! My code is here: https://github.com/khouryibrahim/kaggle/blob/master/recs3.ipynb
In is where the issue starts.
You manually set the fields to category type, so no need for you to call train_cats. train_cats changes string types to category types, and assumes that you don’t already have category types.
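A rough plain-pandas approximation of that conversion, for illustration only: this is not the fastai source, and the column names and values are made up (assumptions).

```python
import pandas as pd

# Hypothetical frame with one string column and one numeric column (assumption).
df = pd.DataFrame({"gender": ["female", "male", "female"], "age": [31, 45, 27]})

# Convert string columns to ordered categoricals; leave everything else alone.
for name in df.columns:
    if pd.api.types.is_string_dtype(df[name]):
        df[name] = df[name].astype("category").cat.as_ordered()

print(df.dtypes)
print(list(df["gender"].cat.categories))  # ['female', 'male']
```

Columns that are already of category type are not string columns, so a loop like this skips them.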
Well, for example, I still get this error when running RandomForestRegressor:
ValueError: could not convert string to float: female
when gender is one of the features I set as a category. I’ve tried various approaches and can’t seem to get RandomForestRegressor to run.
That means you haven’t run proc_df, which replaces the category fields with their integer codes.
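Here is a sketch of that replacement step in plain pandas, as an approximation of the idea rather than the actual proc_df code. The frame, the "plays" target column and the +1 offset are assumptions for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data with a category-typed feature (assumption).
df = pd.DataFrame({
    "gender": pd.Categorical(["female", "male", "female", "male"]),
    "age": [31, 45, 27, 52],
    "plays": [10.0, 3.0, 12.0, 1.0],  # hypothetical target column
})

X = df.drop(columns="plays")
y = df["plays"]

# Replace each category column with its integer codes.
# Adding 1 moves the missing-value code (-1) to 0.
for name in X.columns:
    if isinstance(X[name].dtype, pd.CategoricalDtype):
        X[name] = X[name].cat.codes + 1

# All columns are numeric now, so the forest fits without the
# "could not convert string to float" error.
rf = RandomForestRegressor(n_estimators=10, random_state=0)
rf.fit(X, y)
preds = rf.predict(X)
print(X["gender"].tolist())
```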