Wiki / Lesson Thread: Lesson 4

This is a forum wiki thread, so you all can edit this post to add/change/organize info to help make it better! To edit, click on the little pencil icon at the bottom of this post.

<<< Wiki: Lesson 3 | Wiki: Lesson 5 >>>

Lesson resources

Lesson notes from @melissa.fabros:

Introduction to hyperparameters:

Q: Can you summarize how adjusting the different hyperparameters in a Random Forest (RF) can affect overfitting, collinearity, etc.?
Sounds like a great question born from experience. We’ll look at the lesson 1 notebook with the bulldozer data and analysis to review which RF hyperparameters might be interesting.

One Hyperparameter example: Adjusting sample size of the training data

One benefit of taking a smaller sample set: set_rf_samples(20000) from the fastai library changes how many rows are given to each tree. For each new tree, we either bootstrap a sample (draw rows with replacement from the whole dataset, so each tree sees only a random subset of the distinct rows) or pull out a smaller random subset (in this example, 20k rows). When you use set_rf_samples(20000) and grow a tree from there (assuming you grow the tree until each leaf holds a single row), you’ll have a tree depth of about log2(20000) and 20k leaf nodes. When you decrease the sample size, there are fewer decisions for the tree to make.

A second benefit is that you can run your RF models faster and iterate faster because there is less data to process. Generally if you have over a million rows, you’d definitely want to use a subsample. Under 500,000 rows/observations, you’re probably OK with oob_score, and you’ll have to find what’s ok for you and your hardware setup for anything in-between.

The tradeoff between smaller vs bigger sample sizes:
If you use a smaller sample via set_rf_samples(), you’ll overfit less. When you build a model with bagging, you’re aiming for each tree/estimator to be as strong as possible while keeping the correlation across trees as low as possible, ideally close to 0. With a smaller sample you have less accuracy per tree/estimator, but the correlation between the trees will also be lower, and your RF model can generalize (make a prediction on new data) better. A similar trade-off is deciding how big or small to make your validation set; there is no single right answer.
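To see why low correlation between trees matters, here is a minimal simulation sketch. It is not from the lesson: averaged_error_spread is a made-up helper, and it assumes each tree’s error is an independent draw centered on 0, which real trees only approximate.

```python
import random
import statistics


def averaged_error_spread(n_trees, n_trials=2000, seed=0):
    """Spread (std dev) of the averaged error when n_trees uncorrelated errors are averaged."""
    rng = random.Random(seed)
    averaged = []
    for _ in range(n_trials):
        # Assumption: every tree's error is independent, mean 0, std dev 1.
        errors = [rng.gauss(0.0, 1.0) for _ in range(n_trees)]
        averaged.append(sum(errors) / n_trees)
    return statistics.pstdev(averaged)


single_tree = averaged_error_spread(1)
forest = averaged_error_spread(40)
print(round(single_tree, 2), round(forest, 2))
```

The more correlated the trees are, the less of this shrinkage you get, which is why a forest of slightly weaker but more varied trees can still generalize better.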

Q: What happens if I don’t use set_rf_samples (aka don’t define the size of my subset)?
The default will bootstrap a sample as big as the original one.
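A small sketch of what a bootstrap sample “as big as the original” looks like, using only the standard library. bootstrap_split is a hypothetical helper, not fastai or sklearn code. Because rows are drawn with replacement, only about 63% of the distinct rows end up in the sample; the rest are the out-of-bag rows for that tree.

```python
import math
import random


def bootstrap_split(n_rows, seed=0):
    """Draw n_rows row indices with replacement; return (in-bag set, out-of-bag set)."""
    rng = random.Random(seed)
    in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
    out_of_bag = set(range(n_rows)) - in_bag
    return in_bag, out_of_bag


in_bag, out_of_bag = bootstrap_split(20000)
unique_fraction = len(in_bag) / 20000
# Compare the simulated fraction with the theoretical limit 1 - 1/e.
print(round(unique_fraction, 3), round(1 - 1 / math.e, 3))
```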

Q: Does setting oob_score in an RF model change how it uses the sample data?
oob_score has no effect on training; it only provides a useful metric. One thing to watch out for is combining a very small set_rf_samples value with oob_score=True. If the sample size is too small, the oob error may be huge because so few data points are involved. The remaining data points may all be outliers, and there isn’t any more data to even out the error. Say you have set_rf_samples(5) and your RFR trains on the values [2,3,4]. The oob sample only has the values [1,5] left, and there you have an error of 4.

Diving into RandomForestRegressor parameters (aka Let’s review what’s actually happening under the hood with all these RF parameters):

Here’s an example of how we instantiate a RF model in Python:
from sklearn.ensemble import RandomForestRegressor
rf_model = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)


n_estimators: number of trees that your forest will have (more information on this was covered in lesson 4)
n_jobs: how many cores will your forest use (-1 means all the cores available, more notes are in lesson 4)

oob_score: (aka out-of-bag score) For each tree, the rows that were not drawn into that tree’s sample are its out-of-bag rows. Each row is predicted using only the trees that never saw it, and the score calculated on those predictions becomes the oob_score. If you use RandomForestRegressor(oob_score=True) you’ll get a quasi-validation set for free; this can be useful if you don’t have a validation set and your dataset is under 500,000 rows.

min_samples_leaf: by increasing min_samples_leaf, you control how deep the tree will go before stopping. min_samples_leaf=1 means the tree keeps splitting until every row of our 20k sample sits in its own leaf: the tree will have 20k leaves and will have made a decision on every single row in the sample. We can have less depth if we set min_samples_leaf=3: the tree stops splitting a branch when a further split would leave a leaf with fewer than 3 rows. Each leaf’s prediction is then an average over at least 3 rows, which makes it more stable. Each estimator will be less predictive on its own; on the other hand, the estimators are less correlated, and your model will generalize to new data better.

By increasing min_samples_leaf, your model will also run faster because there are fewer decisions for the tree to make. For a roughly balanced tree, min_samples_leaf=1 gives a depth of about log2(num_samples) and num_samples leaves; min_samples_leaf=2 gives a depth of about log2(num_samples) - 1; min_samples_leaf=4 gives about log2(num_samples) - 2. Each level of the tree has twice as many nodes as the level above it, so every level you remove cuts roughly half of the remaining split decisions.
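A quick arithmetic sketch of tree depth and leaf counts for the 20k-row sample. approx_depth and approx_leaves are made-up helpers, and they assume a perfectly balanced binary tree; real trees are rarely balanced, so treat the numbers as rough figures.

```python
import math


def approx_depth(n_samples, min_samples_leaf=1):
    """Rough depth of a balanced binary tree whose leaves hold min_samples_leaf rows."""
    return math.log2(n_samples / min_samples_leaf)


def approx_leaves(n_samples, min_samples_leaf=1):
    """Rough number of leaves under the same balanced-tree assumption."""
    return n_samples // min_samples_leaf


for leaf_size in (1, 2, 4):
    print(leaf_size, round(approx_depth(20000, leaf_size), 2), approx_leaves(20000, leaf_size))
```

Doubling the leaf size removes about one level of depth and halves the number of leaves.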

:writing_hand: Knowing how to deal with logarithmic calculations (you might hear “log” or “take the log of”) comes up all the time in machine learning. They provide an easier mental shorthand for handling very large/small numbers, as in the above example :writing_hand:

In RFR, when we talk about “information”, we mean “information gain” (what we’re trying to optimize in the tree: whatever loss/scoring you’re using, it is the amount your score improves as you move from one split to the next). You’ll get the most information gain from the earlier splits and smaller gains near the end. You can stop the tree from growing before each leaf holds a single row because those last decisions may not in fact gain you any more useful information. If you feel that your RF is over-fitting or the model is slow, you can try min_samples_leaf values along the lines of: 1, 3, 5, 10, 25, 100

max_features: This controls how many features are exposed at each split. If we set max_features, we are choosing a random sample of the columns (features) of the data by setting the % of features the tree looks at while it’s making a decision. At each branch split, the tree selects from a different random subset of features of the size you’ve chosen.
You want the features that the tree has available to choose from to be as rich as possible.

If you have a small number of trees and you show them all the same variables at every split (e.g. max_features=1.0, meaning 100% of the features; in sklearn the integer 1 would mean a single feature), the trees don’t get much variety in which features they get to split on. By limiting max_features, each tree is less accurate, but the trees’ final decisions will be more varied.

Imagine a single feature is so predictive that, with all variables available at each split, all the trees end up similar (e.g., they all make the same initial splits). With different subsets of features, the trees will try other interesting splits, and you are better able to capture important features other than the very predictive one. With full variable selection, even adding more trees won’t improve the accuracy score much because they’re all making the same initial split. If we set max_features to sqrt or log2, we get much more improvement from adding more trees. For tuning max_features, the values None, sqrt, and 0.5 seem to work well.
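As a rough sketch of what max_features means at a single split: the helpers below (n_candidate_features and candidate_features, both made-up names) follow the rules in the scikit-learn documentation for None, "sqrt", "log2", integers, and fractions. They are an illustration, not sklearn’s actual code.

```python
import math
import random


def n_candidate_features(max_features, n_features):
    """How many features are considered at each split for a given max_features setting."""
    if max_features is None:
        return n_features                      # all features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, int):
        return max_features                    # an int is an absolute count
    return max(1, int(max_features * n_features))  # a float is a fraction


def candidate_features(columns, max_features, rng):
    """Pick the random subset of columns one split is allowed to choose from."""
    k = n_candidate_features(max_features, len(columns))
    return rng.sample(columns, k)


columns = ["year_made", "year_sold", "sales_elapsed", "enclosure"]
rng = random.Random(0)
print(candidate_features(columns, 0.5, rng))  # a fresh subset is drawn for every split
```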

RFR bootstraps by default, so each tree sees about 63% of the distinct rows on average. Note that this 0.63 describes row sampling; it is not a preset value of max_features.

Q: What might happen if you set max_features to 100% and pass all the data to a single tree?
If you had an RFR with 1 tree, a minimum leaf size of 1, and all the samples and features in use from the start, that translates into Python as:
rf_model = RandomForestRegressor(n_estimators=1, min_samples_leaf=1, max_features=1.0, n_jobs=-1, oob_score=True)

If the tree sees all the data in a single analysis, the oob sample would be empty, and you wouldn’t have an oob_score. Note that with bootstrapping on (the sklearn default), even a single tree leaves some rows out of its sample.

Hyperparameter and RFR parameter tuning

When you’re doing any type of machine learning, you need to work out whether adjusting hyperparameters or your model’s parameters will improve your model.

Parameters are what the model itself consists of, the actual values that a model needs: with random forests, that may be 1 tree or multiple trees, the number of leaf nodes, how many splits, etc.

Adjusting hyperparameters means making decisions on how we use the model:
Do we use 1 big tree or several small trees? How did we sample the data to create a sample set or validation set: was the sample or validation set too big/small? Do our sample and validation sets have enough of the relevant features of the training data?

We’re getting to the part of data science where you have to bring your creativity and experience to strategize about which model types to use, and how many, in order to analyze large datasets and negotiate their problems.

If you want to test all your skills, try the Kaggle Favorita grocery competition. In the grocery competition, it is simply not enough to make a prediction on the given data. The dataset has high cardinality; in other words, its categorical variables have lots of different values. There are a lot of details to get right about where you join data and how you determine which features are independent or dependent. You also have to be creative about how your model processes all the ancillary data and whether or not it would be strategic to add even more data. And even if all your programming skills are sound, your model may still underperform and you have to diagnose why.

Q: What’s an example of a sign that can tell me that I need to adjust a parameter or hyperparameter?
This depends on the type of problem and there’s not one definite answer. Only experience with lots of datasets and types of data will give you the intuition on what to change.

But here is one example: if your oob_score is fine, yet your validation score goes down, you may need to rethink how you built your validation set from the training data. Your validation set isn’t giving you good feedback on how your model works because it isn’t representative of your training data. The features and trends you explored for the validation set weren’t actually trends in the rest of the training set.

If, on the other hand, both your oob_score and your validation set score got worse, the first diagnosis could be that the model is over-fitting. To test this, create a second, random validation set; if that score also gets worse, you are indeed over-fitting and need to rethink some of your model’s parameters.

Go forth and build Random Forest models…

The goal of the class for Jeremy is to teach us enough to go forth and build trees on all kinds of other datasets.

Building the best RFR you can in summary means:

  1. Change all the variable data to numerics that are as granular as possible (for example, take a timestamp and separate out year, month, day, hour, minute, anything you can)
  2. Change categorical data to numbers (aka dummy encoding or one-hot encoding)
  3. Change NaN or empty values to booleans (a flag marking which values were missing)
  4. Use a forest and not one tree
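Step 1 above can be sketched with the standard library alone. expand_timestamp and the "sale" column prefix are hypothetical names chosen for illustration; this is not the fastai helper used in the lesson notebooks.

```python
from datetime import datetime


def expand_timestamp(ts, prefix="sale"):
    """Split one timestamp into several granular numeric features."""
    return {
        f"{prefix}_year": ts.year,
        f"{prefix}_month": ts.month,
        f"{prefix}_day": ts.day,
        f"{prefix}_dayofweek": ts.weekday(),  # Monday = 0
        f"{prefix}_hour": ts.hour,
        f"{prefix}_minute": ts.minute,
    }


parts = expand_timestamp(datetime(2011, 11, 16, 9, 30))
print(parts)
```

Each extracted part becomes its own column, so the forest can split on, say, the day of the week without having to discover it from a raw timestamp.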

… However, Interpretation of the Model is as Important as the Prediction

Finding Feature Importances:

Start with every possible feature that could possibly be interesting and then winnow the features down later with feature importance analysis. For example, split out all the timestamp data. We don’t expect throwing out features to improve the accuracy of the model at first. Feature importance tells us where to spend our time in analysis.

You’ll want to use the RF to look at feature importance rather than linear regression. Although some people use linear regression (LR) coefficients to find important features, they are not nearly as good as RF importances. For LR to work right, you need a perfectly prepared dataset (if you forget to normalize/scale a feature, etc., LR will fail you). RF is much more tolerant of different data types and scales. For example, RF can still handle categorical data, and LR can’t.

After running feature importance on your RF model, you can plot the results as a bar chart.
We can then remove lower-impact features/columns; features with values of 0.05 and lower are likely not so important. Go ahead and drop these features and then retrain the model. You’ll have a simpler model after dropping features but should have the same or better accuracy on your data. You can also rerun feature importance: features that had similar importance values at first should now show starker differences in the plot.
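The winnowing step can be sketched as a simple filter. The importance values below are invented for illustration; with scikit-learn they would come from the fitted model’s feature_importances_ attribute. keep_important is a made-up helper, and the 0.05 cutoff is the one quoted above.

```python
def keep_important(importances, threshold=0.05):
    """Return feature names whose importance is above the threshold, most important first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Invented importances, not results from the bulldozer model.
importances = {
    "year_made": 0.45,
    "sales_elapsed": 0.30,
    "enclosure": 0.21,
    "year_sold": 0.04,
}
to_keep = keep_important(importances)
print(to_keep)
```

After filtering, you would retrain on only the kept columns and check that the score holds up.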

Finding Feature Interaction:

However, you also must look at all variables together; otherwise you’ll miss interactions between features. For example, in the bulldozer sales data, year_made and year_sold probably have a strong interaction because together they tell you how old the equipment is. You also can’t look at features one at a time, as they can be correlated, and every feature will look important on its own.

Encoding data for better interpretations

RFR can handle categorical data fine. Unlike linear regression models, RFR will still work if you leave the original categories as plain codes. But we’d like to look, with more nuance, at how often a feature (for our purposes, bulldozer ‘enclosure’) is associated with another feature.
We’ll use “one-hot encoding”, aka “dummies encoding”, aka “replace the named categories with 1s and 0s across separate columns”.

proc_df(max_n_cat=7) from the fastai library will do a lot of the heavy lifting of changing the string labels/categories to numbers. max_n_cat=7 tells the function to check the dataframe for categorical features that have up to 7 different values. If a category’s cardinality is more than 7, proc_df() will switch out each string label for an integer. In other words, the categorical label will be replaced by an ordinal number. If the category has 7 or fewer levels, proc_df will change the category to one-hot encoding.
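The rule described above can be sketched in plain Python. encode_column is a hypothetical helper that mirrors the description, not proc_df’s actual implementation, and it assumes “up to 7” includes exactly 7 levels.

```python
def encode_column(values, max_n_cat=7):
    """One-hot encode low-cardinality columns; integer-code high-cardinality ones."""
    categories = sorted(set(values))
    if len(categories) <= max_n_cat:
        # One new 0/1 column per category level.
        return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories}
    # Too many levels: replace each label with an integer code instead.
    codes = {c: i + 1 for i, c in enumerate(categories)}
    return {"code": [codes[v] for v in values]}


low_cardinality = encode_column(["a", "b", "a"], max_n_cat=7)
high_cardinality = encode_column(["a", "b", "c"], max_n_cat=2)
print(low_cardinality)
print(high_cardinality)
```

Note that the set of one-hot columns depends on which levels are present in the data, which is why a train set and a test set encoded separately can end up with different columns.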

One-hot encoding can add lots and lots and lots more columns to your dataset. One-hot encoding may not improve accuracy, but it can help with interpretability of the model’s important features. You can see which features are often associated with another technical feature.

Intro to Spearman Rank Correlation (aka Using Cluster Analysis to Identify Redundant Features)

You can do cluster analysis on features with hierarchical clustering techniques. You can use the correlation coefficient between 2 variables as a distance measure: the bigger the distance, the less correlated the variables; the smaller the distance, the more correlated. Spearman’s coefficient is a rank correlation, and this technique groups features that are likely measuring the same thing.

Running a Spearman’s coefficient analysis allows you to look for features that you can remove without making the oob_score worse. Once you identify possibly redundant variables, you can experiment with dropping them. A simpler model with less noisy data is better, so you can drop columns that don’t affect modeling. Once you have built your relevant feature set and winnowed down the dataset, you can compare the new model’s performance (accuracy measurements such as oob, RMSE, etc.) against the model trained on the full unedited dataset. Accuracy should be about the same or even better.
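For intuition about what a rank correlation measures, here is a standard-library sketch. average_ranks, pearson, and spearman are made-up helper names, and the two toy columns are invented; in practice you would use a library routine on the real dataframe.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Ordinary (linear) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(xs, ys):
    """Spearman's coefficient: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]  # rises with x, but not in a straight line
print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Because only the ordering matters, two features that rise and fall together score as strongly related even when the relationship is not linear, which is what you want when hunting for redundant columns.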

Now we’ve identified features we can easily get rid of and you can tell a client right away which features of their data are extraneous to answering their business objectives.

Understanding “Partial dependence” for important features

Now we understand the data better using the RFR model. We’ve found the dataset’s most important and least important features, and we’ve identified which features are redundant.
We now generally know what are the most important drivers behind our Bulldozer pricing model’s prediction.

But is that enough? An important driver might only be found as an interaction of two features. For example, year_made and sales_elapsed (the date of the auction sale) reveals how old the bulldozer is. We don’t have a bulldozer_age feature, and we can only derive this interaction from the two features. Intuitively, we understand that newer cars usually sell for more than older cars; the same is likely true of bulldozers. But how do we test this?

A partial dependence plot (PDP) is our tool of choice for this task. The goal is to isolate the two features so we can see how strongly their interaction influences the prediction. Our first step is to look at feature importance to get an overview of the most important features. Then running cluster analysis again (after we have dropped the redundant features) can reveal which features remain in clustered groups, which is a place to start investigating interaction effects.

In our case, our feature importance and cluster analysis tell us year_made and sales_elapsed are both important, but what precisely is that relationship? To find the partial dependence of sales_price on year_made, we take a smaller randomly selected dataset for plotting (maybe 100 or so random rows). For each row in our sample, you keep the value of every feature the same except for year_made. Next you replace the real year_made value with every year in range(1950,2001) in turn. For every replacement, you run your trained RFR on the modified row and record the model’s price prediction. Then you calculate the mean of all the predicted prices for each year. Plotting the predictions and the mean will show a rise in the influence of year_made on sales_price. So, PDP can really show what drives price and helps you understand underlying business trends.
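The procedure just described can be sketched in a few lines. partial_dependence is a made-up helper, toy_predict is a hypothetical stand-in for a trained RFR’s prediction on one row, and the "hours" column and all numbers are invented for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """Mean prediction for each grid value, with every other feature left unchanged."""
    curve = {}
    for value in grid:
        # Overwrite only the chosen feature in a copy of each sampled row.
        preds = [predict({**row, feature: value}) for row in rows]
        curve[value] = sum(preds) / len(preds)
    return curve


def toy_predict(row):
    """Hypothetical stand-in for a trained model's price prediction."""
    return 500 * (row["year_made"] - 1950) - 2 * row["hours"]


rows = [
    {"year_made": 1990, "hours": 100},
    {"year_made": 1975, "hours": 300},
]
curve = partial_dependence(toy_predict, rows, "year_made", range(1950, 2001))
print(curve[1950], curve[2000])
```

Plotting curve against the grid gives the mean line of the PDP; keeping the per-row predictions instead of only their mean gives the individual lines behind it.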

PDP is not intuitive, and it can be hard to grasp or articulate its nuances. Some long-time data scientists don’t believe in their hearts that PDP is valuable. But it provides insight into subtleties of your data. To help with understanding PDP, a good discussion of partial dependence is happening here.

Ultimately, you need to understand how the data works so you can take an action either to promote or prevent the prediction. For example, Jeremy worked on a case where he developed a model that predicted which cell phone users were going to leave the service. One of the top 5 predictors was the length of time that customers spent with customer service. Instead of firing the support people to save costs, the telecom company understood that spending money on customer support actually made the company money. Predicting who will drop the service is less valuable than understanding why they left.

RFR models are both for predictions and for understanding relationships in data because these are the actual levers that you can pull to change how you do business or change consumer behavior.

(hat tip to Terence Parr, Tim Lee and Christopher Csiszar for sharing notes and insights! Thanks!)


I’ve just added the lesson video to the top post

Hi Jeremy,
When we use the function set_rf_samples(20000), is it going to use the bootstrap procedure to grab a 20000-row sub-sample from the whole dataset, or just randomly pick 20000 rows from the dataset without replacement?

I believe it is bootstrap, since it calls the same code as the regular RF sampling does.

Thanks for the clarification!

@jeremy Can you please help turn this into a wiki thread? I just remembered that there are another few wiki posts under this category which need to be turned into wiki threads. Kindly help on this. Thanks.


@jeremy I have a question regarding the proc_df max_n_categories parameter introduced in this lesson for one-hot encoding:
It is clear what it does and how it works in the context of the lesson, but using it in practice gives me headaches in e.g. kaggle competitions. Calling proc_df on a testset with this option enabled sometimes leads to more/less columns created as compared to the traindf, depending on the presence of attributes in the testset. This leads to 'Number of features of the model must match the input.' errors.

Maybe I have missed something, but there seem to be methods for using proc_df for the nas (na_dict) and for the scaling (mapper), is there something similar for the one-hot-encoded columns that need to be added to the testset later too?


Continuing with @miwojc’s excellent project – leveraging Kaggle’s resources to make the Introduction to Machine Learning for Coders course available to those who don’t have access to a GPU – I’ve put @jeremy’s Lesson 4 and Lesson 5 notebooks into Kaggle kernels that anyone can run and play with.

Lesson 4 notebook

Lesson 5 notebook


thanks @jcatanza
links to lessons 1-3 are here: :slight_smile:

Hi @miwojc you are welcome to update that post by linking the Kaggle kernels for Lesson 4 and Lesson 5!

nice one. done! it’s now complete. thanks!


Using Jupyter on Crescent…

Getting an error including

from pdpbox import pdp
Can someone tell me what’s wrong…

ModuleNotFoundError: No module named 'pdpbox'

Try: pip install pdpbox
Run it before that code; I think pdpbox is not installed in your Jupyter environment.

Is the above quote wrong??
Even if there is one estimator, that tree will take only part of the data and still there will be some samples which are left unused. So, oob sample will not be empty

One thing is confusing to me. After doing one-hot encoding, the feature importance changes. However, the dendrogram uses df_keep, which I guess holds the feature importance columns from before one-hot encoding. Is this a mistake, or was it done intentionally?

Can anybody explain why we One Hot encode?
As Jeremy said in the class that we want to split categorical variables into individual columns, but wouldn’t splitting into columns create more columns with categories?
e.g Is_High has cardinality 2

According to my understanding, one-hot encoding depends on your problem. That means you need to understand whether one-hot encoding makes sense.
As for your question, you need to tell the model that the categories are different from each other. How are you doing that?
You are telling it that each category is a number, for example (good=1, better=2, best=3). You are telling it that the difference between good and better is 1, and the same for better and best. Is that the case? Depends on your problem…
However, take other categories (apple=1, orange=2, lemon=3). If you encode them as codes, you are also telling the model that the difference between apple and orange is one. Actually it is not.
Basically one-hot encoding can help here: you can say apple(1,0,0), orange(0,1,0), lemon(0,0,1). Now you are telling your model they are different things, and the artificial numerical difference goes away. Therefore, you need to understand which model you are trying and whether one-hot encoding makes sense for that model. More columns is not a problem. The problem is unnecessary columns. You have to understand which columns are necessary. In the case of random forest you can use feature importance (the same way Jeremy shows).


If anyone is still working on this, I have a question about the impact of OHE.

For me, OHE with max cardinality of 7, reduces the validation score by around 1%.
I am curious why this is happening… and if it’s a case of now re-adjusting some hyper parameters.

Does anyone have an insight into why OHE would make the model worse?

It seems intuitive to me that it would make the model better, by allowing it to select important specific categories with fewer decisions in the tree, making them richer.

My only theory is it is interplay between OHE and the max features. That by doing OHE we are allowing the model to use “more” information when making each decision. But I tried lowering max features and it didn’t help.

Is Confidence based on Tree Variance as good a metric for RandomForestClassifier as it is for RandomForestRegressor? I got this doubt while working on the RF Interpretation for a classification problem. The code Jeremy used in the lesson for this metric for the Bulldozers’ Regressor was this:

%time preds = np.stack([t.predict(X_valid) for t in m.estimators_])
np.mean(preds[:,0]), np.std(preds[:,0])

So what we’re doing there is to see how each tree varies from the mean of the predictions of all trees. But is the mean of the trees’ predictions as reliable for Classification as it is for Regression in the first place?