Wiki / Lesson Thread: Lesson 7

(melissa.fabros) #1

This is a forum wiki thread, so you all can edit this post to add/change/organize info to help make it better! To edit, click on the little pencil icon at the bottom of this post.

<<< Wiki: Lesson 6 | Wiki: Lesson 8 >>>

Lesson resources

Finishing up what we know about Random Forests

When to leave the Random Forests

(Jeremy Howard (Admin)) #2

Just posted the lesson video.

(Asmita Vikas) #3

Question - When we build an ensemble of trees, how does the code know which columns to split on for a particular tree?
For example, tree1 would split on ‘YearMade’ followed by ‘MachineHoursCurrentMeter’,
while tree2 could split on ‘MachineHoursCurrentMeter’ followed by ‘Coupler_System’.

(Davi Schumacher) #5

Asmita, did you ever get an answer for this?

(Jeremy Howard (Admin)) #6

Take a look at the code we wrote in class together, and re-watch the previous two lessons, where we learnt the steps and then implemented the code from scratch. Then come back here and summarize as best you can what we covered in the lessons about this and how the code we wrote works, and then we can help fill in any gaps or make any corrections as needed. How does that sound?

(Asmita Vikas) #7

thanks Jeremy!
@daschumacher, yes I did understand this part. Forgot to mention it here.
So if we look at the TreeEnsemble class in the code, we are randomly shuffling the rows and taking a different random sample of them for each tree.
Since each tree trains on a different sample of the data, each tree can end up splitting on columns in a different order.

def create_tree(self):
    idxs = np.random.permutation(len(self.y))[:self.sample_sz]
    return DecisionTree(self.x.iloc[idxs], self.y[idxs],
                        idxs=np.array(range(self.sample_sz)), min_leaf=self.min_leaf)
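A quick sketch (with a toy target array, just for illustration) of how np.random.permutation gives each tree its own row sample:

```python
import numpy as np

np.random.seed(0)
y = np.arange(10)      # toy stand-in for self.y
sample_sz = 5

# Each call draws a fresh random subset of row indices, so every
# tree in the ensemble trains on (and splits over) different rows.
idxs1 = np.random.permutation(len(y))[:sample_sz]
idxs2 = np.random.permutation(len(y))[:sample_sz]
print(idxs1, idxs2)
```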

(hector) #8

@jeremy I really liked the way distribution concepts were applied to estimate model accuracy, confidence intervals, etc. Can you point to any good resource that explains statistical distributions in a practical way like this, so that we can apply them in various scenarios rather than just theory and equations?

(Alexander) #9

Question: in the TreeEnsemble for RF we use the mean of the tree predictions. Can we use logistic regression/a simple single-layer neural net to learn better weights for combining the trees? Any downsides (besides increased potential for overfitting and additional hyper-parameter tuning)?
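The idea in the question is usually called stacking. A NumPy-only sketch with made-up data (the per-tree prediction matrix here is simulated, not from the course code), fitting per-tree weights by least squares instead of the equal-weight mean:

```python
import numpy as np

# Toy stand-in: columns are per-tree predictions, rows are samples.
rng = np.random.RandomState(0)
y_true = rng.rand(100)
tree_preds = y_true[:, None] + 0.1 * rng.randn(100, 5)  # 5 "trees"

# Plain bagging: equal-weight mean across trees.
mean_pred = tree_preds.mean(axis=1)

# Stacking: learn per-tree weights with least squares
# (a linear model on top of the trees' outputs).
w, *_ = np.linalg.lstsq(tree_preds, y_true, rcond=None)
stacked_pred = tree_preds @ w
```

On the training set the stacked combination can never do worse than the mean, since the mean is itself one linear combination (all weights 1/5); the real risk is exactly the overfitting the question mentions, so the weights should be fit on held-out predictions.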

(Joseph Catanzarite) #10

In this post we’ll delve into the math behind the code for std_agg(), the function that @jeremy created to compute the standard deviation of a one-dimensional data vector, discussed in the Lesson #7 video, starting at the 45:32 point.

In the std_agg() function, @jeremy employs the following expression for mean squared deviation:

sum((x - <x>)**2)/N = <x**2> - <x>**2

If you’re curious about this basic result from statistics, read on!

The standard deviation is a measure of the spread or dispersion in a set of real values. Mathematically, it is computed as the root mean squared deviation. Don’t worry if you are unfamiliar with these terms; we’ll discuss them below.

The definition of the standard deviation of a vector of values x is

std(x) = sqrt( sum( (x - <x>)**2 )/N )

In this formula,

  • N is the number of samples (x values), and angle brackets <> around a vector x denotes the expectation value, or mean of x: <x> = sum(x)/N.

  • x - <x> is called the deviation of a sample x from the mean.

  • for simplicity, we’ve used N instead of the usual N - 1 in the denominator, which is fine as long as N is large.

Let’s expand the quadratic squared deviation inside the sum in the preceding equation:

(x - <x>)**2 = x**2 - 2*x*<x> + <x>**2.

Summing the squared deviations over the samples and dividing by N gives the mean squared deviation:

sum((x - <x>)**2)/N = sum( x**2)/N - 2*sum(x*<x>)/N + sum(<x>**2)/N

Let’s examine the right hand side of the preceding equation:

  • The first term is the mean of x**2, which is by definition the expectation value <x**2>

  • In the second term, sum(x*<x>)/N is the same as <x>*sum(x)/N, since <x> is a constant factor and can be taken outside the sum.

  • But look! sum(x)/N is the mean of x, which is by definition <x>, the expectation value of x. Therefore sum(x*<x>)/N = <x>*sum(x)/N = <x>*<x> = <x>**2, so the second term becomes -2*<x>**2.

  • The third term is just <x>**2: since <x>**2 is a constant, summing it over the N samples multiplies it by N, and the division by N then leaves <x>**2.

Adding together the three terms, we find that the mean squared deviation is

sum((x - <x>)**2)/N = <x**2> - 2<x>**2 + <x>**2 = <x**2> - <x>**2

Note that this is the mean of the squares minus the square of the mean.

By its definition, the standard deviation is the root mean squared deviation, so now we just need to take the square root:

std(x) = sqrt(sum((x - <x>)**2)/N) = sqrt(<x**2> - <x>**2)
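A quick numerical check of this identity (using NumPy and a made-up data vector):

```python
import numpy as np

x = np.array([1.0, 4.0, 2.0, 7.0, 5.0])
N = len(x)

# Left-hand side: root mean squared deviation, computed directly.
lhs = np.sqrt(np.sum((x - x.mean()) ** 2) / N)

# Right-hand side: mean of the squares minus the square of the mean.
rhs = np.sqrt((x ** 2).mean() - x.mean() ** 2)

print(lhs, rhs)   # identical up to floating-point error
```

Note that np.std also uses N (not N - 1) in the denominator by default, so it agrees with both forms.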

In @jeremy’s notation, cnt is the number of samples N,

s1 is the sum of the sample values x, so <x> = s1/N = s1/cnt,

s2 is the sum of squares of the sample values, so <x**2> = s2/N = s2/cnt.

The formula for standard deviation becomes

std(x) = sqrt( s2/cnt - (s1/cnt)**2 )

And thus @jeremy’s function definition is

def std_agg(cnt,s1,s2): return math.sqrt( (s2/cnt) - (s1/cnt)**2 )

This formula speeds up the computation of the standard deviation during the split search: its inputs are just the count, the sum of the data, and the sum of the squares of the data. Both sums can be kept as running totals and updated in O(1) as each sample moves across a candidate split point, so we never have to recompute the standard deviation from scratch.
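To see this in action, here is a sketch (with toy data, not the course's exact find_better_split code) of scanning candidate split points while maintaining the running sums, so each side's standard deviation costs O(1) per step:

```python
import math
import numpy as np

def std_agg(cnt, s1, s2): return math.sqrt((s2/cnt) - (s1/cnt)**2)

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

# Start with every sample on the right-hand side of the split.
rhs_cnt, rhs_s1, rhs_s2 = len(y), y.sum(), (y**2).sum()
lhs_cnt, lhs_s1, lhs_s2 = 0, 0.0, 0.0

for yi in y[:-1]:
    # Move one sample from right to left: O(1) bookkeeping,
    # no recomputation of either side's std from scratch.
    lhs_cnt += 1; rhs_cnt -= 1
    lhs_s1 += yi; rhs_s1 -= yi
    lhs_s2 += yi**2; rhs_s2 -= yi**2
    print(std_agg(lhs_cnt, lhs_s1, lhs_s2),
          std_agg(rhs_cnt, rhs_s1, rhs_s2))
```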