Unofficial Lesson 7 Classnotes


(Tim Lee) #1

ML-Lecture5.2

import sys
sys.path.append('/Users/tlee010/Desktop/github_repos/fastai/')
%load_ext autoreload
%autoreload 2
%matplotlib inline

import sys

# or wherever you have saved the repo
sys.path.append('/Users/tlee010/Desktop/github_repos/fastai/')

from fastai.imports import *
from fastai.structured import *
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics

Previously: Random Forests in General

Downside - can’t extrapolate very well.

They are a great starting point. There’s a huge catalog of machine learning tools out there. Books don’t curate it down. Why would you use a Support Vector Machine today? Theoretically elegant, and exciting previously, but practically dont see them having any place today.

Hard to write about Random Forests, since its hard to look at it theoretically. GBM / XGBOOST etc. Now industry is more invested in the algorithms and are doing more work.

Packages being used today , top 5 or 6 did not exist more than 3 years ago.

Examples:

  • Feature interactions for RF?
How big should your validation set be?

First need to answer the question, how accurate does this algorithm need to be?

parameter model 1 model 2
n 2000 2000
acc 99.4% 99.2%
wrong 12 16
cats 50% 50%
wrong cats 6 8

There’s a practical consideration here. 0.2% could be a big deal depending on what the application. 22 is where the t distribution turns into a normal distribution.

Rule of Thumb 1 - Once we have a n > 22, it behaves more normally. So 22 out of 1000 etc.

Rule of Thumb 2 - If you have binomial data

Metric Binomial
np mean
np(1-p) std
std/sqrt(n) SE

Rule of Thumb 3 - training set should have an equal number of the rare object, you should over sample until its the same size as the common class. There’s also a way to weight the uncommon class, so its picked more often

** one shot learning ** - recognize somethign that you have only seen once
** zero shot learning ** - not covered in this class, but recognize something you never seen before

1. Returning to our Random Forest Class

HW - add all the other RF interpretations that we have already covered

import feather
df_raw = feather.read_dataframe('/Users/tlee010/kaggle/bulldozers/bulldozers-raw')
print('import complete')
import complete
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')
def split_vals(a,n): return a[:n], a[n:]
n_valid = 12000
n_trn = len(df_trn)-n_valid
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)
raw_train, raw_valid = split_vals(df_raw, n_trn)
x_sub = X_train[['YearMade', 'MachineHoursCurrentMeter']]

1.1 Basic Data Structure

class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)
        self.x,self.y,self.sample_sz,self.min_leaf = x,y,sample_sz,min_leaf
        self.trees = [self.create_tree() for i in range(n_trees)]
    
    def create_tree(self):
        # get random indexes
        rnd_idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        
        # then return a decision tree based on a random dataset
        return DecisionTree(self.x.iloc[rnd_idxs], self.y[rnd_idxs], min_leaf=self.min_leaf)
        
    def predict(self, x):
        
        # currently returning the mean of the prediction of all the 
        # trees
        return np.mean([t.predict(x) for t in self.trees], axis=0)

1.2 Decision Tree Version 1

class DecisionTree():
    def __init__(self, x, y, idxs=None, min_leaf=5):
        self.x,self.y,self.idxs,self.min_leaf = x,y,idxs,min_leaf

2. Now we need to flesh out the decision Tree.

We are going to write it as if everything exists

2.1 Decision Tree Version 2

class DecisionTree():
    def __init__(self, x, y, idxs=None, min_leaf=5):
        """
        x : input features
        y : target variable
        idxs: keep track of what rows went where
        """
        
        # if its the first run, set to all
        if idxs is None: idxs=np.arange(len(y))
            
        # store the data, and count rows and cols
        self.x,self.y,self.idxs,self.min_leaf = x,y,idxs,min_leaf
        self.n,self.c = len(idxs), x.shape[1]
        
        # every node in the tree will have the mean
        # target variable
        self.val = np.mean(y[idxs])
        
        # how effective was the split at this tree
        # only if not a leaf node
        self.score = float('inf')
        
        # having building root, which vari
        # should we split and how much should we split on?
        self.find_varsplit()
        
        
    # This just does one decision; we'll make it recursive later
    def find_varsplit(self):
        
        # go through each variable, and find a better split
        # assuming all features are sent
        for i in range(self.c): self.find_better_split(i)
            
    # We'll write this later!
    def find_better_split(self, var_idx): pass
    
    
    """
    Property means, dont need to call ()
    will be calculated on the fly
    """
    
    @property
    def split_name(self): return self.x.columns[self.var_idx]
    
    @property
    def split_col(self): return self.x.values[self.idxs,self.var_idx]
        
    @property
    def is_leaf(self): return self.score == float('inf')

        
    def __repr__(self):
        """
        This is overloading the print function
        so that everytime we print the decision tree
        it will print some sort of data
        """
        s = f'n: {self.n}; val:{self.val}'
        if not self.is_leaf:
            s += f'; score:{self.score}; split:{self.split}; var:{self.split_name}'
        return s
m = TreeEnsemble(X_train, y_train, n_trees=10, sample_sz=1000, min_leaf=3)

Test our repr_ implementation

m.trees[0]

Look at all the indexes in the tree (hint: all)

m.trees[0].idxs[:10]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

2.2 Comparison to Sklearn

ens = TreeEnsemble(x_sub, y_train, 1, 1000)
tree = ens.trees[0]
x_samp,y_samp = tree.x, tree.y
x_samp.columns
Index(['YearMade', 'MachineHoursCurrentMeter'], dtype='object')

Scikit Learn - 1 Tree, 1 Decision, No bootstrapping

m = RandomForestRegressor(n_estimators=1, max_depth=1, bootstrap=False)
m.fit(x_samp, y_samp)
draw_tree(m.estimators_[0], x_samp, precision=2)

svg

For our function - find a better split

Given this subset of a data we are going to try to find a value to split on. We will find the split that minimizes the stddev of each group. Its the same as minimizing RMSE

YM Price group 1 group 2
3 1
4 1
1 0
6 1
4 0
1 0

Find Better Split v1

Let’s implement. Remember previously:

 # We'll write this later!
    def find_better_split(self, var_idx): pass
def find_better_split(self, var_idx):
    """
    will get the indexes 
    """
    x,y = self.x.values[self.idxs,var_idx], self.y[self.idxs]
    
    # go through every single value
    for i in range(1,self.n-1):
        
        # separate out the data, arrays of booleans
        # since doing a comparative
        lhs = x<=x[i]
        rhs = x>x[i]
        
        # if there's no real split, then stop
        if rhs.sum()==0: continue
            
        # otherwise calculate the two STDs
        lhs_std = y[lhs].std()
        rhs_std = y[rhs].std()
        
        #then average the score
        curr_score = lhs_std*lhs.sum() + rhs_std*rhs.sum()
        
        # is this better than our score so far
        # its intialized as inf.
        # if its better, then we store
        # variable, ID, SCORE, and what it is
        if curr_score<self.score: 
            self.var_idx,self.score,self.split = var_idx,curr_score,x[i]

From Above

ens = TreeEnsemble(x_sub, y_train, 1, 1000)
tree = ens.trees[0]
%timeit find_better_split(tree,1)
tree
75.7 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)





n: 1000; val:10.079014121552744; score:681.0184057251435; split:3744.0; var:MachineHoursCurrentMeter

Test it out on a tree

find_better_split(tree,0); tree
n: 1000; val:10.079014121552744; score:658.5510186055949; split:1974.0; var:YearMade

Baseline Computational Complexity? n^2

  • Consider loops
  • Sorts
  • comparisons
    # will do n times
    for i in range(1,self.n-1):
        
        # will also compare against all values so N again
        lhs = x<=x[i]
        rhs = x>x[i]

3. Find Better Split : Speeding Things up

tree = TreeEnsemble(x_sub, y_train, 1, 1000).trees[0]

Variance

$$ mean(xy) - mean(x)mean(y) $$

Common language: sort the data, keep track of sum + count of the dataset and go through the dataset once. And if we have sum + count, we can keep calculating the mean.

YM Price
1 0
1 0
3 1
4 1
4 0
6 1
def std_agg(cnt, s1, s2): return math.sqrt((s2/cnt) - (s1/cnt)**2)

3.1 Find Better Split v2

def std_agg(cnt, s1, s2): return math.sqrt((s2/cnt) - (s1/cnt)**2)

def find_better_split(self, var_idx):
    x,y = self.x.values[self.idxs,var_idx], self.y[self.idxs]
    
    sort_idx = np.argsort(x)
    sort_y,sort_x = y[sort_idx], x[sort_idx]
    rhs_cnt,rhs_sum,rhs_sum2 = self.n, sort_y.sum(), (sort_y**2).sum()
    lhs_cnt,lhs_sum,lhs_sum2 = 0,0.,0.

    for i in range(0,self.n-self.min_leaf-1):
        xi,yi = sort_x[i],sort_y[i]
        lhs_cnt += 1; rhs_cnt -= 1
        lhs_sum += yi; rhs_sum -= yi
        lhs_sum2 += yi**2; rhs_sum2 -= yi**2
        if i<self.min_leaf or xi==sort_x[i+1]:
            continue
            
        lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)
        rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)
        curr_score = lhs_std*lhs_cnt + rhs_std*rhs_cnt
        if curr_score<self.score: 
            self.var_idx,self.score,self.split = var_idx,curr_score,xi
%prun find_better_split(tree,1)

Example Output

         797 function calls (793 primitive calls) in 0.002 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.002    0.002 <ipython-input-32-89db6b5dcd57>:3(find_better_split)
      318    0.000    0.000    0.000    0.000 <ipython-input-32-89db6b5dcd57>:1(std_agg)
        1    0.000    0.000    0.000    0.000 internals.py:3675(_interleave)
        3    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.000    0.000 {method 'argsort' of 'numpy.ndarray' objects}
      318    0.000    0.000    0.000    0.000 {built-in method math.sqrt}
        1    0.000    0.000    0.002    0.002 {built-in method builtins.exec}
        2    0.000    0.000    0.000    0.000 numerictypes.py:942(_can_coerce_all)
        1    0.000    0.000    0.000    0.000 cast.py:1049(find_common_type)
        1    0.000    0.000    0.000    0.000 generic.py:3661(_protect_consolidate)
    32/28    0.000    0.000    0.000    0.000 {built-in method builtins.len}
 ...

3.2 Python Method for assigning functions, (assigning Find Better Split)

DecisionTree.find_better_split = find_better_split

DecisionTree.find_better_split - our method in our custom class

find_better_split <- the v2 of the find_better_split

DecisionTree.find_better_split = find_better_split
tree = TreeEnsemble(x_sub, y_train, 1, 1000).trees[0]; tree
n: 1000; val:10.079014121552744; score:658.5510186055565; split:1974.0; var:YearMade

Note this is the same as what we ran separately before (see a few blocks up)

find_better_split(tree,0); tree

n: 1000; val:10.079014121552744; score:658.5510186055949; split:1974.0; var:YearMade

4. How to do more than 1 level?

Back to Scikit Learn Comparison with 2 levels

Now we will have to find a decision split in both child nodes to make the next level down.

m = RandomForestRegressor(n_estimators=1, max_depth=2, bootstrap=False)
m.fit(x_samp, y_samp)
draw_tree(m.estimators_[0], x_samp, precision=2)

svg

4.1 Find_varsplit version 2 (improvements)

Old version:

  # This just does one decision; we'll make it recursive later
    def find_varsplit(self):
        for i in range(self.c): self.find_better_split(i)
def find_varsplit(self):
    
    # is there something better
    for i in range(self.c): self.find_better_split(i)
    if self.is_leaf: return
    x = self.split_col
    
    # gets the boolean array, but gets the indexes
    lhs = np.nonzero(x<=self.split)[0]
    rhs = np.nonzero(x>self.split)[0]
    
    # now that we have the indexes we can create a decision tree
    self.lhs = DecisionTree(self.x, self.y, self.idxs[lhs])
    self.rhs = DecisionTree(self.x, self.y, self.idxs[rhs])

Assign it to our class

DecisionTree.find_varsplit = find_varsplit
tree = TreeEnsemble(x_sub, y_train, 1, 1000).trees[0]; tree
n: 1000; val:10.079014121552744; score:658.5510186055565; split:1974.0; var:YearMade

4.2 Checking Answers

Lets look at the left hand and right hand side

tree.lhs
n: 159; val:9.660892662981706; score:76.82696888346362; split:2800.0; var:MachineHoursCurrentMeter
tree.rhs
n: 841; val:10.158064432982941; score:571.4803525045031; split:2005.0; var:YearMade

What about level2

tree.lhs.lhs
n: 150; val:9.619280538108496; score:71.15906938383463; split:1000.0; var:YearMade
tree.lhs.rhs
n: 9; val:10.354428077535193
tree.rhs.lhs
n: 813; val:10.140438373816052; score:550.1806493212129; split:1994.0; var:YearMade
tree.rhs.rhs
n: 28; val:10.669849650935848; score:16.460871194578985; split:895.0; var:MachineHoursCurrentMeter

Again compare to Sklearn (look at samples)

draw_tree(m.estimators_[0], x_samp, precision=2)

svg

5. Writing our RF Predictor

Lets look through a basic prototype of the prediction function

cols = ['MachineID', 'YearMade', 'MachineHoursCurrentMeter', 'ProductSize', 'Enclosure',
        'Coupler_System', 'saleYear']
%time tree = TreeEnsemble(X_train[cols], y_train, 1, 1000).trees[0]
x_samp,y_samp = tree.x, tree.y
CPU times: user 287 ms, sys: 6.72 ms, total: 294 ms
Wall time: 293 ms

The Sk learn version

m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False)
m.fit(x_samp, y_samp)
draw_tree(m.estimators_[0], x_samp, precision=2, ratio=0.9, size=7)

svg

def predict(self, x): return np.array([self.predict_row(xi) for xi in x])
    
DecisionTree.predict = predict

Inline If functions

if x > 1:
    x = "greater"
else:
    x ="lesser"

Can be written as the following:

x = "greater" if x > 1 else "lesser"
def predict_row(self, xi):
    
    #if leaf return
    if self.is_leaf: return self.val
    
    # otherwise return lhs or right hand side
    # this is an exampel of inline if value
    t = self.lhs if xi[self.var_idx]<=self.split else self.rhs
    return t.predict_row(xi)

DecisionTree.predict_row = predict_row
%time preds = tree.predict(X_valid[cols].values)
CPU times: user 89 ms, sys: 2.18 ms, total: 91.2 ms
Wall time: 90.6 ms

6. Looking at scores

plt.scatter(preds, y_valid, alpha=0.05)
<matplotlib.collections.PathCollection at 0x1c1e4ca438>

png

metrics.r2_score(preds, y_valid)
0.50371522136882341

Compare to the Sklearn Plot

m = RandomForestRegressor(n_estimators=1, min_samples_leaf=5, bootstrap=False)
%time m.fit(x_samp, y_samp)
preds = m.predict(X_valid[cols].values)
plt.scatter(preds, y_valid, alpha=0.05)
CPU times: user 3.38 ms, sys: 632 µs, total: 4.02 ms
Wall time: 3.44 ms





<matplotlib.collections.PathCollection at 0x1c1e669240>

png

metrics.r2_score(preds, y_valid)
0.47541053100694763

6. Assembling the Final Version

And below we cut and paste all the necessary functions and collect it in two neat classes as listed down below

class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)
        self.x,self.y,self.sample_sz,self.min_leaf = x,y,sample_sz,min_leaf
        self.trees = [self.create_tree() for i in range(n_trees)]

    def create_tree(self):
        idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        return DecisionTree(self.x.iloc[idxs], self.y[idxs], 
                    idxs=np.array(range(self.sample_sz)), min_leaf=self.min_leaf)
        
    def predict(self, x):
        return np.mean([t.predict(x) for t in self.trees], axis=0)

def std_agg(cnt, s1, s2): return math.sqrt((s2/cnt) - (s1/cnt)**2)
class DecisionTree():
    def __init__(self, x, y, idxs, min_leaf=5):
        self.x,self.y,self.idxs,self.min_leaf = x,y,idxs,min_leaf
        self.n,self.c = len(idxs), x.shape[1]
        self.val = np.mean(y[idxs])
        self.score = float('inf')
        self.find_varsplit()
        
    def find_varsplit(self):
        for i in range(self.c): self.find_better_split(i)
        if self.score == float('inf'): return
        x = self.split_col
        lhs = np.nonzero(x<=self.split)[0]
        rhs = np.nonzero(x>self.split)[0]
        self.lhs = DecisionTree(self.x, self.y, self.idxs[lhs])
        self.rhs = DecisionTree(self.x, self.y, self.idxs[rhs])

    def find_better_split(self, var_idx):
        x,y = self.x.values[self.idxs,var_idx], self.y[self.idxs]
        sort_idx = np.argsort(x)
        sort_y,sort_x = y[sort_idx], x[sort_idx]
        rhs_cnt,rhs_sum,rhs_sum2 = self.n, sort_y.sum(), (sort_y**2).sum()
        lhs_cnt,lhs_sum,lhs_sum2 = 0,0.,0.

        for i in range(0,self.n-self.min_leaf-1):
            xi,yi = sort_x[i],sort_y[i]
            lhs_cnt += 1; rhs_cnt -= 1
            lhs_sum += yi; rhs_sum -= yi
            lhs_sum2 += yi**2; rhs_sum2 -= yi**2
            if i<self.min_leaf or xi==sort_x[i+1]:
                continue

            lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)
            rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)
            curr_score = lhs_std*lhs_cnt + rhs_std*rhs_cnt
            if curr_score<self.score: 
                self.var_idx,self.score,self.split = var_idx,curr_score,xi

    @property
    def split_name(self): return self.x.columns[self.var_idx]
    
    @property
    def split_col(self): return self.x.values[self.idxs,self.var_idx]

    @property
    def is_leaf(self): return self.score == float('inf')
    
    def __repr__(self):
        s = f'n: {self.n}; val:{self.val}'
        if not self.is_leaf:
            s += f'; score:{self.score}; split:{self.split}; var:{self.split_name}'
        return s

    def predict(self, x):
        return np.array([self.predict_row(xi) for xi in x])

    def predict_row(self, xi):
        if self.is_leaf: return self.val
        t = self.lhs if xi[self.var_idx]<=self.split else self.rhs
        return t.predict_row(xi)
!pip install Cython
Requirement already satisfied: Cython in /Users/tlee010/anaconda/envs/dsi/lib/python2.7/site-packages
!conda install Cython
Fetching package metadata ...........
Solving package specifications: .

Package plan for installation in environment /Users/tlee010/anaconda/envs/dsi:

The following packages will be UPDATED:

    cython: 0.24-py27_0 --> 0.26.1-py27h6a053f9_0

Proceed ([y]/n)? ^C

7. Cython - more speed improvements

A super set of python

%load_ext Cython

Pure Python

def fib1(n):
    a, b = 0,1
    while b < n:
        a,b = b, a +b
%timeit fib1(50)
308 ns ± 8.24 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Cython Wrapper

%%cython
def fib2(n):
    a, b = 0,1
    while b < n:
        a,b = b, a +b
%timeit fib2(50)
306 ns ± 6.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Cython version with Typing Specified

%%cython
def fib3(int n):
    cdef int b = 1
    cdef int a = 0
    cdef int t = 0
    while b < n:
        t = a
        a = b
        b = t + b
%timeit fib3(50)
51.4 ns ± 0.903 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Homework

Confidence Based Tree Variance

Feature Importance

Partial Dependence

Tree Interpreter

Bonus : Interaction Importance



About the Intro to Machine Learning category
Wiki / Lesson Thread: Lesson 7