Class Notes Section 2 - Wk2Lec2


(Tim Lee) #1

ML Lecture 2: Random Forests, with in-class notes

Note: I had to run

$source activate fastai

to make sure the library imports work; otherwise you might get opencv errors.

Loading the libraries from the first two blocks with random forests

%load_ext autoreload
%autoreload 2
# load libraries and edit the modules so you can use them


%matplotlib inline
# allows inline plotting
import sys
sys.path.append("/Users/tlee010/Desktop/github_repos/fastai/") # go to parent dir

from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display

from sklearn import metrics

Where does a particular Python command come from?

The leftmost term is the library the command comes from:

display
<function IPython.core.display.display>

what else can we learn about a command?

?display

Want to see the Python source code?

??display

What parameters does this python function take?

SHIFT + TAB

SHIFT + TAB x2 brings up documentation

SHIFT + TAB x3 new window with documentation

PATH = '/Users/tlee010/Desktop/github_repos/fastai/data/bul'

What is git ignore?

It’s a hidden file, named .gitignore, that lists files you don’t want replicated to the server. You can use wildcards to exclude large numbers of files; for instance, tmp* would ignore all folders prefixed with tmp. You can also put a .gitignore in subdirectories.

This is really useful for ignoring:

  • large datasets (don’t want to post on github)
  • credentials (keys)
  • configuration files
  • backup files
  • scratch files
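A sample .gitignore along those lines (the entries are illustrative, not from the lecture):

```
# large datasets -- don't push these to GitHub
data/
*.csv
*.zip

# credentials (keys)
*.pem
.env

# config, backup, and scratch files
*.bak
tmp*
scratch*
```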

What is symlink?

Lets you alias a folder if you want one path to point to another.
# documentation for creating a symlink
$man ln
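A minimal example of creating and checking a symlink (the paths here are illustrative):

```shell
# make a target folder and a symbolic link pointing at it
mkdir -p /tmp/bulldozers
ln -sfn /tmp/bulldozers data_link

# ls -l shows the alias: data_link -> /tmp/bulldozers
ls -l data_link
```

Now `cd data_link` or any path through it transparently reaches the real folder.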

Let’s get the data

Download (need to login) and unzip

If you don’t have unzip:

$brew install unzip
$unzip file.zip

check the file location

!ls -l /Users/tlee010/kaggle/bulldozers/
total 246368
-rwxr-xr-x@ 1 tlee010  staff  116403970 Jan 24  2013 Train.csv
-rw-r--r--@ 1 tlee010  staff    9732240 Oct 26 14:20 Train.zip

Let’s look at the PATH

Add { } around the variable so Jupyter substitutes the Python PATH variable into the bash command:

!ls {PATH}
ls: /Users/tlee010/Desktop/github_repos/fastai/data/bul: No such file or directory

How can you download straight to AWS? - Firefox trick

CTRL-SHIFT-I for developer console within Firefox

  1. Hit download and watch for the actual link being used.
  2. Pause the download.
  3. Right-click the download record; you should get a long curl string. It has the cookies / auth details, so you can paste it into the AWS machine. Then add a ‘-o filename.zip’ to name the output. That should download very quickly.

Bulldozers

Predicting the auction sale price for a piece of heavy equipment, to create a blue book for bulldozers.

The key fields in train.csv are:

  • SalesID: the unique identifier of the sale
  • MachineID: the unique identifier of a machine. A machine can be sold multiple times
  • saleprice: what the machine sold for at auction (only provided in train.csv)
  • saledate: the date of the sale

Look at the Data

!head /Users/tlee010/kaggle/bulldozers/Train.csv | head -3
!wc  /Users/tlee010/kaggle/bulldozers/Train.csv
  401126 8009543 116403970 /Users/tlee010/kaggle/bulldozers/Train.csv

Using dataframes (Pandas)

import pandas as pd

df_raw = pd.read_csv('/Users/tlee010/kaggle/bulldozers/Train.csv', low_memory=False, parse_dates=["saledate"])

Quick note:

We added a parse_dates option to the read_csv call to force the data type. For a full list of parameters and options, **check the pandas documentation**:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
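A small sketch of what parse_dates does, using a made-up inline CSV instead of the bulldozers file:

```python
import io
import pandas as pd

csv = io.StringIO("SalesID,saledate\n1,11/16/2006 0:00\n2,3/26/2004 0:00")

# without parse_dates, saledate loads as a plain object (string) column
df = pd.read_csv(csv)
print(df.saledate.dtype)  # object

csv.seek(0)
# with parse_dates, pandas converts the column to datetime64 on import
df = pd.read_csv(csv, parse_dates=["saledate"])
print(df.saledate.dtype)  # datetime64[ns]
```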

Look at the first few rows

df_raw.head(10)
SalesID SalePrice MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter UsageBand saledate ... Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls
0 1139246 66000 999089 3157 121 3.0 2004 68.0 Low 11/16/2006 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional
1 1139248 57000 117657 77 121 3.0 1996 4640.0 Low 3/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional
2 1139249 10000 434808 7009 121 3.0 2001 2838.0 High 2/26/2004 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1139251 38500 1026470 332 121 3.0 2001 3486.0 High 5/19/2011 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1139253 11000 1057373 17311 121 3.0 2007 722.0 Medium 7/23/2009 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 1139255 26500 1001274 4605 121 3.0 2004 508.0 Low 12/18/2008 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 1139256 21000 772701 1937 121 3.0 1993 11540.0 High 8/26/2004 0:00 ... None or Unspecified None or Unspecified None or Unspecified None or Unspecified Double NaN NaN NaN NaN NaN
7 1139261 27000 902002 3539 121 3.0 2001 4883.0 High 11/17/2005 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 1139272 21500 1036251 36003 121 3.0 2008 302.0 Low 8/27/2009 0:00 ... None or Unspecified None or Unspecified None or Unspecified None or Unspecified Double NaN NaN NaN NaN NaN
9 1139275 65000 1016474 3883 121 3.0 1000 20700.0 Medium 8/9/2007 0:00 ... NaN NaN NaN NaN NaN NaN NaN NaN Standard Conventional

10 rows × 53 columns

To see all columns (custom function):

def display_all(df):
    with pd.option_context("display.max_rows", 1000): 
        with pd.option_context("display.max_columns", 1000): 
            display(df)
display_all(df_raw.tail().transpose())
401120 401121 401122 401123 401124
SalesID 6333336 6333337 6333338 6333341 6333342
SalePrice 10500 11000 11500 9000 7750
MachineID 1840702 1830472 1887659 1903570 1926965
ModelID 21439 21439 21439 21435 21435
datasource 149 149 149 149 149
auctioneerID 1 1 1 2 2
YearMade 2005 2005 2005 2005 2005
MachineHoursCurrentMeter NaN NaN NaN NaN NaN
UsageBand NaN NaN NaN NaN NaN
saledate 2011-11-02 00:00:00 2011-11-02 00:00:00 2011-11-02 00:00:00 2011-10-25 00:00:00 2011-10-25 00:00:00
fiModelDesc 35NX2 35NX2 35NX2 30NX 30NX
fiBaseModel 35 35 35 30 30
fiSecondaryDesc NX NX NX NX NX
fiModelSeries 2 2 2 NaN NaN
fiModelDescriptor NaN NaN NaN NaN NaN
ProductSize Mini Mini Mini Mini Mini
fiProductClassDesc Hydraulic Excavator, Track - 3.0 to 4.0 Metric... Hydraulic Excavator, Track - 3.0 to 4.0 Metric... Hydraulic Excavator, Track - 3.0 to 4.0 Metric... Hydraulic Excavator, Track - 2.0 to 3.0 Metric... Hydraulic Excavator, Track - 2.0 to 3.0 Metric...
state Maryland Maryland Maryland Florida Florida
ProductGroup TEX TEX TEX TEX TEX
ProductGroupDesc Track Excavators Track Excavators Track Excavators Track Excavators Track Excavators
Drive_System NaN NaN NaN NaN NaN
Enclosure EROPS EROPS EROPS EROPS EROPS
Forks NaN NaN NaN NaN NaN
Pad_Type NaN NaN NaN NaN NaN
Ride_Control NaN NaN NaN NaN NaN
Stick NaN NaN NaN NaN NaN
Transmission NaN NaN NaN NaN NaN
Turbocharged NaN NaN NaN NaN NaN
Blade_Extension NaN NaN NaN NaN NaN
Blade_Width NaN NaN NaN NaN NaN
Enclosure_Type NaN NaN NaN NaN NaN
Engine_Horsepower NaN NaN NaN NaN NaN
Hydraulics Auxiliary Standard Auxiliary Standard Standard
Pushblock NaN NaN NaN NaN NaN
Ripper NaN NaN NaN NaN NaN
Scarifier NaN NaN NaN NaN NaN
Tip_Control NaN NaN NaN NaN NaN
Tire_Size NaN NaN NaN NaN NaN
Coupler None or Unspecified None or Unspecified None or Unspecified None or Unspecified None or Unspecified
Coupler_System NaN NaN NaN NaN NaN
Grouser_Tracks NaN NaN NaN NaN NaN
Hydraulics_Flow NaN NaN NaN NaN NaN
Track_Type Steel Steel Steel Steel Steel
Undercarriage_Pad_Width None or Unspecified None or Unspecified None or Unspecified None or Unspecified None or Unspecified
Stick_Length None or Unspecified None or Unspecified None or Unspecified None or Unspecified None or Unspecified
Thumb None or Unspecified None or Unspecified None or Unspecified None or Unspecified None or Unspecified
Pattern_Changer None or Unspecified None or Unspecified None or Unspecified None or Unspecified None or Unspecified
Grouser_Type Double Double Double Double Double
Backhoe_Mounting NaN NaN NaN NaN NaN
Blade_Type NaN NaN NaN NaN NaN
Travel_Controls NaN NaN NaN NaN NaN
Differential_Type NaN NaN NaN NaN NaN
Steering_Controls NaN NaN NaN NaN NaN
display_all(df_raw.describe(include='all').transpose())
count unique top freq first last mean std min 25% 50% 75% max
SalesID 401125 NaN NaN NaN NaN NaN 1.91971e+06 909021 1.13925e+06 1.41837e+06 1.63942e+06 2.24271e+06 6.33334e+06
SalePrice 401125 NaN NaN NaN NaN NaN 31099.7 23036.9 4750 14500 24000 40000 142000
MachineID 401125 NaN NaN NaN NaN NaN 1.2179e+06 440992 0 1.0887e+06 1.27949e+06 1.46807e+06 2.48633e+06
ModelID 401125 NaN NaN NaN NaN NaN 6889.7 6221.78 28 3259 4604 8724 37198
datasource 401125 NaN NaN NaN NaN NaN 134.666 8.96224 121 132 132 136 172
auctioneerID 380989 NaN NaN NaN NaN NaN 6.55604 16.9768 0 1 2 4 99
YearMade 401125 NaN NaN NaN NaN NaN 1899.16 291.797 1000 1985 1995 2000 2013
MachineHoursCurrentMeter 142765 NaN NaN NaN NaN NaN 3457.96 27590.3 0 0 0 3025 2.4833e+06
UsageBand 69639 3 Medium 33985 NaN NaN NaN NaN NaN NaN NaN NaN NaN
saledate 401125 3919 2009-02-16 00:00:00 1932 1989-01-17 00:00:00 2011-12-30 00:00:00 NaN NaN NaN NaN NaN NaN NaN
fiModelDesc 401125 4999 310G 5039 NaN NaN NaN NaN NaN NaN NaN NaN NaN
fiBaseModel 401125 1950 580 19798 NaN NaN NaN NaN NaN NaN NaN NaN NaN
fiSecondaryDesc 263934 175 C 43235 NaN NaN NaN NaN NaN NaN NaN NaN NaN
fiModelSeries 56908 122 II 13202 NaN NaN NaN NaN NaN NaN NaN NaN NaN
fiModelDescriptor 71919 139 L 15875 NaN NaN NaN NaN NaN NaN NaN NaN NaN
ProductSize 190350 6 Medium 62274 NaN NaN NaN NaN NaN NaN NaN NaN NaN
fiProductClassDesc 401125 74 Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... 56166 NaN NaN NaN NaN NaN NaN NaN NaN NaN
state 401125 53 Florida 63944 NaN NaN NaN NaN NaN NaN NaN NaN NaN
ProductGroup 401125 6 TEX 101167 NaN NaN NaN NaN NaN NaN NaN NaN NaN
ProductGroupDesc 401125 6 Track Excavators 101167 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Drive_System 104361 4 Two Wheel Drive 46139 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Enclosure 400800 6 OROPS 173932 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Forks 192077 2 None or Unspecified 178300 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Pad_Type 79134 4 None or Unspecified 70614 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Ride_Control 148606 3 No 77685 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Stick 79134 2 Standard 48829 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Transmission 183230 8 Standard 140328 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Turbocharged 79134 2 None or Unspecified 75211 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Blade_Extension 25219 2 None or Unspecified 24692 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Blade_Width 25219 6 14' 9615 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Enclosure_Type 25219 3 None or Unspecified 21923 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Engine_Horsepower 25219 2 No 23937 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Hydraulics 320570 12 2 Valve 141404 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Pushblock 25219 2 None or Unspecified 19463 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Ripper 104137 4 None or Unspecified 83452 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Scarifier 25230 2 None or Unspecified 12719 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Tip_Control 25219 3 None or Unspecified 16207 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Tire_Size 94718 17 None or Unspecified 46339 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Coupler 213952 3 None or Unspecified 184582 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Coupler_System 43458 2 None or Unspecified 40430 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Grouser_Tracks 43362 2 None or Unspecified 40515 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Hydraulics_Flow 43362 3 Standard 42784 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Track_Type 99153 2 Steel 84880 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Undercarriage_Pad_Width 99872 19 None or Unspecified 79651 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Stick_Length 99218 29 None or Unspecified 78820 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Thumb 99288 3 None or Unspecified 83093 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Pattern_Changer 99218 3 None or Unspecified 90255 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Grouser_Type 99153 3 Double 84653 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Backhoe_Mounting 78672 2 None or Unspecified 78652 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Blade_Type 79833 10 PAT 38612 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Travel_Controls 79834 7 None or Unspecified 69923 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Differential_Type 69411 4 Standard 68073 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Steering_Controls 69369 5 Conventional 68679 NaN NaN NaN NaN NaN NaN NaN NaN NaN

… now that EDA is complete, let’s move on to modeling and evaluation

From bulldozer Kaggle overview:

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

Sample submission files can be downloaded from the data page. Submission files should be formatted as follows:

Have a header: "SalesID,SalePrice"
Contain two columns
SalesID: SalesID for the validation set in sorted order
SalePrice: Your predicted price of the sale

About metrics:

It’s not always MSE or RMSE; sometimes it’s profitability. It’s important to identify and know what the target metric is.

Let’s make the target the log of the sale price

df_raw.SalePrice = np.log(df_raw.SalePrice)
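Taking the log up front is what lets plain RMSE stand in for the competition’s RMSLE (ignoring the +1 some RMSLE definitions add inside the log). A quick numeric check with made-up prices:

```python
import numpy as np

actual = np.array([66000.0, 57000.0, 10000.0])
pred   = np.array([60000.0, 55000.0, 12000.0])

def rmse(x, y):
    return np.sqrt(((x - y) ** 2).mean())

# RMSLE on raw prices is exactly RMSE on log-prices, so a model trained
# to minimize RMSE on log(SalePrice) is minimizing the competition metric
rmsle_raw = np.sqrt(np.mean((np.log(pred) - np.log(actual)) ** 2))
rmse_logs = rmse(np.log(pred), np.log(actual))
print(np.isclose(rmsle_raw, rmse_logs))  # True
```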

RandomForestRegressor - predicts a continuous variable

RandomForestClassifier - predicts a binary or multiclass categorical variable

Random forests are trivially parallelizable: it’s easy to distribute the workload across all your CPU cores. This is triggered by the parameter:

n_jobs=-1
# scikit-learn works the same way for all algorithms

# start an ML object (it will start with defaults and empty)
m = RandomForestRegressor(n_jobs=-1)

# then send it the data so the model can be 'FIT'
m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-37-c310081c33a1> in <module>()
      1 m = RandomForestRegressor(n_jobs=-1)
----> 2 m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)


~/anaconda/envs/fastai/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
    245         """
    246         # Validate or convert input data
--> 247         X = check_array(X, accept_sparse="csc", dtype=DTYPE)
    248         y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
    249         if sample_weight is not None:


~/anaconda/envs/fastai/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    431                                       force_all_finite)
    432     else:
--> 433         array = np.array(array, dtype=dtype, order=order, copy=copy)
    434 
    435         if ensure_2d:


ValueError: could not convert string to float: 'Conventional'

Stack Trace - traces the error through all the nested function calls.

Check the bottom line for the true error: could not convert string to float: 'Conventional'

This dataset contains a mix of continuous and categorical variables.

The following method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities.

add_datepart(df_raw, 'saledate')
df_raw.saleYear.head()
0    2006
1    2004
2    2004
3    2011
4    2009
Name: saleYear, dtype: int64

The categorical variables are currently stored as strings, which is inefficient, and doesn’t provide the numeric coding required for a random forest. Therefore we call train_cats to convert strings to pandas categories.

A random forest can handle categorical variables once they are encoded as numbers.

??add_datepart

Let’s look under the hood

Signature: add_datepart(df, fldname)
Source:   
def add_datepart(df, fldname):
    fld = df[fldname]
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    for n in ('Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start'):
        df[targ_pre+n] = getattr(fld.dt,n.lower())
    df[targ_pre+'Elapsed'] = (fld - fld.min()).dt.days
    df.drop(fldname, axis=1, inplace=True)
File:      ~/Desktop/github_repos/fastai/fastai/structured.py
Type:      function

Let’s walk through the function: it uses a regex to strip the ‘date’ suffix from the field name, then replaces the date column with multiple columns such as ‘Day’, ‘Month’, ‘Year’, etc.

Are they useful? Include them, and every variant that might be useful: max, min, mean.

What about the curse of dimensionality, having too many columns relative to rows? In practice that doesn’t happen. The more data the better.
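The expansion above can be sketched with pandas’ .dt accessors directly (toy data; this mirrors a subset of the columns add_datepart creates):

```python
import pandas as pd

df = pd.DataFrame({"saledate": pd.to_datetime(["2006-11-16", "2004-03-26"])})

# expand the datetime into numeric parts, the way add_datepart does
fld = df["saledate"]
for part in ("year", "month", "day", "dayofweek"):
    df["sale" + part.capitalize()] = getattr(fld.dt, part)

# days elapsed since the earliest date in the column
df["saleElapsed"] = (fld - fld.min()).dt.days
df = df.drop("saledate", axis=1)

print(df.saleYear.tolist())  # [2006, 2004]
```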

No Free Lunch Theorem

Averaged across all possible datasets (random or otherwise), no one technique is better than any other.

Free Lunch Theorem

In practice, random forests are the best technique for most cases, mainly because real-world problems are not random datasets.

Replace categories

train_cats(df_raw)
??train_cats
Signature: train_cats(df)
Source:   
def train_cats(df):
    for n,c in df.items():
        if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()
File:      ~/Desktop/github_repos/fastai/fastai/structured.py
Type:      function
df_raw.UsageBand.cat.categories
Index(['High', 'Low', 'Medium'], dtype='object')
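Under the hood this is pandas’ category dtype; a minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["Low", "High", "Medium", "High"])

# convert strings to an ordered categorical, as train_cats does
c = s.astype("category").cat.as_ordered()

# categories are the sorted unique strings; codes are the integers
# a random forest can actually consume
print(list(c.cat.categories))   # ['High', 'Low', 'Medium']
print(c.cat.codes.tolist())     # [1, 0, 2, 0]
```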

What do we do with missing values? Let’s take a look at the % null:

display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
Backhoe_Mounting            0.803872
Blade_Extension             0.937129
Blade_Type                  0.800977
Blade_Width                 0.937129
Coupler                     0.466620
Coupler_System              0.891660
Differential_Type           0.826959
Drive_System                0.739829
Enclosure                   0.000810
Enclosure_Type              0.937129
Engine_Horsepower           0.937129
Forks                       0.521154
Grouser_Tracks              0.891899
Grouser_Type                0.752813
Hydraulics                  0.200823
Hydraulics_Flow             0.891899
MachineHoursCurrentMeter    0.644089
MachineID                   0.000000
ModelID                     0.000000
Pad_Type                    0.802720
Pattern_Changer             0.752651
ProductGroup                0.000000
ProductGroupDesc            0.000000
ProductSize                 0.525460
Pushblock                   0.937129
Ride_Control                0.629527
Ripper                      0.740388
SalePrice                   0.000000
SalesID                     0.000000
Scarifier                   0.937102
Steering_Controls           0.827064
Stick                       0.802720
Stick_Length                0.752651
Thumb                       0.752476
Tip_Control                 0.937129
Tire_Size                   0.763869
Track_Type                  0.752813
Transmission                0.543210
Travel_Controls             0.800975
Turbocharged                0.802720
Undercarriage_Pad_Width     0.751020
UsageBand                   0.826391
YearMade                    0.000000
auctioneerID                0.050199
datasource                  0.000000
fiBaseModel                 0.000000
fiModelDesc                 0.000000
fiModelDescriptor           0.820707
fiModelSeries               0.858129
fiProductClassDesc          0.000000
fiSecondaryDesc             0.342016
saleDay                     0.000000
saleDayofweek               0.000000
saleDayofyear               0.000000
saleElapsed                 0.000000
saleIs_month_end            0.000000
saleIs_month_start          0.000000
saleIs_quarter_end          0.000000
saleIs_quarter_start        0.000000
saleIs_year_end             0.000000
saleIs_year_start           0.000000
saleMonth                   0.000000
saleWeek                    0.000000
saleYear                    0.000000
state                       0.000000
dtype: float64

Save the data with feather format

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/raw')
??proc_df
Signature: proc_df(df, y_fld, skip_flds=None, do_scale=False, preproc_fn=None, max_n_cat=None, subset=None)
Source:   
def proc_df(df, y_fld, skip_flds=None, do_scale=False,
            preproc_fn=None, max_n_cat=None, subset=None):
    if not skip_flds: skip_flds=[]
    if subset: df = get_sample(df,subset)
    df = df.copy()
    if preproc_fn: preproc_fn(df)
    y = df[y_fld].values
    df.drop(skip_flds+[y_fld], axis=1, inplace=True)

    for n,c in df.items(): fix_missing(df, c, n)
    if do_scale: mapper = scale_vars(df)
    for n,c in df.items(): numericalize(df, c, n, max_n_cat)
    res = [pd.get_dummies(df, dummy_na=True), y]
    if not do_scale: return res
    return res + [mapper]
File:      ~/Desktop/github_repos/fastai/fastai/structured.py
Type:      function

This will go through each column and run fix_missing:

??fix_missing
Signature: fix_missing(df, col, name)
Source:   
def fix_missing(df, col, name):
    if is_numeric_dtype(col):
        if pd.isnull(col).sum(): df[name+'_na'] = pd.isnull(col)
        df[name] = col.fillna(col.median())
File:      ~/Desktop/github_repos/fastai/fastai/structured.py
Type:      function

If the column is numeric, create a new column that tells you whether each value was missing, then fill the missing values with the median for that continuous field.
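A tiny sketch of that median-fill behavior on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"hours": [100.0, None, 300.0]})

# flag which rows were missing, then fill with the column median,
# the way fix_missing does for numeric columns
col = df["hours"]
df["hours_na"] = pd.isnull(col)
df["hours"] = col.fillna(col.median())

# hours becomes [100.0, 200.0, 300.0]; hours_na is [False, True, False]
print(df)
```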

df, y = proc_df(df_raw, 'SalePrice')
df.columns
Index(['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
       'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'fiModelDesc',
       'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor',
       'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup',
       'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type',
       'Ride_Control', 'Stick', 'Transmission', 'Turbocharged',
       'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower',
       'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control',
       'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks',
       'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width',
       'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type',
       'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
       'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
       'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
       'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
       'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
       'saleElapsed', 'auctioneerID_na', 'MachineHoursCurrentMeter_na'],
      dtype='object')

Run your first RandomForest Model

m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)
0.98305857537865071

Is this overfitting?

Setup Test and Train Split

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape
((389125, 66), (389125,), (12000, 66))
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)
CPU times: user 1min 17s, sys: 666 ms, total: 1min 17s
Wall time: 15.4 s
[0.09067208421281527, 0.2523758498048438, 0.98281767127977293, 0.88625205413504937]

Wiki thread: lesson 1
(Jeremy Howard (Admin)) #2

Awesome!!! Thanks so much :slight_smile:

FYI I’ve increased the post size limit and edited your doc to put the full markdown in.


(Chenxi Ge) #3

Regarding the command for creating a symlink: is it ‘ln’, not ‘lm’?


(Jeremy Howard (Admin)) #4

Yes it’s ln. I’ve made it a wiki thread so you can edit any mistakes you find. Thanks! :slight_smile:


(Solethu) #5

Hello, I would like to find out how the evaluation values might differ between the test set and the validation set.

I ask this because I entered a Kaggle competition using pretty much most of this code, just tweaked for the relevant competition, and I was shocked to find that the RMSLE score I got was drastically different from what I predicted.


#6

In this example I’m using:

When using this command on any CSV I’ve gotten from Kaggle, I get this error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

I’m using Crestle and have tried wget and curl to get my CSV into Crestle.

I’ve used:
!head /home/nbuser/courses/fastai/courses/ml1/data/black_friday/BlackFriday.csv | head -3

and got:

Any ideas what’s going on? I’ve never had this much of an issue with Pandas read_csv before.


#7

Here is the whole traceback:

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, escapechar, comment, encoding, dialect, tupleize_cols, error_bad_lines, warn_bad_lines, skipfooter, doublequote, delim_whitespace, low_memory, memory_map, float_precision)
676 skip_blank_lines=skip_blank_lines)
677
--> 678 return _read(filepath_or_buffer, kwds)
679
680 parser_f.__name__ = name

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
444
445 try:
--> 446 data = parser.read(nrows)
447 finally:
448 parser.close()

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in read(self, nrows)
1034 raise ValueError('skipfooter not supported for iteration')
1035
-> 1036 ret = self._engine.read(nrows)
1037
1038 # May alter columns / col_dict

/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py in read(self, nrows)
1846 def read(self, nrows=None):
1847 try:
-> 1848 data = self._reader.read(nrows)
1849 except StopIteration:
1850 if self._first_chunk:

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()

ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2


#8

I think I figured it out: most of the data that says .csv is actually .csv.zip when downloaded, and Crestle was having problems unzipping it for some reason. So I downloaded the data to my computer, unzipped it, uploaded it to a free hosting site, and then used wget to get my .csv into Crestle. If there is an easier way, please let me know.
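One thing worth knowing: read_csv can decompress zip archives itself, so a file that is really a .csv.zip doesn’t need a manual unzip step. A self-contained sketch (the CSV content here is made up to stand in for a Kaggle download):

```python
import io
import zipfile
import pandas as pd

# build a toy zip-wrapped CSV in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("BlackFriday.csv", "id,amount\n1,100\n2,250\n")
buf.seek(0)

# read_csv can open the archive directly (one CSV per archive)
df = pd.read_csv(buf, compression="zip")
print(df.shape)  # (2, 2)
```

With a file on disk, `pd.read_csv("BlackFriday.csv.zip")` also works, since pandas infers the compression from the extension.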