ML Lecture 2 - Random Forests (with in-class notes)
Note: I had to run
$source activate fastai
first so the library imports work; otherwise you may hit OpenCV errors.
Load the libraries from the first two notebook cells:
%load_ext autoreload
%autoreload 2
# load libraries and edit the modules so you can use them
%matplotlib inline
# allows inline plotting
import sys
sys.path.append("/Users/tlee010/Desktop/github_repos/fastai/") # go to parent dir
from fastai.imports import *
from fastai.structured import *
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor
from IPython.display import display
from sklearn import metrics
Where does a particular Python command come from?
The leftmost term in the output is the library the command comes from:
display
<function IPython.core.display.display>
What else can we learn about a command?
?display
How do we see its source code?
??display
What parameters does this python function take?
SHIFT + TAB
SHIFT + TAB x2 brings up documentation
SHIFT + TAB x3 new window with documentation
PATH = '/Users/tlee010/Desktop/github_repos/fastai/data/bul'
What is git ignore?
It’s a hidden file (.gitignore) listing files you don’t want replicated to the server. You can use wildcards to exclude large numbers of files; for instance, tmp* ignores everything prefixed with tmp. You can also put a .gitignore in subdirectories.
This is really useful for ignoring:
- large datasets (don’t want to post on github)
- credentials (keys)
- configuration files
- backup files
- scratch files
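A minimal .gitignore sketch covering those cases (the patterns are illustrative, not from the lecture):

```
# large datasets
data/
*.zip

# credentials (keys)
*.pem
.env

# configuration, backup, and scratch files
config.local
*.bak
tmp*
```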
What is a symlink?
It lets you alias a folder, pointing one path at another.
# documentation for creating a symlink
$man ln
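From Python, the same thing can be sketched with os.symlink (the paths here are throwaway examples, not the lecture's folders):

```python
import os
import tempfile

base = tempfile.mkdtemp()
target = os.path.join(base, "data")        # the real folder
link = os.path.join(base, "fastai_data")   # the alias

os.mkdir(target)
os.symlink(target, link)  # shell equivalent: ln -s target link

print(os.path.islink(link))                                # True
print(os.path.realpath(link) == os.path.realpath(target))  # True
```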
Let’s get the data
Download (need to login) and unzip
If you don’t have unzip:
$brew install unzip
$unzip file.zip
check the file location
!ls -l /Users/tlee010/kaggle/bulldozers/
total 246368
-rwxr-xr-x@ 1 tlee010 staff 116403970 Jan 24 2013 Train.csv
-rw-r--r--@ 1 tlee010 staff 9732240 Oct 26 14:20 Train.zip
Let’s look at the path
Wrap {PATH} in braces to substitute the Python variable into the bash command:
!ls {PATH}
ls: /Users/tlee010/Desktop/github_repos/fastai/data/bul: No such file or directory
How can you download straight to AWS? - Firefox trick
CTRL-SHIFT-I for developer console within Firefox
- Hit download and track the actual link being requested.
- Pause the download.
- Right-click the download record; you should get a long curl string carrying the cookies / auth details. Paste it into your AWS terminal, then add a ‘-o filename.zip’. It should download very quickly.
Bulldozers
Predicting the auction sale price for a piece of heavy equipment, to create a “blue book” for bulldozers.
The key fields in train.csv are:
- SalesID: the unique identifier of the sale
- MachineID: the unique identifier of a machine. A machine can be sold multiple times
- saleprice: what the machine sold for at auction (only provided in train.csv)
- saledate: the date of the sale
Look at the Data
!head -3 /Users/tlee010/kaggle/bulldozers/Train.csv
!wc /Users/tlee010/kaggle/bulldozers/Train.csv
401126 8009543 116403970 /Users/tlee010/kaggle/bulldozers/Train.csv
Using dataframes (Pandas)
import pandas as pd
df_raw = pd.read_csv('/Users/tlee010/kaggle/bulldozers/Train.csv', low_memory=False, parse_dates=["saledate"])
Quick note:
We added the parse_dates option to the read to force date typing. For a full list of parameters and options, check the pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
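A tiny stand-in CSV (not the bulldozers data) shows what parse_dates changes:

```python
import io
import pandas as pd

csv_text = "SalesID,saledate\n1,11/16/2006 0:00\n2,3/26/2004 0:00\n"

without = pd.read_csv(io.StringIO(csv_text))
with_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(without.saledate.dtype)     # object -- plain strings
print(with_dates.saledate.dtype)  # datetime64[ns]
```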
Look at the first few rows
df_raw.head(10)
SalesID | SalePrice | MachineID | ModelID | datasource | auctioneerID | YearMade | MachineHoursCurrentMeter | UsageBand | saledate | ... | Undercarriage_Pad_Width | Stick_Length | Thumb | Pattern_Changer | Grouser_Type | Backhoe_Mounting | Blade_Type | Travel_Controls | Differential_Type | Steering_Controls | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1139246 | 66000 | 999089 | 3157 | 121 | 3.0 | 2004 | 68.0 | Low | 11/16/2006 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
1 | 1139248 | 57000 | 117657 | 77 | 121 | 3.0 | 1996 | 4640.0 | Low | 3/26/2004 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
2 | 1139249 | 10000 | 434808 | 7009 | 121 | 3.0 | 2001 | 2838.0 | High | 2/26/2004 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 1139251 | 38500 | 1026470 | 332 | 121 | 3.0 | 2001 | 3486.0 | High | 5/19/2011 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1139253 | 11000 | 1057373 | 17311 | 121 | 3.0 | 2007 | 722.0 | Medium | 7/23/2009 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 | 1139255 | 26500 | 1001274 | 4605 | 121 | 3.0 | 2004 | 508.0 | Low | 12/18/2008 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
6 | 1139256 | 21000 | 772701 | 1937 | 121 | 3.0 | 1993 | 11540.0 | High | 8/26/2004 0:00 | ... | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | Double | NaN | NaN | NaN | NaN | NaN |
7 | 1139261 | 27000 | 902002 | 3539 | 121 | 3.0 | 2001 | 4883.0 | High | 11/17/2005 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | 1139272 | 21500 | 1036251 | 36003 | 121 | 3.0 | 2008 | 302.0 | Low | 8/27/2009 0:00 | ... | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | Double | NaN | NaN | NaN | NaN | NaN |
9 | 1139275 | 65000 | 1016474 | 3883 | 121 | 3.0 | 1000 | 20700.0 | Medium | 8/9/2007 0:00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Standard | Conventional |
10 rows × 53 columns
To see all columns (custom function):
def display_all(df):
with pd.option_context("display.max_rows", 1000):
with pd.option_context("display.max_columns", 1000):
display(df)
display_all(df_raw.tail().transpose())
401120 | 401121 | 401122 | 401123 | 401124 | |
---|---|---|---|---|---|
SalesID | 6333336 | 6333337 | 6333338 | 6333341 | 6333342 |
SalePrice | 10500 | 11000 | 11500 | 9000 | 7750 |
MachineID | 1840702 | 1830472 | 1887659 | 1903570 | 1926965 |
ModelID | 21439 | 21439 | 21439 | 21435 | 21435 |
datasource | 149 | 149 | 149 | 149 | 149 |
auctioneerID | 1 | 1 | 1 | 2 | 2 |
YearMade | 2005 | 2005 | 2005 | 2005 | 2005 |
MachineHoursCurrentMeter | NaN | NaN | NaN | NaN | NaN |
UsageBand | NaN | NaN | NaN | NaN | NaN |
saledate | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-11-02 00:00:00 | 2011-10-25 00:00:00 | 2011-10-25 00:00:00 |
fiModelDesc | 35NX2 | 35NX2 | 35NX2 | 30NX | 30NX |
fiBaseModel | 35 | 35 | 35 | 30 | 30 |
fiSecondaryDesc | NX | NX | NX | NX | NX |
fiModelSeries | 2 | 2 | 2 | NaN | NaN |
fiModelDescriptor | NaN | NaN | NaN | NaN | NaN |
ProductSize | Mini | Mini | Mini | Mini | Mini |
fiProductClassDesc | Hydraulic Excavator, Track - 3.0 to 4.0 Metric... | Hydraulic Excavator, Track - 3.0 to 4.0 Metric... | Hydraulic Excavator, Track - 3.0 to 4.0 Metric... | Hydraulic Excavator, Track - 2.0 to 3.0 Metric... | Hydraulic Excavator, Track - 2.0 to 3.0 Metric... |
state | Maryland | Maryland | Maryland | Florida | Florida |
ProductGroup | TEX | TEX | TEX | TEX | TEX |
ProductGroupDesc | Track Excavators | Track Excavators | Track Excavators | Track Excavators | Track Excavators |
Drive_System | NaN | NaN | NaN | NaN | NaN |
Enclosure | EROPS | EROPS | EROPS | EROPS | EROPS |
Forks | NaN | NaN | NaN | NaN | NaN |
Pad_Type | NaN | NaN | NaN | NaN | NaN |
Ride_Control | NaN | NaN | NaN | NaN | NaN |
Stick | NaN | NaN | NaN | NaN | NaN |
Transmission | NaN | NaN | NaN | NaN | NaN |
Turbocharged | NaN | NaN | NaN | NaN | NaN |
Blade_Extension | NaN | NaN | NaN | NaN | NaN |
Blade_Width | NaN | NaN | NaN | NaN | NaN |
Enclosure_Type | NaN | NaN | NaN | NaN | NaN |
Engine_Horsepower | NaN | NaN | NaN | NaN | NaN |
Hydraulics | Auxiliary | Standard | Auxiliary | Standard | Standard |
Pushblock | NaN | NaN | NaN | NaN | NaN |
Ripper | NaN | NaN | NaN | NaN | NaN |
Scarifier | NaN | NaN | NaN | NaN | NaN |
Tip_Control | NaN | NaN | NaN | NaN | NaN |
Tire_Size | NaN | NaN | NaN | NaN | NaN |
Coupler | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified |
Coupler_System | NaN | NaN | NaN | NaN | NaN |
Grouser_Tracks | NaN | NaN | NaN | NaN | NaN |
Hydraulics_Flow | NaN | NaN | NaN | NaN | NaN |
Track_Type | Steel | Steel | Steel | Steel | Steel |
Undercarriage_Pad_Width | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified |
Stick_Length | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified |
Thumb | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified |
Pattern_Changer | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified | None or Unspecified |
Grouser_Type | Double | Double | Double | Double | Double |
Backhoe_Mounting | NaN | NaN | NaN | NaN | NaN |
Blade_Type | NaN | NaN | NaN | NaN | NaN |
Travel_Controls | NaN | NaN | NaN | NaN | NaN |
Differential_Type | NaN | NaN | NaN | NaN | NaN |
Steering_Controls | NaN | NaN | NaN | NaN | NaN |
display_all(df_raw.describe(include='all').transpose())
count | unique | top | freq | first | last | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SalesID | 401125 | NaN | NaN | NaN | NaN | NaN | 1.91971e+06 | 909021 | 1.13925e+06 | 1.41837e+06 | 1.63942e+06 | 2.24271e+06 | 6.33334e+06 |
SalePrice | 401125 | NaN | NaN | NaN | NaN | NaN | 31099.7 | 23036.9 | 4750 | 14500 | 24000 | 40000 | 142000 |
MachineID | 401125 | NaN | NaN | NaN | NaN | NaN | 1.2179e+06 | 440992 | 0 | 1.0887e+06 | 1.27949e+06 | 1.46807e+06 | 2.48633e+06 |
ModelID | 401125 | NaN | NaN | NaN | NaN | NaN | 6889.7 | 6221.78 | 28 | 3259 | 4604 | 8724 | 37198 |
datasource | 401125 | NaN | NaN | NaN | NaN | NaN | 134.666 | 8.96224 | 121 | 132 | 132 | 136 | 172 |
auctioneerID | 380989 | NaN | NaN | NaN | NaN | NaN | 6.55604 | 16.9768 | 0 | 1 | 2 | 4 | 99 |
YearMade | 401125 | NaN | NaN | NaN | NaN | NaN | 1899.16 | 291.797 | 1000 | 1985 | 1995 | 2000 | 2013 |
MachineHoursCurrentMeter | 142765 | NaN | NaN | NaN | NaN | NaN | 3457.96 | 27590.3 | 0 | 0 | 0 | 3025 | 2.4833e+06 |
UsageBand | 69639 | 3 | Medium | 33985 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
saledate | 401125 | 3919 | 2009-02-16 00:00:00 | 1932 | 1989-01-17 00:00:00 | 2011-12-30 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
fiModelDesc | 401125 | 4999 | 310G | 5039 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
fiBaseModel | 401125 | 1950 | 580 | 19798 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
fiSecondaryDesc | 263934 | 175 | C | 43235 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
fiModelSeries | 56908 | 122 | II | 13202 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
fiModelDescriptor | 71919 | 139 | L | 15875 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ProductSize | 190350 | 6 | Medium | 62274 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
fiProductClassDesc | 401125 | 74 | Backhoe Loader - 14.0 to 15.0 Ft Standard Digg... | 56166 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
state | 401125 | 53 | Florida | 63944 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ProductGroup | 401125 | 6 | TEX | 101167 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ProductGroupDesc | 401125 | 6 | Track Excavators | 101167 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Drive_System | 104361 | 4 | Two Wheel Drive | 46139 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Enclosure | 400800 | 6 | OROPS | 173932 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Forks | 192077 | 2 | None or Unspecified | 178300 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Pad_Type | 79134 | 4 | None or Unspecified | 70614 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Ride_Control | 148606 | 3 | No | 77685 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Stick | 79134 | 2 | Standard | 48829 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Transmission | 183230 | 8 | Standard | 140328 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Turbocharged | 79134 | 2 | None or Unspecified | 75211 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Blade_Extension | 25219 | 2 | None or Unspecified | 24692 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Blade_Width | 25219 | 6 | 14' | 9615 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Enclosure_Type | 25219 | 3 | None or Unspecified | 21923 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Engine_Horsepower | 25219 | 2 | No | 23937 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Hydraulics | 320570 | 12 | 2 Valve | 141404 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Pushblock | 25219 | 2 | None or Unspecified | 19463 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Ripper | 104137 | 4 | None or Unspecified | 83452 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Scarifier | 25230 | 2 | None or Unspecified | 12719 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Tip_Control | 25219 | 3 | None or Unspecified | 16207 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Tire_Size | 94718 | 17 | None or Unspecified | 46339 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Coupler | 213952 | 3 | None or Unspecified | 184582 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Coupler_System | 43458 | 2 | None or Unspecified | 40430 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Grouser_Tracks | 43362 | 2 | None or Unspecified | 40515 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Hydraulics_Flow | 43362 | 3 | Standard | 42784 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Track_Type | 99153 | 2 | Steel | 84880 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Undercarriage_Pad_Width | 99872 | 19 | None or Unspecified | 79651 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Stick_Length | 99218 | 29 | None or Unspecified | 78820 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Thumb | 99288 | 3 | None or Unspecified | 83093 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Pattern_Changer | 99218 | 3 | None or Unspecified | 90255 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Grouser_Type | 99153 | 3 | Double | 84653 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Backhoe_Mounting | 78672 | 2 | None or Unspecified | 78652 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Blade_Type | 79833 | 10 | PAT | 38612 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Travel_Controls | 79834 | 7 | None or Unspecified | 69923 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Differential_Type | 69411 | 4 | Standard | 68073 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Steering_Controls | 69369 | 5 | Conventional | 68679 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
… now that EDA is complete, let’s model and evaluate
From bulldozer Kaggle overview:
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
Sample submission files can be downloaded from the data page. Submission files should be formatted as follows:
Have a header: "SalesID,SalePrice"
Contain two columns
SalesID: SalesID for the validation set in sorted order
SalePrice: Your predicted price of the sale
About metrics:
It’s not always MSE or RMSE; sometimes it’s profitability. It’s important to identify and understand the target metric.
Let’s make it log(SalePrice)
df_raw.SalePrice = np.log(df_raw.SalePrice)
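The point of the log: ordinary RMSE on the transformed target equals RMSLE on the raw prices (Kaggle often defines RMSLE with log(1+x); at these price magnitudes plain log is nearly identical). A toy check with made-up numbers:

```python
import numpy as np

def rmse(x, y): return np.sqrt(((x - y) ** 2).mean())

actual = np.array([10000.0, 50000.0])  # made-up sale prices
pred = np.array([12000.0, 45000.0])    # made-up predictions

# RMSLE computed directly on raw prices
rmsle = rmse(np.log(pred), np.log(actual))

# plain RMSE after log-transforming target and predictions -- the same number,
# which is why the cell above takes np.log(df_raw.SalePrice)
log_actual, log_pred = np.log(actual), np.log(pred)
assert np.isclose(rmse(log_pred, log_actual), rmsle)
```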
RandomForestRegressor - for continuous targets
RandomForestClassifier - for binary or multiclass categorical targets
Random forests in general are trivially parallelizable: it’s easy to distribute the workload across CPU cores. This is triggered by the parameter:
n_jobs=-1
# scikit-learn works the same way for all algorithms:
# instantiate a model object (it starts with defaults, unfitted)
m = RandomForestRegressor(n_jobs=-1)
# then send it the data so the model can be 'FIT'
m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-37-c310081c33a1> in <module>()
1 m = RandomForestRegressor(n_jobs=-1)
----> 2 m.fit(df_raw.drop('SalePrice', axis=1), df_raw.SalePrice)
~/anaconda/envs/fastai/lib/python3.6/site-packages/sklearn/ensemble/forest.py in fit(self, X, y, sample_weight)
245 """
246 # Validate or convert input data
--> 247 X = check_array(X, accept_sparse="csc", dtype=DTYPE)
248 y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
249 if sample_weight is not None:
~/anaconda/envs/fastai/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
431 force_all_finite)
432 else:
--> 433 array = np.array(array, dtype=dtype, order=order, copy=copy)
434
435 if ensure_2d:
ValueError: could not convert string to float: 'Conventional'
Stack trace - traces the error through all the nested function calls.
Check the bottom line for the true error: “could not convert string to float: ‘Conventional’”.
This dataset contains a mix of continuous and categorical variables.
The following method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities.
add_datepart(df_raw, 'saledate')
df_raw.saleYear.head()
0 2006
1 2004
2 2004
3 2011
4 2009
Name: saleYear, dtype: int64
The categorical variables are currently stored as strings, which is inefficient and doesn’t provide the numeric coding required for a random forest. Therefore we call train_cats
to convert strings to pandas categories.
Random forests can handle categorical variables once they are encoded as numbers.
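A plain-pandas sketch of what that conversion buys you (toy column, not the fastai helper itself):

```python
import pandas as pd

df = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", "High", None]})

# the same idea as train_cats: strings become an ordered categorical
df.UsageBand = df.UsageBand.astype("category").cat.as_ordered()

print(df.UsageBand.cat.categories.tolist())  # ['High', 'Low', 'Medium']
print(df.UsageBand.cat.codes.tolist())       # [1, 0, 2, 0, -1] -- -1 marks missing
```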
??add_datepart
Let’s look under the hood
Signature: add_datepart(df, fldname)
Source:
def add_datepart(df, fldname):
fld = df[fldname]
targ_pre = re.sub('[Dd]ate$', '', fldname)
for n in ('Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start'):
df[targ_pre+n] = getattr(fld.dt,n.lower())
df[targ_pre+'Elapsed'] = (fld - fld.min()).dt.days
df.drop(fldname, axis=1, inplace=True)
File: ~/Desktop/github_repos/fastai/fastai/structured.py
Type: function
Let’s walk through the function: it uses a regex to strip ‘date’ from the field name, then replaces the date column with multiple columns such as ‘Year’, ‘Month’, ‘Day’, etc.
Are they useful? Include them, plus every variant that might be useful: max, min, mean.
What about the curse of dimensionality - too many columns versus rows? In practice that doesn’t happen. The more data, the better.
No Free Lunch Theorem
Across all possible datasets (random or otherwise), no one technique is uniformly better than another.
Free Lunch Theorem
In practice, random forests are the best technique for most cases, mainly because most real-world problems are not random datasets.
Replace categories
train_cats(df_raw)
??train_cats
Signature: train_cats(df)
Source:
def train_cats(df):
for n,c in df.items():
if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()
File: ~/Desktop/github_repos/fastai/fastai/structured.py
Type: function
df_raw.UsageBand.cat.categories
Index(['High', 'Low', 'Medium'], dtype='object')
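Those categories came out alphabetical, but the natural order is Low < Medium < High. One way to impose that (a sketch; the lecture may choose a different ordering):

```python
import pandas as pd

s = pd.Series(["Low", "High", "Medium"]).astype("category").cat.as_ordered()
s = s.cat.set_categories(["Low", "Medium", "High"], ordered=True)

print(s.cat.categories.tolist())  # ['Low', 'Medium', 'High']
print((s > "Low").tolist())       # [False, True, True] -- comparisons respect the order
```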
What do we do with missing values? Let’s look at the fraction null per column:
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))
Backhoe_Mounting 0.803872
Blade_Extension 0.937129
Blade_Type 0.800977
Blade_Width 0.937129
Coupler 0.466620
Coupler_System 0.891660
Differential_Type 0.826959
Drive_System 0.739829
Enclosure 0.000810
Enclosure_Type 0.937129
Engine_Horsepower 0.937129
Forks 0.521154
Grouser_Tracks 0.891899
Grouser_Type 0.752813
Hydraulics 0.200823
Hydraulics_Flow 0.891899
MachineHoursCurrentMeter 0.644089
MachineID 0.000000
ModelID 0.000000
Pad_Type 0.802720
Pattern_Changer 0.752651
ProductGroup 0.000000
ProductGroupDesc 0.000000
ProductSize 0.525460
Pushblock 0.937129
Ride_Control 0.629527
Ripper 0.740388
SalePrice 0.000000
SalesID 0.000000
Scarifier 0.937102
Steering_Controls 0.827064
Stick 0.802720
Stick_Length 0.752651
Thumb 0.752476
Tip_Control 0.937129
Tire_Size 0.763869
Track_Type 0.752813
Transmission 0.543210
Travel_Controls 0.800975
Turbocharged 0.802720
Undercarriage_Pad_Width 0.751020
UsageBand 0.826391
YearMade 0.000000
auctioneerID 0.050199
datasource 0.000000
fiBaseModel 0.000000
fiModelDesc 0.000000
fiModelDescriptor 0.820707
fiModelSeries 0.858129
fiProductClassDesc 0.000000
fiSecondaryDesc 0.342016
saleDay 0.000000
saleDayofweek 0.000000
saleDayofyear 0.000000
saleElapsed 0.000000
saleIs_month_end 0.000000
saleIs_month_start 0.000000
saleIs_quarter_end 0.000000
saleIs_quarter_start 0.000000
saleIs_year_end 0.000000
saleIs_year_start 0.000000
saleMonth 0.000000
saleWeek 0.000000
saleYear 0.000000
state 0.000000
dtype: float64
Save the data in feather format
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/raw')
??proc_df
Signature: proc_df(df, y_fld, skip_flds=None, do_scale=False, preproc_fn=None, max_n_cat=None, subset=None)
Source:
def proc_df(df, y_fld, skip_flds=None, do_scale=False,
preproc_fn=None, max_n_cat=None, subset=None):
if not skip_flds: skip_flds=[]
if subset: df = get_sample(df,subset)
df = df.copy()
if preproc_fn: preproc_fn(df)
y = df[y_fld].values
df.drop(skip_flds+[y_fld], axis=1, inplace=True)
for n,c in df.items(): fix_missing(df, c, n)
if do_scale: mapper = scale_vars(df)
for n,c in df.items(): numericalize(df, c, n, max_n_cat)
res = [pd.get_dummies(df, dummy_na=True), y]
if not do_scale: return res
return res + [mapper]
File: ~/Desktop/github_repos/fastai/fastai/structured.py
Type: function
proc_df goes through each column and runs fix_missing:
??fix_missing
Signature: fix_missing(df, col, name)
Source:
def fix_missing(df, col, name):
if is_numeric_dtype(col):
if pd.isnull(col).sum(): df[name+'_na'] = pd.isnull(col)
df[name] = col.fillna(col.median())
File: ~/Desktop/github_repos/fastai/fastai/structured.py
Type: function
If the column is numeric, create a new boolean column (name_na) flagging which values were missing, then replace the missing values with the median of that continuous field.
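Re-creating that recipe in plain pandas on a toy column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"hours": [10.0, np.nan, 30.0, np.nan]})

# the fix_missing recipe: flag the gaps first, then fill with the median
df["hours_na"] = df.hours.isnull()
df["hours"] = df.hours.fillna(df.hours.median())

print(df.hours.tolist())     # [10.0, 20.0, 30.0, 20.0] -- median of [10, 30] is 20
print(df.hours_na.tolist())  # [False, True, False, True]
```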
df, y = proc_df(df_raw, 'SalePrice')
df.columns
Index(['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'fiModelDesc',
'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries', 'fiModelDescriptor',
'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup',
'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type',
'Ride_Control', 'Stick', 'Transmission', 'Turbocharged',
'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower',
'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control',
'Tire_Size', 'Coupler', 'Coupler_System', 'Grouser_Tracks',
'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width',
'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type',
'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth',
'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear',
'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start',
'saleElapsed', 'auctioneerID_na', 'MachineHoursCurrentMeter_na'],
dtype='object')
Run your first RandomForest Model
m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)
0.98305857537865071
Is this overfitting?
Set up the train/validation split
def split_vals(a,n): return a[:n].copy(), a[n:].copy()
n_valid = 12000 # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)
X_train.shape, y_train.shape, X_valid.shape
((389125, 66), (389125,), (12000, 66))
def rmse(x,y): return math.sqrt(((x-y)**2).mean())
def print_score(m):
res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
m.score(X_train, y_train), m.score(X_valid, y_valid)]
if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
print(res)
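A quick hand-checkable sanity test of the rmse helper (toy arrays):

```python
import math
import numpy as np

def rmse(x, y): return math.sqrt(((x - y) ** 2).mean())

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])
# errors are [2, 2], mean squared error is 4, so the root is 2
print(rmse(x, y))  # 2.0
```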
m = RandomForestRegressor(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)
CPU times: user 1min 17s, sys: 666 ms, total: 1min 17s
Wall time: 15.4 s
[0.09067208421281527, 0.2523758498048438, 0.98281767127977293, 0.88625205413504937]