Regression using fastai DL

joresh · February 18, 2019, 6:37am

I am trying to solve the Kaggle Bulldozer Bluebook challenge using fastai DL. I am using the fastai.tabular functions. Train.csv can be downloaded from https://www.kaggle.com/c/bluebook-for-bulldozers/data

My code is below

from fastai.tabular import *
import pandas as pd
train_df = pd.read_csv(path/'Train.csv', low_memory=False, parse_dates=['saledate'])

#Change SalePrice to log because the evaluation is for RMSLE
train_df.SalePrice = np.log(train_df.SalePrice)
#Change dates to date parts
add_datepart(train_df, 'saledate')
#Add a column for age of bulldozer
train_df['age'] = train_df['saleYear'] - train_df['YearMade']

dep_var = 'SalePrice'
cat_names = ['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID', 'YearMade', 'UsageBand', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries',
    'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission',
    'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System',
    'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
    'Differential_Type', 'Steering_Controls', 'saleYear', 'saleMonth', 'saleWeek', 'saleDay', 'saleDayofweek', 'saleDayofyear', 'saleIs_month_end', 'saleIs_month_start', 'saleIs_quarter_end',
    'saleIs_quarter_start', 'saleIs_year_end', 'saleIs_year_start']
cont_names = ['MachineHoursCurrentMeter', 'saleElapsed', 'age']
procs = [FillMissing, Categorify, Normalize]

#Make a subset for doing a trial
df = train_df.head(5000).copy()
#Change all category columns to category type
for col in cat_names:
    df[col] = df[col].astype('category')

#Create TabularDataBunch
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(list(range(4500,4999)))
        .label_from_df(cols=dep_var, label_cls=FloatList)
        .databunch())

This gives the following error at the line ".label_from_df(cols=dep_var, label_cls=FloatList)": ValueError: Buffer dtype mismatch, expected 'Python object' but got 'unsigned long'

I looked at other threads where users have faced the same issue, but found no workable solution. Is the fastai DL library not suitable for regression problems?

Thanks
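
For context on the log transform in the code above: RMSLE computed on prices is the same quantity as plain RMSE computed on log prices, which is why the target can be logged once before training. Below is a minimal sketch with made-up prices. It assumes the plain log(y) form that matches np.log in the code (some RMSLE definitions use log(1 + y) instead), and the helper names rmse and rmsle are only illustrative.

```python
import math

def rmse(a, b):
    # Root mean squared error between two equal-length sequences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def rmsle(actual, predicted):
    # Root mean squared log error, using plain log(y) (an assumption here).
    return math.sqrt(
        sum((math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )

# Invented prices, for illustration only.
actual = [66000.0, 57000.0, 10000.0]
predicted = [60000.0, 59000.0, 12000.0]

# Log the values once, as the post does with SalePrice.
log_actual = [math.log(v) for v in actual]
log_predicted = [math.log(v) for v in predicted]

# The two numbers agree up to floating-point rounding.
print(rmsle(actual, predicted), rmse(log_actual, log_predicted))
```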

aeguanga · April 8, 2019, 11:53am

Hi Joresh,

Have you found any solution to the issue you described?

I ran into the same problem but can’t figure out how to fix it.

Any help would be greatly appreciated!

chrisduvivier · April 10, 2019, 2:08pm

I’m encountering the same error message while running similar code on tabular. The error looks like it’s coming from pandas, and a quick Google search for this error message leads to several pages.

https://github.com/pandas-dev/pandas/issues/18646
https://github.com/pandas-dev/pandas/issues/20986
But I’m still a beginner at coding, and I can’t find out exactly how to fix this…

muellerzr · April 10, 2019, 2:59pm

I learned to bypass this issue by building a custom item list. Look at the docs and you’ll see that FloatList converts its items to type float at the end. Just change that final conversion to whatever your problem needs. Let me know if you have difficulties with that and I can try to write a snippet.

chrisduvivier · April 11, 2019, 12:52pm

Explicitly converting all categorical variables to the category dtype (they were object before) solved the issue for me. (I am doing binary classification using tabular.)
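
A minimal sketch of that conversion, assuming a small made-up DataFrame in place of Train.csv (the column names are borrowed from this thread, the values are invented). It casts every column listed in cat_names to the pandas category dtype before the frame would be handed to TabularList. Note that the original post already does this on its 5000-row subset and still hit the error, so this may not be sufficient in every case and could depend on the pandas and fastai versions in use.

```python
import pandas as pd

# Toy stand-in for Train.csv; the values are invented for illustration.
df = pd.DataFrame({
    "UsageBand": ["Low", "High", None, "Medium"],
    "state": ["Texas", "Ohio", "Texas", "Iowa"],
    "ModelID": [3157, 77, 3157, 4605],
    "MachineHoursCurrentMeter": [68.0, 4640.0, 2838.0, None],
})

cat_names = ["UsageBand", "state", "ModelID"]

# Cast each categorical column explicitly; missing values stay missing.
for col in cat_names:
    df[col] = df[col].astype("category")

# Continuous columns are left untouched.
print(df.dtypes)
```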