Losing a single test case in categorical data - Where did the row go?


(Matthew Arthur) #1

I have been trying to apply the fast.ai tabular learner to categorical data (for the Kaggle home price competition) and am losing a single result row from my test set: I end up with 1458 rows rather than 1459, not counting the header row. I would appreciate any opinions on how I am going about this. Thanks! (Code follows.)

import pandas as pd
from fastai.tabular import *  # fastai v1
df = pd.read_csv('train.csv', usecols=['Street', 'LotShape', 'SalePrice', 'LotArea', 'SaleCondition', 'LotFrontage', 'MSZoning', 'Utilities', 'YrSold'])
procs = [FillMissing, Categorify, Normalize]
valid_idx = [0, 1459]  # also tried valid_idx = range(len(df)-100, len(df))
dep_var = 'SalePrice'
cat_names = ['Street', 'LotShape', 'SaleCondition', 'MSZoning', 'Utilities', 'YrSold']
cont_names = ['LotArea', 'LotFrontage']
dftest = pd.read_csv('test.csv', usecols=['Street', 'LotShape', 'LotArea', 'SaleCondition', 'LotFrontage', 'MSZoning', 'Utilities', 'YrSold'])
test_df = dftest
data = (TabularList.from_df(df, path=path, cat_names=cat_names, cont_names=cont_names, procs=procs)
        .split_by_idx(valid_idx)
        .label_from_df(cols=dep_var, label_cls=FloatList, log=False)
        .add_test(TabularList.from_df(test_df, path=path, cat_names=cat_names, cont_names=cont_names))
        .databunch())
learn = tabular_learner(data, layers=[200, 100], metrics = exp_rmspe)
learn.fit_one_cycle(5, 1e-5)
preds = learn.get_preds(ds_type='Test')

And the length of preds[1] is 1458, not 1459!
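
A quick sanity check that may help narrow this down is to compare the number of predictions against the size of each split. The sketch below is plain Python and does not call fastai. It assumes train.csv has 1460 rows (the size published for the Kaggle competition, not stated in the post above), and the helper names `split_sizes` and `matching_splits` are hypothetical, introduced only for illustration.

```python
# Row counts: 1459 test rows comes from the post; 1460 training rows is an
# ASSUMPTION based on the published size of the competition's train.csv.
N_TRAIN_CSV = 1460
N_TEST_CSV = 1459
valid_idx = [0, 1459]  # as in the post


def split_sizes(n_rows, valid_idx, n_test):
    """Row counts per split when the rows in valid_idx are held out for validation."""
    n_valid = len(set(valid_idx))
    return {"train": n_rows - n_valid, "valid": n_valid, "test": n_test}


def matching_splits(n_preds, sizes):
    """Names of the splits whose row count equals the number of predictions."""
    return [name for name, n in sizes.items() if n == n_preds]


sizes = split_sizes(N_TRAIN_CSV, valid_idx, N_TEST_CSV)
print(sizes)
print(matching_splits(1458, sizes))
```

Under those assumptions, 1458 matches the training split (1460 rows minus the 2 validation rows) rather than the test set, so the predictions may not be coming from the test data at all. One thing that could be worth checking is the `ds_type` argument: fastai v1 defines a `DatasetType` enum, and `learn.get_preds(ds_type=DatasetType.Test)` is the documented way to request test-set predictions, so the string `'Test'` may not be selecting the test set.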