I’m working on a tabular dataset, and have tried to apply the relatively standard fastai methods:
Neural net:
splits = RandomSplitter()(range_of(df_short))
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
dl = TabularDataLoaders.from_df(df_short, y_names=dep_var,
    cat_names=cat,
    cont_names=cont,
    procs=[Categorify, FillMissing, Normalize], splits=splits)
learn = tabular_learner(dl, metrics=mse,
    loss_func=MSELossFlat(), valid_idx=df_valid)
learn.fit_one_cycle(4, lr= 0.009120108559727669)
The results are not great:
epoch | train_loss | valid_loss | mse | time |
---|---|---|---|---|
0 | 16.669508 | 12.575871 | 12.575871 | 00:27 |
1 | 11.834690 | 11.773903 | 11.773903 | 00:29 |
2 | 9.304846 | 11.929216 | 11.929216 | 00:29 |
3 | 6.391390 | 12.667510 | 12.667510 | 00:28 |
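Reading the table above, training loss keeps falling while validation loss bottoms out at epoch 1 and then climbs, which usually points to overfitting. A minimal sketch that makes this explicit, using only the numbers copied from the table (the variable names are illustrative, not part of the fastai API):

```python
# (epoch, train_loss, valid_loss) copied from the table above
history = [
    (0, 16.669508, 12.575871),
    (1, 11.834690, 11.773903),
    (2, 9.304846, 11.929216),
    (3, 6.391390, 12.667510),
]

# Epoch with the lowest validation loss
best_epoch, _, best_valid = min(history, key=lambda row: row[2])

# Gap between validation and training loss at each epoch;
# a gap that grows over time suggests the model is memorising the training set
gaps = [(epoch, round(valid - train, 6)) for epoch, train, valid in history]

print(best_epoch, best_valid)
print(gaps)
```

If the gap keeps widening like this, fewer epochs or more regularisation may be worth trying, though that is a guess without seeing the data.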
However! When I try a random forest, the results are much better:
splits = RandomSplitter()(range_of(df_short))
procs = [Categorify, FillMissing, Normalize]
cont,cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df_short, procs, cat, cont, y_names=dep_var, splits=splits)
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
def rf(xs, y, n_estimators=40, max_samples=31664,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)
m = rf(xs, y);
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
Resulting in:
(2.070548, 3.202993)
From what I’ve read and learned, the first method should at least be competitive with the second. The complete notebook is here. I cannot share the dataset, so the output is pretty bare; I’m just wondering if there are any obvious mistakes I’m making.
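One thing to check before comparing the two results: the neural net table reports `mse`, while `r_mse` takes a square root, so the two sets of numbers are on different scales. A rough sketch of putting them on the same scale, using only the figures quoted above (the variable names are made up for illustration):

```python
import math

# Validation MSE per epoch, copied from the neural net table
nn_valid_mse = [12.575871, 11.773903, 11.929216, 12.667510]
# Validation RMSE reported for the random forest
rf_valid_rmse = 3.202993

# Taking the square root converts MSE to RMSE so both models use the same units
nn_valid_rmse = [math.sqrt(v) for v in nn_valid_mse]

best_nn_rmse = min(nn_valid_rmse)
print([round(v, 3) for v in nn_valid_rmse])
print(round(best_nn_rmse - rf_valid_rmse, 3))
```

On that scale the neural net's best epoch lands at roughly 3.4 against 3.2 for the forest, a much smaller gap than the raw table suggests.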