Hi all,
I decided to experiment with Kaggle’s Titanic dataset today. I put together a tabular model using fastai’s TabularList and trained it. I think this is the relevant part of my code:
test = TabularList.from_df(train.iloc[valid_idx].copy(), path=path,
                           cat_names=categorical, cont_names=cont_names)
data = (TabularList.from_df(train, cat_names=categorical,
                            cont_names=cont_names, procs=procs)
        .split_by_idx(list(valid_idx))
        .label_from_df(cols=dependent_var)
        .add_test(test)
        .databunch())
learn = tabular_learner(data, layers=[200, 100], metrics=[accuracy, error_rate])
learn.fit_one_cycle(4)
After training, these were my results:
epoch train_loss valid_loss accuracy
1 0.658369 0.661082 0.640000
2 0.517942 0.562670 0.670000
3 0.345470 0.274062 1.000000
4 0.241365 0.139960 1.000000
I was confused when I saw the perfect accuracy, but I submitted my predictions to Kaggle anyway. It came back with a score of 0.57, so the model clearly isn’t doing what I wanted it to.
Until about 15 minutes ago, I thought overfitting meant train loss lower than valid loss, not the other way around. My valid loss here is lower than my train loss, so I guess this counts as overfitting? It happened very quickly, so simply running for fewer epochs doesn’t seem like the right solution. What is? Should I be passing an explicit learning rate to the first fit_one_cycle call? I’ve actually seen this trend of lower valid loss than train loss in most of my experiments after only a few epochs, and I’m not sure how to deal with it.
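For context, here is what I understand overfitting to look like, as a toy stdlib-only sketch (the data and the 1-nearest-neighbour “model” are made up for illustration): a model that memorizes its training set gets perfect accuracy on the points it has seen, but does noticeably worse on held-out points.

```python
import random

random.seed(0)

# Toy data: label is 1 if x > 0.5, with 20% label noise.
def make_point():
    x = random.random()
    y = 1 if x > 0.5 else 0
    if random.random() < 0.2:
        y = 1 - y
    return (x, y)

train_pts = [make_point() for _ in range(50)]
valid_pts = [make_point() for _ in range(50)]

# A 1-nearest-neighbour "model" memorizes the training set outright.
def predict(x):
    nearest = min(train_pts, key=lambda p: abs(p[0] - x))
    return nearest[1]

def accuracy(points):
    return sum(predict(x) == y for x, y in points) / len(points)

print(accuracy(train_pts))  # 1.0 — every training point is its own nearest neighbour
print(accuracy(valid_pts))  # noticeably lower on unseen points
```

My situation looks like the opposite of this (valid loss below train loss), which is part of why I’m confused.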
I’m assuming the perfect accuracy is due to the overfitting, but if it isn’t… what could be going on here?
Thanks!