My model has 100% accuracy - what did I do wrong?

Hi all,

I decided to try experimenting with Kaggle’s Titanic dataset today. I put together a model using TabularList and trained it. I think this is the relevant part of my code:

test = TabularList.from_df(train.iloc[valid_idx].copy(), path=path, cat_names=categorical, cont_names=cont_names)

data = (TabularList.from_df(train, cat_names=categorical, cont_names=cont_names, procs=procs)
        .split_by_idx(list(valid_idx))
        .label_from_df(cols=dependent_var)
        .add_test(test)
        .databunch())

learn = tabular_learner(data, layers=[200, 100], metrics=[accuracy, error_rate])
learn.fit_one_cycle(4)

After training, these were my results:

epoch train_loss valid_loss accuracy
1 0.658369 0.661082 0.640000
2 0.517942 0.562670 0.670000
3 0.345470 0.274062 1.000000
4 0.241365 0.139960 1.000000

I was confused when I saw the perfect accuracy, but I tried submitting my results to Kaggle anyway. They gave me a score of 0.57, so it’s pretty clear that the model isn’t doing what I wanted it to.

Until about 15 minutes ago, I thought that overfitting happened when the train loss was less than the valid loss, and not the other way around, so I guess it’s overfitting here. It happened pretty quickly, so just running for fewer epochs doesn’t seem like the right solution. What is? Should I be passing in an explicit learning rate for the first run? I’ve actually seen this trend of valid loss dropping below train loss in most of my experiments after running for very few epochs, and I’m not sure how to deal with it.

I’m assuming the perfect accuracy is due to the overfitting, but if it’s not… what could be going on here?

Thanks!

You’re not overfitting - you’re underfitting. By the end of training, your training loss would normally be below your validation loss, and here it is still above it. As to the 100% accuracy, how big is your validation set? If it is too small, the model might just ‘get lucky’ and get 100% on those few data points. A common choice is to hold out about 20% of the training set for validation.
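For what it’s worth, here is a minimal sketch of picking a random 20% of the rows as validation indices. It uses only the standard library; make_valid_idx, valid_pct and seed are names I made up for illustration, not part of fastai.

```python
import random

def make_valid_idx(n_rows, valid_pct=0.2, seed=42):
    """Pick a random valid_pct share of row positions for validation."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_valid = int(n_rows * valid_pct)
    return sorted(rng.sample(range(n_rows), n_valid))

# Kaggle's Titanic train.csv has 891 rows
valid_idx = make_valid_idx(891)
print(len(valid_idx))  # 178
```

A list like this can be passed to split_by_idx() in the same way as the valid_idx in your code.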

Looking at the code, I think your test set and validation set are identical as both are using valid_idx to select rows.

The call to split_by_idx() is what selects the validation set and the call to add_test() adds your test data.
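A quick sanity check along these lines would catch that kind of overlap. This is only a sketch: shared_rows is a hypothetical helper, the ID values are made up, and it assumes every row carries a unique ID such as Titanic’s PassengerId.

```python
def shared_rows(valid_ids, test_ids):
    """Return the IDs that appear in both the validation and test sets."""
    return sorted(set(valid_ids) & set(test_ids))

# Toy PassengerId values, made up for illustration
valid_ids = [3, 8, 15, 42]
leaky_test_ids = [3, 8, 15, 42]      # test built from train.iloc[valid_idx]
separate_test_ids = [892, 893, 894]  # test built from a separate file

print(shared_rows(valid_ids, leaky_test_ids))     # [3, 8, 15, 42]
print(shared_rows(valid_ids, separate_test_ids))  # []
```

An empty result is what you want. For the Kaggle submission, the test rows would come from the competition’s test.csv rather than from train.iloc[valid_idx].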

Thanks for clarifying that, Tom.

Peter: Yup, I think that was the problem. Thanks!