Tabular data - noob questions

I’m trying to use fastai’s tabular tools to process the Titanic data from Kaggle. There are 890 rows of data in 12 columns. So after loading the data into a data frame, I’m doing:

from fastai.tabular import *  # fastai v1 API

path = '../working/models'
dep_var = 'Survived'
valid_idx = list(range(50))  # first 50 rows are held out for validation
procs = [FillMissing, Categorify, Normalize]
cat_names = ['Name', 'Ticket', 'Cabin', 'Sex', 'Embarked']
# train_df is the Kaggle training CSV loaded into a pandas DataFrame
data = TabularDataBunch.from_df(path, train_df, dep_var, valid_idx=valid_idx, procs=procs, cat_names=cat_names)
learn = tabular_learner(data, layers=[4000,2000], metrics=accuracy)
learn.fit_one_cycle(10, max_lr=3e-2)

And getting:

So it looks like I’m overfitting like a champ. A couple of questions:

  1. Is it even reasonable to use deep learning over such a small dataset?
  2. Am I doing something obviously wrong here? How can I avoid overfitting?
  3. Are there any guidelines for setting the number of neurons in the hidden layers? The only thing I could find was setting the number in the first layer to 4x the number of inputs.
  4. When I do learn.model I get 171 in_features for the first layer. Shouldn’t that be 890, as the number of inputs?

I’d recommend looking at some of the feature engineering kernels that are available for this competition. Using deep learning here is perfectly reasonable, and after some feature engineering you should get much higher accuracy.
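
To give a flavour of what those kernels do, here is a minimal sketch of two features that commonly show up in public Titanic kernels: a title pulled out of `Name`, and a family size built from `SibSp` and `Parch`. The `add_basic_features` helper is made up for illustration and the two sample rows are just there to exercise it; the column names are the ones in the Kaggle CSV.

```python
import pandas as pd

def add_basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Title and FamilySize added (hypothetical helper)."""
    out = df.copy()
    # Names look like "Braund, Mr. Owen Harris": take the text between
    # the comma and the first period that follows it.
    out["Title"] = out["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()
    # Siblings/spouses + parents/children + the passenger themselves.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out

# Illustrative rows in the Kaggle column layout.
sample = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "SibSp": [1, 0],
    "Parch": [0, 0],
})
features = add_basic_features(sample)
print(features[["Title", "FamilySize"]])
```

A column like `Title` only has a handful of distinct values, so it is probably a friendlier entry for `cat_names` than the raw `Name`, which is close to unique per row.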


Can you point me to one?

Every high-scoring public kernel on the competition uses feature engineering. You could pick any one with a high upvote count.


Hi, is a model with 100% accuracy possible, or am I overfitting?


Double check that your dependent variable isn’t in your cat or cont vars, and check the accuracy on a separate test set to see whether it holds up. Most likely there is leakage somewhere.
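
A quick way to run the first check, as a rough sketch: compare the dependent variable against the column lists before building the data. The `find_leaks` helper and the deliberately broken `cont_names` list are made up for illustration.

```python
def find_leaks(dep_var, cat_names, cont_names=()):
    """Return any input column that is the target itself (hypothetical helper)."""
    inputs = list(cat_names) + list(cont_names)
    return [name for name in inputs if name == dep_var]

dep_var = "Survived"
cat_names = ["Name", "Ticket", "Cabin", "Sex", "Embarked"]
# Deliberately wrong for the demo: the target is listed as an input.
cont_names = ["Age", "Fare", "Survived"]

leaks = find_leaks(dep_var, cat_names, cont_names)
if leaks:
    print(f"Target leaked into inputs: {leaks}")
```

This only catches the literal case where the target is listed as an input. A column that was derived from the target would still have to be found by inspecting the data.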

Ohh… how do I use a test set to predict?
My test set is rows 500 to 600.
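
One way to set that up, sketched with plain pandas: slice the held-out rows off before building the training data, so the model never sees them. The stand-in frame below is made up, and reading "rows 500 to 600" as positions 500 through 599 is an assumption. As far as I recall, in fastai v1 the held-out frame can then be passed as `test_df` to `TabularDataBunch.from_df`, and predictions come from `learn.get_preds(ds_type=DatasetType.Test)`; the test set there is treated as unlabeled, so the true labels need to be kept aside for scoring.

```python
import pandas as pd

# Stand-in for the real training frame: 890 made-up rows with a target column.
train_df = pd.DataFrame({
    "PassengerId": range(1, 891),
    "Survived": [i % 2 for i in range(890)],
})

# Assumption: "rows 500 to 600" means positions 500..599 (end excluded).
test_rows = slice(500, 600)
test_df = train_df.iloc[test_rows].reset_index(drop=True)

# Everything else stays available for training and validation.
rest_df = train_df.drop(train_df.index[test_rows]).reset_index(drop=True)

# Keep the true labels aside so predictions can be scored afterwards.
test_labels = test_df["Survived"].copy()

print(len(rest_df), len(test_df))
```

With the split done this way, `rest_df` would take the place of `train_df` when building the data, and accuracy on `test_df` gives a number the model could not have memorised.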