Tabular data - noob questions

I’m trying to use fast.ai to process the Titanic data from Kaggle. There are 890 rows and 12 columns. After loading the data into a DataFrame (train_df below), I’m doing:

from fastai.tabular import *   # fastai v1 tabular module

path = '../working/models'
dep_var = 'Survived'
valid_idx = list(range(50))    # first 50 rows held out for validation
procs = [FillMissing, Categorify, Normalize]
cat_names = ['Name', 'Ticket', 'Cabin', 'Sex', 'Embarked']

# train_df is the DataFrame loaded from the Kaggle Titanic data above
data = TabularDataBunch.from_df(path, train_df, dep_var, valid_idx=valid_idx,
                                procs=procs, cat_names=cat_names)
learn = tabular_learner(data, layers=[4000, 2000], metrics=accuracy)
learn.fit_one_cycle(10, max_lr=3e-2)

And getting:
[screenshot of the fit_one_cycle training/validation loss output]

So it looks like I’m overfitting like a champ. A few questions:

  1. Is it even reasonable to use deep learning over such a small dataset?
  2. Am I doing something obviously wrong here? How can I avoid overfitting?
  3. Are there any guidelines for setting the number of neurons in the hidden layers? The only thing I could find was setting the number in the first layer to 4x the number of inputs.
  4. When I do learn.model I get 171 in_features for the first layer. Shouldn’t that be 890, the number of inputs?
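
(For reference, a minimal way to see where that 171 comes from, assuming fastai v1’s TabularModel exposes embeds, n_cont and layers: the first Linear layer’s in_features is the width of one embedded row, i.e. the total width of the categorical embeddings plus the number of continuous columns, not the number of rows in the dataset.)

model = learn.model
n_emb  = sum(e.embedding_dim for e in model.embeds)   # total width of the categorical embeddings
n_cont = model.n_cont                                  # number of continuous columns
print(n_emb, n_cont, n_emb + n_cont)                   # should match the 171 in_features you see
print(model.layers)                                    # the Linear/ReLU/BatchNorm/Dropout stack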

I’d recommend looking at some of the feature engineering kernels available for the competition. Using deep learning here is perfectly reasonable, and after some feature engineering you can get much higher accuracy.


Can you point me to one?

Every high scoring public kernel on the competition uses feature engineering. You could pick any one with a high upvote count.
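
For example, a very common first step in those kernels is pulling the honorific out of Name and feeding it in as a categorical feature instead of the raw name string. A minimal pandas sketch (the 'Title' column name is just illustrative):

# "Braund, Mr. Owen Harris" -> "Mr"
train_df['Title'] = train_df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)
# Use the extracted title as a categorical input rather than the raw Name.
cat_names = ['Title', 'Sex', 'Embarked', 'Cabin', 'Ticket']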


@muellerzr
Hi, is a model with 100% accuracy possible, or am I overfitting?

Thanks.

Double check that your dependent variable isn’t in your cat or cont vars, and check on a separate test set to see. Most likely there is leakage somewhere.
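
A quick sanity check along those lines (just a sketch; cont_names stands in for whatever continuous columns you passed, if any):

# The target must never appear among the model's input columns.
dep_var = 'Survived'
cat_names = ['Name', 'Ticket', 'Cabin', 'Sex', 'Embarked']
cont_names = []   # fill in with your continuous columns, if you set any
assert dep_var not in cat_names + cont_names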

@muellerzr
Ohh… how do I use the test set to predict?
My test set is rows 500 to 600.
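
For reference, a rough sketch of one way to do that with fastai v1, assuming rows 500 to 600 are carved out of the same DataFrame and that trn_df/test_df are just placeholder names (path, dep_var, valid_idx, procs and cat_names as in the first post):

from fastai.tabular import *

# Hold rows 500-600 out so they are never seen during training.
test_df = train_df.iloc[500:601].copy()
trn_df  = train_df.drop(test_df.index).reset_index(drop=True)

# Passing test_df makes fastai apply the same procs to the test rows.
data = TabularDataBunch.from_df(path, trn_df, dep_var, valid_idx=valid_idx,
                                procs=procs, cat_names=cat_names,
                                test_df=test_df)
learn = tabular_learner(data, layers=[200, 100], metrics=accuracy)  # smaller layers just to keep the sketch light
learn.fit_one_cycle(5)

# Class probabilities for the held-out rows; the test set is unlabelled
# inside the DataBunch, so compare the argmax against the true labels yourself.
preds, _ = learn.get_preds(ds_type=DatasetType.Test)
acc = (preds.argmax(dim=1) == torch.tensor(test_df[dep_var].values)).float().mean()
print(acc)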