Review for Titanic dataset based on Lesson 1

I took the simple Titanic dataset on Kaggle and tried to solve it based on the first lesson. Using RandomForestRegressor, I got an accuracy of 0.78. (Titanic Notebook) It would be great if someone could review it and point out the mistakes / explain a better way to do it. Thanks!
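Roughly, the idea is the following (a minimal sketch, assuming the standard Kaggle train.csv and only a few numeric features):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('train.csv')

# A handful of numeric features; fill missing values with column medians.
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].fillna(df[features].median())
y = df['Survived']

# Random forest as in Lesson 1; the regressor gives continuous predictions,
# so round them to 0/1 to score accuracy.
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X, y)
preds = (m.predict(X) > 0.5).astype(int)

# In-sample accuracy only; a validation set or OOB gives a real estimate.
print('Training accuracy:', (preds == y).mean())
```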

All the outputs are broken. Maybe you want to fix that first?

My bad. I just assumed that if it worked on my machine, it would work on Kaggle. This is the updated link.

My 2 cents:

  • Run apply_cats, proc_df, and predict on test.csv and submit it to Kaggle. It will almost certainly score worse than your validation set.
  • The data size is very small (891 rows), so I wouldn't split it into train/validation sets. Make use of oob_score_ instead (see the sketch after this list).
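Something along these lines (a sketch only; it assumes the fastai 0.7 structured helpers from the course, train_cats / apply_cats / proc_df, and the standard Kaggle train.csv / test.csv):

```python
import pandas as pd
from fastai.structured import train_cats, apply_cats, proc_df  # fastai 0.7 course helpers
from sklearn.ensemble import RandomForestClassifier

train_raw = pd.read_csv('train.csv')
test_raw = pd.read_csv('test.csv')

# Turn string columns into pandas categoricals on the training set,
# then map the *same* category codes onto the test set.
train_cats(train_raw)
apply_cats(test_raw, train_raw)

# proc_df replaces categories with their codes and fills missing values;
# pass the training set's na_dict so the test set is filled consistently.
X_train, y_train, nas = proc_df(train_raw, 'Survived')
X_test, _, _ = proc_df(test_raw, na_dict=nas)

# PassengerId is an identifier, not a feature; also align the test columns
# to the training columns, since test.csv can grow extra *_na columns
# (e.g. Fare has a missing value only in the test set).
X_train = X_train.drop('PassengerId', axis=1)
X_test = X_test[X_train.columns]

# With only 891 rows, skip the validation split and use the OOB estimate.
m = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1)
m.fit(X_train, y_train)
print('OOB accuracy:', m.oob_score_)

# Kaggle submission file.
sub = pd.DataFrame({'PassengerId': test_raw['PassengerId'],
                    'Survived': m.predict(X_test)})
sub.to_csv('submission.csv', index=False)
```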

Thank you! I got an OOB score of 0.64 when training on the entire dataset. Has this model overfitted?

Tuning n_estimators, max_features, and min_samples_leaf in the RandomForestClassifier can help. Mine got an oob_score_ of 0.82; see the sketch below.
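For example, something like this (a sketch; it assumes X_train and y_train prepared as above, and the grid values are just starting points):

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier

best_score, best_params = 0.0, None
# Small grid over the three hyperparameters, scored by OOB accuracy.
for n, mf, msl in product([40, 100, 200], [0.5, 'sqrt', None], [1, 3, 5]):
    m = RandomForestClassifier(n_estimators=n, max_features=mf,
                               min_samples_leaf=msl, oob_score=True,
                               n_jobs=-1, random_state=42)
    m.fit(X_train, y_train)
    if m.oob_score_ > best_score:
        best_score, best_params = m.oob_score_, (n, mf, msl)

print(f'Best OOB score: {best_score:.3f} with '
      f'(n_estimators, max_features, min_samples_leaf) = {best_params}')
```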


Is there a cutoff point for when OOB should be used or not?

It was mentioned in the lecture that with a high number of observations you usually don't need OOB, since you can afford a proper validation set; of course, the Titanic dataset is an exception.