Review for Titanic dataset based on Lesson 1

I took the simple Titanic dataset on Kaggle and tried to solve it based on the first lesson. Using RandomForestRegressor, I got an accuracy of 0.78. (Titanic Notebook) It would be great if someone could review it and point out the mistakes / explain a better way to do it. Thanks!
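Roughly, the idea is the following (a minimal sketch, assuming the standard Kaggle train.csv and only a few numeric features):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('train.csv')

# A handful of numeric features; fill missing values with column medians.
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
X = df[features].fillna(df[features].median())
y = df['Survived']

# Random forest as in Lesson 1; the regressor gives continuous predictions,
# so round them to 0/1 to score accuracy.
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X, y)
preds = (m.predict(X) > 0.5).astype(int)

# In-sample accuracy only; a validation set or OOB gives a real estimate.
print('Training accuracy:', (preds == y).mean())
```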

All the outputs are broken. Maybe you want to fix that first?

My bad. I just assumed that if it worked on my machine, it would work on Kaggle. This is the updated link.

My 2 cents:

  • Run apply_cats, proc_df, and predict on test.csv and submit it to Kaggle. It will almost certainly score worse than your validation set.
  • The data size is very small (891 rows), so I wouldn't split it into train/validation sets. Make use of oob_score_ instead (see the sketch after this list).
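Something along these lines (a sketch only; it assumes the fastai 0.7 structured helpers from the course, train_cats / apply_cats / proc_df, and the standard Kaggle train.csv / test.csv):

```python
import pandas as pd
from fastai.structured import train_cats, apply_cats, proc_df  # fastai 0.7 course helpers
from sklearn.ensemble import RandomForestClassifier

train_raw = pd.read_csv('train.csv')
test_raw = pd.read_csv('test.csv')

# Turn string columns into pandas categoricals on the training set,
# then map the *same* category codes onto the test set.
train_cats(train_raw)
apply_cats(test_raw, train_raw)

# proc_df replaces categories with their codes and fills missing values;
# pass the training set's na_dict so the test set is filled consistently.
X_train, y_train, nas = proc_df(train_raw, 'Survived')
X_test, _, _ = proc_df(test_raw, na_dict=nas)

# PassengerId is an identifier, not a feature; also align the test columns
# to the training columns, since test.csv can grow extra *_na columns
# (e.g. Fare has a missing value only in the test set).
X_train = X_train.drop('PassengerId', axis=1)
X_test = X_test[X_train.columns]

# With only 891 rows, skip the validation split and use the OOB estimate.
m = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1)
m.fit(X_train, y_train)
print('OOB accuracy:', m.oob_score_)

# Kaggle submission file.
sub = pd.DataFrame({'PassengerId': test_raw['PassengerId'],
                    'Survived': m.predict(X_test)})
sub.to_csv('submission.csv', index=False)
```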

Thank you! I got an OOB score of 0.64 when training on the entire dataset. Has this model overfitted?

Tuning n_estimators, max_features, and min_samples_leaf in the RandomForestClassifier can help. Mine got an oob_score_ of 0.82; see the sketch below.
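For example, something like this (a sketch; it assumes X_train and y_train prepared as above, and the grid values are just starting points):

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier

best_score, best_params = 0.0, None
# Small grid over the three hyperparameters, scored by OOB accuracy.
for n, mf, msl in product([40, 100, 200], [0.5, 'sqrt', None], [1, 3, 5]):
    m = RandomForestClassifier(n_estimators=n, max_features=mf,
                               min_samples_leaf=msl, oob_score=True,
                               n_jobs=-1, random_state=42)
    m.fit(X_train, y_train)
    if m.oob_score_ > best_score:
        best_score, best_params = m.oob_score_, (n, mf, msl)

print(f'Best OOB score: {best_score:.3f} with '
      f'(n_estimators, max_features, min_samples_leaf) = {best_params}')
```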


Is there a cutoff point for when OOB should be used or not?

It was mentioned in the lecture that with a high number of observations you usually don't need OOB, since you can afford a proper validation set; of course, the Titanic dataset is an exception.