Review for Titanic dataset based on Lesson 1

(Aditya) #1

I tried the simple Titanic dataset on Kaggle, based on the first lesson. Using RandomForestRegressor, I got an accuracy of 0.78. (Titanic Notebook) It would be great if someone could review it and point out the mistakes or explain a better way to do it. Thanks!


All the outputs are broken. Maybe you want to fix them first?

(Aditya) #3

My bad. I just assumed that if it worked on my machine, it would work on Kaggle. This is the updated link.


My 2 cents:

  • Run apply_cats, proc_df and predict on test.csv and submit the result to Kaggle. It will almost certainly score worse than your validation set.
  • The dataset is very small (891 rows), so I wouldn’t split it into train/valid sets. Use oob_score_ instead.
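For anyone unsure what the second point looks like in practice, here is a minimal sketch using scikit-learn directly. The data is a synthetic stand-in of the same size (891 rows) generated with make_classification, not the real Titanic features, and the parameter values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 891-row training set (NOT the real Titanic data).
X, y = make_classification(
    n_samples=891, n_features=8, n_informative=4, random_state=0
)

# With oob_score=True, each row is scored using only the trees whose
# bootstrap sample did not contain it, so no separate validation split
# is needed and all 891 rows are used for training.
model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)

oob_accuracy = model.oob_score_  # accuracy for a classifier
print(f"OOB accuracy: {oob_accuracy:.3f}")
```

Note that oob_score_ means different things for the two model types: for RandomForestClassifier it is accuracy, while for RandomForestRegressor it is R², so the numbers are not directly comparable.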

(Aditya) #5

Thank you! I got an OOB score of 0.64 after training on the entire dataset. Has this model overfitted?


Tuning n_estimators, max_features and min_samples_leaf in the RandomForestClassifier can help. Mine got an oob_score_ of 0.82.
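A rough sketch of that kind of tuning, comparing settings by oob_score_ with a plain loop. The grid values are illustrative guesses rather than the settings that produced 0.82, and the data is again a synthetic 891-row stand-in instead of the real Titanic features.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the 891-row training set (NOT the real Titanic data).
X, y = make_classification(
    n_samples=891, n_features=8, n_informative=4, random_state=0
)

# Hypothetical search grid over the three parameters mentioned above.
grid = {
    "n_estimators": [100, 200],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 3, 5],
}

best_score, best_params = -1.0, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    model = RandomForestClassifier(oob_score=True, random_state=0, **params)
    model.fit(X, y)
    # Keep the combination with the highest out-of-bag accuracy.
    if model.oob_score_ > best_score:
        best_score, best_params = model.oob_score_, params

print(best_params, round(best_score, 3))
```

Because the winning combination is picked using the OOB score itself, that score will be slightly optimistic; the Kaggle submission remains the real check.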