Overfitting too much on the Titanic dataset

junejasparsh · October 17, 2018, 12:00pm

This is a very basic application of RandomForestRegressor on the Titanic dataset from Kaggle. I have done the minimum cleaning possible and just shoved the data into a random forest.
titanic-1.pdf (21.2 KB)
The result is that it is overfitting badly. How can I improve this model and make it better?
PS: I’m a beginner so please ignore any silly mistakes.

davidh · October 18, 2018, 10:14pm

What happens when you leave out the unique passenger identifiers such as PassengerID, Name, etc.?

junejasparsh · October 19, 2018, 8:36am

When I remove just PassengerID and Name, the results are almost the same. When I remove Ticket along with them, it improves a little, but nothing too solid. It's still below 0.50.
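
For anyone following along, dropping those identifier columns before fitting could look roughly like the sketch below. The rows are made up for illustration (they are not real Titanic records), and the column names assume the usual Kaggle CSV headers, so adjust them to whatever your DataFrame actually contains.

```python
import pandas as pd

# Made-up rows for illustration only; not real Titanic records.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Ticket": ["T-001", "T-002", "T-003"],
    "Pclass": [3, 1, 2],
    "Age": [22.0, 38.0, 26.0],
    "Survived": [0, 1, 1],
})

# Columns that are (nearly) unique per passenger; a tree can memorise them
# on the training set without learning anything that generalises.
id_cols = ["PassengerId", "Name", "Ticket"]

# Keep the remaining columns as features and split off the target.
features = df.drop(columns=id_cols + ["Survived"])
target = df["Survived"]
```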

marcmuc · October 19, 2018, 9:26am

You are using a RandomForestRegressor (like we did in the lesson), but Titanic is a classification problem: we are not predicting a price (a value on a continuous scale, where a prediction that is a few percent off is still a good one) but Survived: yes/no. So you need to use the RandomForestClassifier class instead. The split criterion should then be criterion='gini' rather than mse, but that is the classifier's default, so you should not need to set it yourself. After that your scores should be more similar for train/val.
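
A minimal sketch of the swap, assuming scikit-learn is installed. The data here is a synthetic stand-in with a binary target, not the Titanic table, and the hyperparameters are arbitrary. Note that `score` means different things for the two classes: R^2 for the regressor and accuracy for the classifier, so the numbers are not directly comparable.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a cleaned table with a yes/no target (NOT real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Regressor: treats 0/1 as a continuous value, and score() reports R^2.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
reg_train_r2 = reg.score(X_train, y_train)
reg_valid_r2 = reg.score(X_valid, y_valid)

# Classifier: predicts a class label, and score() reports accuracy.
# criterion defaults to "gini", so it is not passed explicitly.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
clf_train_acc = clf.score(X_train, y_train)
clf_valid_acc = clf.score(X_valid, y_valid)

print(f"regressor  R^2      train={reg_train_r2:.3f} valid={reg_valid_r2:.3f}")
print(f"classifier accuracy train={clf_train_acc:.3f} valid={clf_valid_acc:.3f}")
```

An unconstrained forest will usually still score higher on the training set than on the validation set, so some gap is expected either way.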

Just as a side note: Titanic is a very bad example for trying out lessons 1-3 of the ML course. I had to learn that the hard way. Better to go for a regression problem like the house prices example.

junejasparsh · October 19, 2018, 1:26pm

That’s it, thank you so much @marcmuc. This was really helpful, and I’ll keep that note in mind. Now I see it was a silly mistake after all.