Overfitting too much on the Titanic dataset

This is a very basic application of RandomForestRegressor on the Titanic dataset from Kaggle. I have done the minimum cleaning possible and just shoved the data into a random forest.
titanic-1.pdf (21.2 KB)
The result is that it is overfitting badly. How can I improve this model and make it better?
PS: I’m a beginner so please ignore any silly mistakes.

What happens when you leave out the unique passenger identifiers such as PassengerID, Name, etc.?

When I remove just PassengerID and Name, the results are almost the same. When I remove Ticket along with them, it improves a little, but nothing too solid. It’s still below 0.50.

You are using a RandomForestRegressor (like we did in the lesson), but Titanic is a classification problem. We are not predicting a price, i.e. a value on some scale where a prediction that is a few percent off is still good. We are predicting Survived: yes/no. So you need to use the RandomForestClassifier class instead. If that doesn’t help, also set criterion='gini' instead of mse (though 'gini' is the classifier’s default, so that should happen automatically). After that your train and validation scores should be more similar.
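
To make the difference concrete, here is a minimal sketch of the swap. It does not use the real Titanic data or the notebook from the attachment: the features and the 0/1 target are synthetic stand-ins I made up, and the sizes, seeds and n_estimators are arbitrary assumptions. The thing to look at is that the regressor's score() returns R^2 on fractional predictions, while the classifier's score() returns accuracy on 0/1 labels, so the two numbers are not on the same scale.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features (NOT the real Titanic data):
# three numeric columns and a binary 0/1 target playing the role of Survived.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 3))
noise = rng.normal(scale=1.0, size=800)
y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Regressor: averages tree outputs, so predictions are fractional values
# between 0 and 1, and score() reports R^2.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
reg_train_r2 = reg.score(X_train, y_train)
reg_val_r2 = reg.score(X_val, y_val)
reg_preds = reg.predict(X_val)

# Classifier: predicts the class labels themselves, and score() reports
# mean accuracy. criterion='gini' is the default; it is spelled out here
# only to match the advice above.
clf = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
clf.fit(X_train, y_train)
clf_train_acc = clf.score(X_train, y_train)
clf_val_acc = clf.score(X_val, y_val)
clf_preds = clf.predict(X_val)

print(f"Regressor  R^2      train={reg_train_r2:.3f}  val={reg_val_r2:.3f}")
print(f"Classifier accuracy train={clf_train_acc:.3f}  val={clf_val_acc:.3f}")
```

On your own data you would replace X and y with your cleaned feature frame and the Survived column. The remaining train/validation gap can then be judged in terms of accuracy rather than R^2.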

Just as a sidenote: Titanic is a very bad example for checking out lessons 1-3 of the ML course. I had to learn that the hard way. :slight_smile: Rather go for some regression problem like the house prices example.

That’s it, thank you so much @marcmuc. This was really helpful, and I’ll keep that note in mind. Now I see that it was a silly mistake after all. :smiley: