I finished my first Kaggle competition and got a surprisingly good result the first time I ran the model on the full dataset.
However, when I split the set into a training set and a validation set, I got worse results.
These are the results for the whole set:
[0.06640609395146409, 0.059749794949568814, 0.9724835235033371, 0.9731294015377561]
And these are the scores for the training and validation sets (I used a validation set of 43 rows, roughly 2% of the whole set):
[0.06867925757638324, 0.15292615297037612, 0.9705674334979815, 0.8239775636830122]
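For concreteness, a minimal sketch of the kind of split I mean is below. The row counts and feature dimensions are hypothetical placeholders, not my actual data; it just shows a shuffled hold-out of 43 rows done with plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 2150  # hypothetical dataset size; 43 rows is roughly 2% of this
X = rng.random((n_rows, 10))  # placeholder features
y = rng.random(n_rows)        # placeholder target

# Shuffle the row indices, then hold out 43 rows for validation
idx = rng.permutation(n_rows)
val_idx, train_idx = idx[:43], idx[43:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(X_train.shape, X_val.shape)  # (2107, 10) (43, 10)
```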
By splitting the set into training and validation sets, I fell well behind on the leaderboard (around the 75th percentile).
Training on the whole set (I hope I didn't make any mistake) put me in first place on the leaderboard.
Why is there such a big difference in the scores? Is it because my training set is too small to split into two sets?
What's the conclusion? Should you only split your set when it is large enough?