Thoughts on working with small datasets?

Working through the Kaggle Titanic and Mercedes Benz comps, both of which have fairly small train/test datasets (for example, Titanic has < 900 training examples and < 500 test cases).

What are some best practices to follow when the datasets are small?

Intuitively, it would seem that these kinds of problems are more prone to overfitting and that things like K-fold or stratified K-fold cross-validation would come into play. If anyone deals with small datasets, or even these in particular, I'd really love to see how you deal with this issue.
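For what it's worth, here is a minimal sketch of stratified K-fold cross-validation with scikit-learn. The data is a synthetic stand-in sized roughly like the Titanic train set (the features, class balance, and model choice are just illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a small tabular dataset (~900 rows, like Titanic train)
X, y = make_classification(n_samples=900, n_features=10,
                           weights=[0.62, 0.38], random_state=0)

# Stratified K-fold keeps the class ratio roughly constant in every fold,
# which matters when each fold only holds ~180 examples.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=cv, scoring="accuracy")

print("fold accuracies:", np.round(scores, 3))
print("mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

With so few rows, the spread across folds is often as informative as the mean, since a single lucky split can look much better than the rest.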


I would look at Kaggle problems or ML case studies in the Health domain. Small datasets are unfortunately quite frequent in medical settings where labeled data is expensive and hard to come by.

While there are a number of more specialized approaches, the standard procedure for image data is to augment the training set with affine transformations: random rotations, translations, zooms, and shears.
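The augmentations above can be sketched with `scipy.ndimage`, assuming 2-D image data; the parameter ranges and the helper name `random_affine` are just illustrative:

```python
import numpy as np
from scipy.ndimage import affine_transform

rng = np.random.default_rng(0)

def random_affine(img, max_rot_deg=15, max_shift=3,
                  zoom_range=(0.9, 1.1), max_shear=0.1):
    """Apply a random rotation, shear, zoom and translation to one 2-D image."""
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    zoom = rng.uniform(*zoom_range)
    shear = rng.uniform(-max_shear, max_shear)
    # Compose rotation, shear and zoom into a single 2x2 matrix.
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    shr = np.array([[1.0, shear],
                    [0.0, 1.0]])
    m = (rot @ shr) / zoom
    # Centre the transform on the image, then add a random translation.
    centre = (np.array(img.shape) - 1) / 2
    offset = centre - m @ centre + rng.uniform(-max_shift, max_shift, size=2)
    return affine_transform(img, m, offset=offset, order=1, mode="nearest")

img = rng.random((28, 28))  # stand-in for a small grayscale image
augmented = [random_affine(img) for _ in range(4)]
print([a.shape for a in augmented])
```

In practice you would usually reach for a library's built-in augmentation pipeline rather than hand-rolling the matrices, but the idea is the same: each epoch sees slightly different versions of the same few labeled examples.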

Here is a good case study using small datasets: Heart disease diagnosis


Good info, Nik! Thanks.