Titanic Kaggle Comp - How to determine binning size

wgpubs · October 1, 2017, 7:37pm

Working through the Titanic competition on Kaggle, I noticed that most submissions bin fields like ‘Age’, creating 4-5 buckets of age ranges to use for classification in lieu of the 90 or so distinct values for Age.

My question is: How do you determine the size of your bins?

I haven’t come across any discussion of this in the Titanic notebooks, and it seems like most authors are taking a best-guess approach after looking at how Age affects Survival. Is that how it’s done? Or is there a statistical approach that can/should be taken instead?

machinethink · October 2, 2017, 1:30pm

It kinda makes sense to use 5 bins here: infant, child, young adult, adult, elderly. This is an example of using “domain knowledge” to create your features.
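
To make that concrete, here is a minimal sketch of mapping ages onto those five groups. The cut points (2, 13, 30, 60) are just my guesses for illustration, not values anyone has validated, and `age_bin` is a made-up helper name. In a notebook you would more likely use pandas `pd.cut` with explicit edges, but plain Python shows the idea:

```python
from bisect import bisect_right

# Hypothetical cut points in years (assumptions for illustration only).
AGE_EDGES = [2, 13, 30, 60]
AGE_LABELS = ["infant", "child", "young adult", "adult", "elderly"]

def age_bin(age):
    """Return the label of the bin containing `age`.

    Bins are left-inclusive: [0, 2), [2, 13), [13, 30), [30, 60), [60, ...).
    """
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]

ages = [0.5, 8, 22, 45, 71]
print([age_bin(a) for a in ages])
```

Changing the edges is a one-line edit, so it is cheap to try a few different splits and see which one helps the model.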

For example, a baby may have a higher chance of survival than a child because babies tend to be with their mothers. (I’m not sure if this is actually true, but it’s a reasonable assumption to make that babies and children, as a group, have different survival probabilities.)
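
One way to check an assumption like that is to compute the survival rate inside each bin and see whether the groups really differ. A rough sketch follows; the rows are made up purely to show the mechanics (they are NOT real Titanic data), and the edges and the function name `survival_rate_by_bin` are hypothetical:

```python
from bisect import bisect_right
from collections import defaultdict

def survival_rate_by_bin(rows, edges):
    """Compute the fraction of survivors in each age bin.

    rows:  iterable of (age, survived) pairs, with survived in {0, 1}.
    edges: sorted cut points; bin i covers ages in [edges[i-1], edges[i]).
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [survivors, total]
    for age, survived in rows:
        b = bisect_right(edges, age)
        counts[b][0] += survived
        counts[b][1] += 1
    return {b: s / n for b, (s, n) in sorted(counts.items())}

# Made-up rows for illustration only; replace with the real Age/Survived columns.
toy_rows = [(1, 1), (1.5, 1), (6, 0), (9, 1), (25, 0), (28, 0), (40, 1), (65, 0)]
rates = survival_rate_by_bin(toy_rows, edges=[2, 13, 30, 60])
print(rates)
```

If two neighbouring bins end up with nearly the same rate on the real data, that is a hint they could be merged; if one bin hides a big difference, it may be worth splitting.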

Eva · March 11, 2019, 10:18pm

Maybe we could take a step back and look at what information we are trying to gain by using categories instead of the continuous variable?

I am new to deep learning and this is my first attempt at a Kaggle competition, so sorry if it’s a strange question! If you are still working on it, please let me know and we could discuss strategies.