# Titanic: Why did my score go down with this feature engineering?

Hi all, I just started the course. I’m excited to see where it takes me.

In parallel with this course, I am working on my first Kaggle competition. I based my first submission on a tutorial I found, and am now working on version two. The first tutorial used feature engineering to create a “Title” feature that ended up adding value.

In version two, I engineered a “Deck” feature from the existing “Cabin” feature. Unfortunately, there are 1015 missing cabin entries. After some research into the ship’s layout, I decided to plot the distribution of decks by ticket class. Here is the graphic I produced to visualize the relationship:
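For context, here is roughly how I derived the deck (a minimal sketch on toy data, assuming the cabin strings look like "C85", where the first letter is the deck):

```python
import pandas as pd

# Toy stand-in for the real dataset; real cabin values look like "C85".
joint_df = pd.DataFrame({"Cabin": ["C85", None, "E46", "B28", None]})

# Deck = first character of the Cabin string; missing cabins stay NaN.
joint_df["Deck"] = joint_df["Cabin"].str[0]
```

Rows with a missing Cabin simply end up with a missing Deck, which is what leaves me with the imputation problem below.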

Based on this, I came up with some rough probabilities of where to fit the missing data:

```python
# rough estimate of weights based on the graph
choice_dict = {
    1: [['A', 'B', 'C', 'D', 'E'],
        [.12, .22, .32, .20, .14]],
    2: [['D', 'E', 'F'],
        [.16, .34, .50]],
    3: [['E', 'F', 'G'],
        [1/3, 1/3, 1/3]]
}
```
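One thing worth sanity-checking: `np.random.choice` raises a `ValueError` if the probabilities don't sum to 1, so a quick check on each class's weights can catch typos early (a small sketch using the same dictionary):

```python
import numpy as np

choice_dict = {
    1: [['A', 'B', 'C', 'D', 'E'], [.12, .22, .32, .20, .14]],
    2: [['D', 'E', 'F'], [.16, .34, .50]],
    3: [['E', 'F', 'G'], [1/3, 1/3, 1/3]],
}

# Each class's weight vector must sum to 1 (within floating-point tolerance)
# for np.random.choice(p=...) to accept it.
for pclass, (decks, probs) in choice_dict.items():
    assert np.isclose(sum(probs), 1.0), f"class {pclass} weights sum to {sum(probs)}"
```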

And applied it like so:

```python
import numpy as np

for key, value in choice_dict.items():
    # mask: rows of this passenger class with a missing Deck
    m = (joint_df.Pclass == key) & joint_df.Deck.isnull()
    num = m.sum()
    # draw decks for those rows according to the class weights
    fill_in = np.random.choice(value[0], size=num, p=value[1])
    joint_df.loc[m, 'Deck'] = fill_in
```

I thought I was being clever, but this caused all of my metrics to drop by about 3%. I’m just looking for some insight into why that might be. I don’t know how to ‘tune’ a model yet, so I figured my best bet for improving my score is to work on the preprocessing of the data.
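One alternative I've seen suggested for exactly this situation is to keep the missing decks as their own explicit category rather than imputing them randomly: random fills inject noise the model can't learn from (and change on every run), whereas "cabin unknown" may itself be a useful signal. A minimal sketch on toy data, using "U" as a hypothetical placeholder label:

```python
import pandas as pd

# Toy stand-in for the real dataset.
joint_df = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})

# Missing decks become an explicit "U" (unknown) category instead of a
# random draw, so the result is deterministic and the model can treat
# "no recorded cabin" as information in its own right.
joint_df["Deck"] = joint_df["Cabin"].str[0].fillna("U")
```

If you do stick with random imputation, seeding the generator (e.g. `np.random.seed(...)` before the loop) at least makes your score differences reproducible rather than partly luck of the draw.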

Any insights appreciated.