Structured Data Cleaning Problems - Why are some columns cleaned and not the rest?

Hi friends,

I’m currently working on the Rossmann structured data problem (Lesson 4) and can’t quite wrap my head around some parts of the data cleaning / feature engineering section.

These are the columns which I have found to contain NA values.

CompetitionDistance          True
CompetitionOpenSinceMonth    True
CompetitionOpenSinceYear     True
Promo2SinceWeek              True
Promo2SinceYear              True
PromoInterval                True
Max_VisibilityKm             True
Mean_VisibilityKm            True
Min_VisibilitykM             True
Max_Gust_SpeedKm_h           True
CloudCover                   True
Events                       True
State_DE                     True
dtype: bool
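
For anyone who wants to reproduce a listing like this, here is a minimal sketch. The small DataFrame is a made-up stand-in for the real `joined` table (the column values are hypothetical), so only the `isnull().any()` pattern carries over.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real `joined` DataFrame.
df = pd.DataFrame({
    "Store": [1, 2, 3],
    "CompetitionDistance": [1270.0, np.nan, 620.0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", None, None],
})

has_na = df.isnull().any()   # one boolean per column
na_cols = has_na[has_na]     # keep only the columns that contain at least one NA
print(na_cols)
```

Columns without any NA (here `Store`) drop out of the result, which is why only the `True` rows show up.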

Jeremy then handles the NAs in only the following columns:

    for df in (joined, joined_test):
        df['CompetitionOpenSinceYear'] = df.CompetitionOpenSinceYear.fillna(1900).astype(np.int32)
        df['CompetitionOpenSinceMonth'] = df.CompetitionOpenSinceMonth.fillna(1).astype(np.int32)
        df['Promo2SinceYear'] = df.Promo2SinceYear.fillna(1900).astype(np.int32)
        df['Promo2SinceWeek'] = df.Promo2SinceWeek.fillna(1).astype(np.int32)

  1. Why are these columns handled specifically, and not the rest?

  2. The lesson also mentions that we pick an arbitrary signal that doesn’t otherwise appear in the data to replace these NAs, but filling CompetitionOpenSinceMonth with 1 replaces the NA with a meaningful signal (since 1 corresponds to January).
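
To make both questions concrete, here is a minimal sketch on made-up values. The Series below is hypothetical, not the real column, and this is not a claim about the reasoning behind the notebook. It only shows two things that can be checked directly: a column that still holds NaN cannot be cast to `np.int32`, and after `fillna(1)` the filled rows can no longer be told apart from a genuine January.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for joined.CompetitionOpenSinceMonth.
month = pd.Series([9.0, np.nan, 1.0, np.nan], name="CompetitionOpenSinceMonth")

# Casting a float column that still contains NaN to int32 raises an error...
try:
    month.astype(np.int32)
    cast_ok = True
except (ValueError, TypeError):
    cast_ok = False

# ...whereas filling the NAs first makes the cast possible.
filled = month.fillna(1).astype(np.int32)

# Compare how many rows were NA with how many were a genuine January.
was_na = month.isnull()
genuine_jan = month == 1
print("cast without fillna succeeded:", cast_ok)
print("rows equal to 1 after filling:", int((filled == 1).sum()))
print("of which originally NA:", int(was_na.sum()))
print("of which genuinely January:", int(genuine_jan.sum()))
```

If that distinction matters, one option (an assumption on my part, not something from the lesson) would be to keep `was_na` as a separate boolean column before filling.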

Would greatly appreciate it if someone could help enlighten me on this. 🙂 Thanks in advance!
