Using proc_df when y column has some missing values


(Vishal Pani) #1

I noticed that in the source code of proc_df(), the y column (or the response variable) is extracted before handling missing values.
Shouldn’t this give errors if we have a y column with missing values?

.
.
.       
        if y_fld is None: y = None
        else:
            if not is_numeric_dtype(df[y_fld]): df[y_fld] = df[y_fld].cat.codes
            y = df[y_fld].values  # <--- y field extracted here and then dropped
            skip_flds += [y_fld]
        df.drop(skip_flds, axis=1, inplace=True)

        if na_dict is None: na_dict = {}
        else: na_dict = na_dict.copy()
        na_dict_initial = na_dict.copy()
        for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)  # <--- missing values fixed here
.
.
.

(Jonas) #2

It’s a bit late, but better late than never. I guess this is because it doesn’t make sense to train your algorithm on a data point (one row) whose dependent variable you don’t know, and you shouldn’t fill it in with an average, since that would make your algorithm inexact. For the test set, all your dependent variables are missing, since they are what you want to predict.
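
As a rough sketch of that idea (not part of proc_df itself), one option is to drop the rows with a missing dependent variable before any processing. The toy DataFrame and the column names "size" and "price" are made up for illustration. The last lines also show that a categorical y with a missing entry gets the code -1 from .cat.codes rather than raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "price" plays the role of y_fld.
df = pd.DataFrame({
    "size":  [50.0, 70.0, np.nan, 90.0],
    "price": [100.0, np.nan, 150.0, 200.0],
})
y_fld = "price"

# Drop only the rows where the dependent variable is missing.
# Rows with a missing feature are kept, since those can still be imputed.
train = df.dropna(subset=[y_fld]).reset_index(drop=True)

# Separate y from the features, mirroring the extraction step quoted above.
y = train[y_fld].values
X = train.drop(columns=[y_fld])

# For a categorical y, pandas encodes a missing value as -1 in .cat.codes.
cat_y = pd.Series(["a", None, "b"], dtype="category")
codes = cat_y.cat.codes
```

After this, y has no NaN left, while the missing value in "size" is still there for the missing-value step to handle.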