I noticed that in the source code of proc_df(), the y column (the response variable) is extracted before missing values are handled.
Shouldn't this cause errors if the y column contains missing values?
.
.
.
if y_fld is None: y = None
else:
    if not is_numeric_dtype(df[y_fld]): df[y_fld] = df[y_fld].cat.codes
    y = df[y_fld].values  # <--- y field extracted here and then dropped
    skip_flds += [y_fld]
df.drop(skip_flds, axis=1, inplace=True)
if na_dict is None: na_dict = {}
else: na_dict = na_dict.copy()
na_dict_initial = na_dict.copy()
for n,c in df.items(): na_dict = fix_missing(df, c, n, na_dict)  # <--- missing values fixed here
.
.
.
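To make the question concrete, here is a small sketch that mirrors the order of operations in the snippet above. It is not the library code: `extract_y_then_fill` is a hypothetical stand-in, and the median fill is only a simplified assumption about what fix_missing does to numeric feature columns.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def extract_y_then_fill(df, y_fld):
    """Hypothetical stand-in: pull out y first, then fill the remaining columns."""
    df = df.copy()
    if not is_numeric_dtype(df[y_fld]):
        # Categorical codes encode a missing category as -1.
        df[y_fld] = df[y_fld].cat.codes
    y = df[y_fld].values  # y is extracted before any missing-value handling
    df = df.drop(columns=[y_fld])
    # Assumed, simplified version of fix_missing: median-fill numeric columns.
    for name in df.columns:
        if is_numeric_dtype(df[name]) and df[name].isnull().any():
            df[name] = df[name].fillna(df[name].median())
    return df, y


# Numeric y with one missing value.
num = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, np.nan, 30.0]})
X_num, y_num = extract_y_then_fill(num, "y")

# Categorical y with one missing value.
cat = pd.DataFrame({"x": [1.0, 2.0, 3.0],
                    "y": pd.Categorical(["a", None, "b"])})
X_cat, y_cat = extract_y_then_fill(cat, "y")

print(X_num["x"].tolist())     # feature column after filling
print(np.isnan(y_num).sum())   # number of NaNs still present in numeric y
print(y_cat.tolist())          # categorical y as integer codes
```

In this sketch nothing raises at this step: the feature column gets filled, but the numeric y still contains its NaN and the categorical y carries a -1 code for the missing entry. So the concern is less an immediate error here and more that the missing values in y pass through untouched to whatever consumes y afterwards.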