Tabular FillMissing - how is the FillMissing proc calculated?

MorP · June 25, 2020, 3:23pm

Hi,

I’m using FillMissing with the median fill strategy.
I’ve noticed (using databunch.show_batch()) that I still get some NaN values in continuous columns after building my databunch.
Here’s my code:

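    from fastai.tabular import *  # fastai v1: FillMissing, Categorify, Normalize, TabularDataBunch

    # FillMissing uses the median strategy by default
    procs = [FillMissing, Categorify, Normalize]

    databunch = TabularDataBunch.from_df(
        path=model_path,
        df=train_and_dev,
        valid_idx=valid_idx,  # rows held out for validation
        cat_names=categorical_columns,
        cont_names=cont_columns,
        dep_var='outcome',
        procs=procs,
        num_workers=0)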

My question is:
Is the median calculated on the whole data frame, or only on the training part of the df (excluding valid_idx)?
I’m asking because I might have some columns that are NaN throughout the training part and have other values only in the validation part.

Thanks!

muellerzr · June 25, 2020, 3:27pm

All the procs are calculated on the training dataset only, and the resulting statistics are then applied to the validation set.
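
To illustrate why NaNs can survive in that situation, here is a minimal sketch in plain pandas. It is not fastai's actual implementation. The toy frame, the column names `a` and `b`, and the index list are made up for illustration. The point is that the median of a column that is entirely NaN in the training rows is itself NaN, so filling with it changes nothing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is entirely NaN in the training rows
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, np.nan],
    "b": [np.nan, np.nan, np.nan, 7.0, 9.0],
})
valid_idx = [3, 4]  # assumed validation rows

train = df.drop(index=valid_idx)
valid = df.loc[valid_idx]

# Medians computed from the training rows only
train_medians = train.median()

# The same training medians are used to fill both splits
train_filled = train.fillna(train_medians)
valid_filled = valid.fillna(train_medians)

# Columns whose training median is NaN cannot be filled this way
unfillable = train_medians[train_medians.isna()].index.tolist()
print("training medians:", train_medians.to_dict())
print("columns still NaN in train:", unfillable)
```

If a check like this flags columns in your real data, one option is to drop them or fill them with a constant before building the databunch.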