I just discovered an issue where split_by_idx() isn’t working as I thought it did with TabularList.from_df(). Can anyone tell me a good way to specify a validation set exactly with Fastai, specifically for tabular data? (E.g., if I already have a DataFrame train and a DataFrame valid, then after preparing the Fastai data, I want data.valid_ds to hold the same values as valid, and I definitely don’t want data.train_ds to contain values in valid.)
I loaded data as follows:
datas = []
# (Note: I'm using multiple validation and training sets, as I'm doing cross-validation)
for i in range(splits):
    valid = valids[i]
    train = trains[i]
    valid_indx = valid.index
    train_indx = train.index
    data = (TabularList.from_df(df=df[train_cols], path=path, cat_names=cat_names, cont_names=cont_names)
            .split_by_idx(valid_indx)
            .label_from_df(cols=dep_var)
            .databunch())
    datas += [data]
Yet, the following assertion fails:
assert valids[0].index[0] in pd.concat(list(datas[0].valid_ds.x[:]),axis=1).T.index
Somehow, the following is True instead:
valids[0].index[0] in pd.concat(list(datas[0].train_ds.x[:]),axis=1).T.index
This implies that an index from my first validation set, valids[0], is in my first training set, datas[0].train_ds. After looking closer at the data, that seems to be the case, and that is a problem.
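One thing I can check with pandas alone is whether my index labels and row positions actually line up. This is only a sketch under an assumption I have not confirmed: that split_by_idx() treats the values it receives as row positions rather than as DataFrame index labels. The toy frame and the names valid_pos, wrong_rows and right_rows are made up for illustration.

```python
import pandas as pd

# Toy frame whose index labels are NOT 0..n-1 (e.g. after shuffling or filtering).
df = pd.DataFrame({"x": [10, 20, 30, 40, 50]}, index=[4, 2, 0, 3, 1])

# Hypothetical validation subset, selected by label.
valid = df.loc[[0, 1]]

# Row positions of the validation rows inside df.
valid_pos = df.index.get_indexer(valid.index)

# Using the labels as if they were positions picks different rows...
wrong_rows = df.iloc[list(valid.index)]
# ...while the converted positions recover the validation rows.
right_rows = df.iloc[valid_pos]

print(right_rows.equals(valid))  # do the position-based rows match valid?
print(wrong_rows.equals(valid))  # do the label-as-position rows match valid?
```

If the two selections differ on my real data, then passing valid.index directly would put some validation rows into the training split, which would match what I am seeing.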
Note: I run into the same problem if I do the following instead:
data = TabularDataBunch.from_df(df=df[train_cols], path=path, cat_names=cat_names, cont_names=cont_names,
                                valid_idx=valid_indx, dep_var=dep_var)
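A workaround I am considering, again assuming that valid_idx is interpreted as row positions (unverified), is to stack each fold's train and valid frames into one frame with a fresh 0..n-1 index, so that labels and positions coincide. The frames below are hypothetical stand-ins for one fold, and combined and valid_idx are names I made up for this sketch.

```python
import pandas as pd

# Hypothetical stand-ins for one fold's train and valid frames.
train = pd.DataFrame({"x": [1, 2, 3], "y": [0, 1, 0]}, index=[10, 11, 12])
valid = pd.DataFrame({"x": [4, 5], "y": [1, 1]}, index=[20, 21])

# Stack train first, then valid, and reset to a clean 0..n-1 index.
combined = pd.concat([train, valid], ignore_index=True)

# The validation rows now occupy the last len(valid) positions.
valid_idx = list(range(len(train), len(combined)))

# Sanity checks: the selected rows hold the same values as valid,
# and none of the remaining rows come from valid.
picked = combined.iloc[valid_idx].reset_index(drop=True)
rest = combined.drop(index=valid_idx).reset_index(drop=True)
print(picked.equals(valid.reset_index(drop=True)))
print(rest.equals(train.reset_index(drop=True)))
```

The idea would then be to pass combined as df and valid_idx to split_by_idx() (or as valid_idx= to TabularDataBunch.from_df()) for each fold, and rerun my assertion on the result.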