Split_by_idx and valid_idx not working as expected

I just discovered that split_by_idx() doesn’t work the way I thought it did with TabularList.from_df(). Can anyone tell me a good way to specify an exact validation set in fastai, specifically for tabular data? For example, if I already have a DataFrame train and a DataFrame valid, then after preparing the fastai data I want data.valid_ds to hold the same rows as valid, and I definitely don’t want data.train_ds to contain any rows from valid.

I loaded data as follows:

datas = []
# (Note: I'm using multiple validation and training sets, as I'm doing cross-validation)
for i in range(splits):
    valid = valids[i]
    train = trains[i]
    valid_indx = valid.index
    train_indx = train.index
    data = (TabularList.from_df(df=df[train_cols], path=path, cat_names=cat_names, cont_names=cont_names)
                               .split_by_idx(valid_indx)
                               .label_from_df(cols=dep_var)
                               .databunch())
    datas += [data]

Yet, the following assertion fails:

assert valids[0].index[0] in pd.concat(list(datas[0].valid_ds.x[:]),axis=1).T.index

Somehow, the following is True instead:

valids[0].index[0] in pd.concat(list(datas[0].train_ds.x[:]),axis=1).T.index

This implies that an index from my first validation set, valids[0], is in my first training set, datas[0].train_ds. After looking more closely at the data, that does seem to be the case, which is a problem.

Note: I run into the same problem if I do the following instead:

data = TabularDataBunch.from_df(df=df[train_cols], path=path, cat_names=cat_names, cont_names=cont_names,
                                   valid_idx=valid_indx, dep_var=dep_var)

You can see an example of cross validation here:

But essentially you want split_by_idxs

Actually, I’m still seeing a problem: valid never matches data.valid_ds.

actual_valid = pd.concat(list(datas[0].valid_ds.x[:]),axis=1).T # the first actual validation set

# check indices of all desired validation sets against actual_valid
for split in range(splits):
    print(tensor([(idx in valids[split].index) for idx in actual_valid.index]).type(torch.FloatTensor).mean().item())

Returns:

0.19120654463768005
0.21370142698287964
0.20858895778656006
0.1871165633201599
0.19938650727272034

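This implies that only about 20% of the first actual validation set falls in any one of my desired validation sets, including the one it is supposed to match. With five splits, that is roughly what chance overlap would give.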

This assumes that indices were preserved in data.valid_ds and data.train_ds, which appears to be the case.


EDIT: Here’s a cleaner, more general test I wrote that checks indices

# check that valid == data.valid_ds
for s in range(splits):
    data = datas[s]
    valid = valids[s]
    train = trains[s]
    
    actual_valid_idx = pd.concat(list(data.valid_ds.x[:]),axis=1).T.index 
    desired_valid_idx = valid.index
    percent_desired = tensor([(idx in actual_valid_idx) for idx in desired_valid_idx]).type(torch.FloatTensor).mean().item()
    assert percent_desired == 1.
    
    actual_train_idx = pd.concat(list(data.train_ds.x[:]),axis=1).T.index
    desired_train_idx = train.index
    percent_desired = tensor([(idx in actual_train_idx) for idx in desired_train_idx]).type(torch.FloatTensor).mean().item()
    assert percent_desired == 1.
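
The same coverage check can be written without tensors. This is only a sketch: `index_coverage` is a hypothetical helper name, not part of fastai or pandas, and it assumes both arguments are index-like collections of labels.

```python
import pandas as pd

def index_coverage(desired, actual):
    """Return the fraction of labels in `desired` that also appear in `actual`.

    Hypothetical helper for illustration; assumes `desired` is non-empty.
    """
    desired = pd.Index(desired)
    # Index.isin returns a boolean array; its mean is the matching fraction.
    return float(desired.isin(pd.Index(actual)).mean())
```

With the variables from the loop above, `index_coverage(valid.index, actual_valid_idx)` should be 1.0 when the split is the intended one.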

Your reproducer doesn’t work on its own, so I can’t investigate further. But when I build a DataBunch following the method you indicated, with random indices for the split,

actual_valid_idx = pd.concat(list(data.valid_ds.x[:]),axis=1).T.index 

gives me back those exact random split indices.

I figured out the source of the problem. split_by_idxs, valid_idx, and split_by_idx all split the DataFrame by row position (the integers in each array are treated as positions), not by the DataFrame’s actual index labels.
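
A pandas-only sketch of the difference, using a made-up five-row frame. The names `df`, `shuffled`, and `wanted` are illustrative, and the sketch assumes the splitter behaves like positional selection (`iloc`), which matches what I observed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with labels 0..4 in order.
df = pd.DataFrame({"x": [10, 11, 12, 13, 14]})

# Reorder the rows by hand to mimic a shuffle; labels travel with their rows.
shuffled = df.loc[[3, 0, 4, 1, 2]]

wanted = np.array([0, 1])

by_position = shuffled.iloc[wanted]  # the first two rows as stored
by_label = shuffled.loc[wanted]      # the rows whose labels are 0 and 1

print(list(by_position.index))  # [3, 0]
print(list(by_label.index))     # [0, 1]
```

The same integer array picks different rows depending on whether it is read as positions or as labels.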

I neglected to mention that my DataFrame was already shuffled prior to passing it to from_df(), so the first entry of my DataFrame was not index 0. It appears that the easiest way around this is to reset the indices of my shuffled DataFrame prior to making a DataBunch.
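
Two ways this could be handled, sketched with plain pandas on a made-up frame (all names are illustrative). The first translates index labels into row positions with `Index.get_indexer`, which requires the index to be unique; the second resets the index so that labels and positions coincide:

```python
import pandas as pd

# Hypothetical shuffled frame: labels are no longer 0..n-1 in order.
df = pd.DataFrame({"x": [13, 10, 14, 11, 12]}, index=[3, 0, 4, 1, 2])
valid = df.loc[[4, 1]]  # the rows wanted in the validation set

# Option 1: translate labels to positions and leave df untouched.
valid_pos = df.index.get_indexer(valid.index)
assert (valid_pos >= 0).all()  # -1 would mean a label was not found

# Option 2: reset the index so labels and positions are the same thing.
df_reset = df.reset_index(drop=True)
```

With option 1, `valid_pos` is what would be passed as the validation indices. With option 2, the train and validation frames have to be derived from `df_reset` afterwards, otherwise their indices still carry the old labels.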

That said, it might be nice if split_by_idxs were renamed to something like split_by_pos or split_by_positions, or if, at a minimum, the docs mentioned that the DataFrame’s index has to match the row positions (0 to n-1, in order). Otherwise, this could be a source of bugs that go unnoticed for other people.

For clarity:

valid = valids[0]
train = trains[0]

valid_indx = np.array([0,1,2,3,4])
train_indx = np.array(train.index)

data = (TabularList.from_df(df=df[train_cols], path=path, cat_names=cat_names, cont_names=cont_names)
                               .split_by_idxs(train_idx=train_indx, valid_idx=valid_indx)
                               .label_from_df(cols=dep_var)
                               .databunch())

actual_valid_idx = pd.concat(list(data.valid_ds.x[:]),axis=1).T.index

Here, actual_valid_idx equals:
Int64Index([4109, 1351, 3736, 4474, 1659], dtype='int64')

Instead of:
Int64Index([0, 1, 2, 3, 4], dtype='int64')

Additionally, df.head().index equals actual_valid_idx exactly.

I personally think the problem lies with pandas in this instance. The way it treats the index as something different from the position in the DataFrame always confuses me and leads to bugs.

split_by_idx is used for all sorts of collections (lists, arrays, tensors), and pandas is the only one that doesn’t follow the intuition behind regular collections. Maybe a note in the documentation warning users about this when passing a DataFrame would be welcome? We’d gladly merge such a PR.