How does `split_by_idxs` work?

ilovescience · February 16, 2019, 1:39am

I wanted to do a custom split of my dataset into training and validation sets, and looked into the split_by_idxs functionality. How does this function work? If I have a dataframe which I pass into ImageItemList, do I just give it the lists of dataframe indices that correspond to the training and validation sets? Are there any examples of this? There doesn’t seem to be much documentation on this section…

Tom2718 · February 16, 2019, 10:15am

That sounds about right. If the indices 100:1000 were for the validation set and the rest for the training set, you could do something like:

data = (ImageItemList.from_df(df)
        .split_by_idx(list(range(100,1000)))             
        ...)

Otherwise you can use split_by_idxs and set both the training and the validation indices instead of just the validation.
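
As a rough illustration of preparing both index lists for a custom split, here is a minimal sketch in plain Python. The helper `make_split_indices` and its parameters are hypothetical (not part of fastai); it only produces two disjoint lists of 0-based row positions, on the assumption that the items follow the dataframe's row order. The fastai call at the end is left commented out and is an assumed usage, not something verified here.

```python
import random

def make_split_indices(n_rows, valid_frac=0.2, seed=42):
    """Hypothetical helper: return (train_idx, valid_idx) as disjoint lists of row positions."""
    idxs = list(range(n_rows))
    random.Random(seed).shuffle(idxs)   # reproducible shuffle, independent of global state
    n_valid = int(n_rows * valid_frac)
    valid_idx = sorted(idxs[:n_valid])  # first chunk of the shuffled positions
    train_idx = sorted(idxs[n_valid:])  # everything else
    return train_idx, valid_idx

train_idx, valid_idx = make_split_indices(1000)

# Assumed usage with fastai v1 (not run here):
# data = (ImageItemList.from_df(df)
#         .split_by_idxs(train_idx, valid_idx)
#         ...)
```

Any other rule (by patient, by date, by fold) works the same way, as long as the two lists do not overlap.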

soco_loco · February 16, 2019, 5:13pm

Not really adding much to Tom2718’s explanation, but here are the relevant code sections and some adaptations of Tom’s example. Tom, wouldn’t you have to specify valid_idx or train_idx before range(100,1000) in your example, as in mine below? I think you could pass either the train or the valid indices, as a single range or as a list built from several non-contiguous ranges:

data = (ImageItemList.from_df(df)
    .split_by_idx(valid_idx=range(100,1000))             
    ...)

or

data = (ImageItemList.from_df(df)
    .split_by_idx(train_idx=range(100,1000))             
    ...)

split_by_idxs:

def split_by_idxs(self, train_idx, valid_idx):
    "Split the data between `train_idx` and `valid_idx`."
    return self.split_by_list(self[train_idx], self[valid_idx])

split_by_list:

def split_by_list(self, train, valid):
    "Split the data between `train` and `valid`."
    return self._split(self.path, train, valid)
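
To make the chain above easier to follow, here is a toy version in plain Python. `ToyItemList` is a hypothetical stand-in, not the fastai class: it only assumes that indexing with a collection of ints returns a new sub-list, and it returns a plain tuple where the quoted source calls `self._split(self.path, train, valid)`.

```python
class ToyItemList:
    """Hypothetical stand-in for an item list; not the fastai class."""

    def __init__(self, items):
        self.items = list(items)

    def __getitem__(self, idxs):
        # A single int returns one item; a collection of ints returns a new sub-list.
        if isinstance(idxs, int):
            return self.items[idxs]
        return ToyItemList(self.items[i] for i in idxs)

    def split_by_list(self, train, valid):
        # Toy version: just hand back the two sub-lists as a tuple.
        return train, valid

    def split_by_idxs(self, train_idx, valid_idx):
        # Index first, then delegate, mirroring the quoted source.
        return self.split_by_list(self[train_idx], self[valid_idx])

il = ToyItemList(["a.png", "b.png", "c.png", "d.png", "e.png"])
train, valid = il.split_by_idxs([0, 1, 3], [2, 4])
print(train.items)  # ['a.png', 'b.png', 'd.png']
print(valid.items)  # ['c.png', 'e.png']
```

So in this sketch the indices are just positions into the item list, and split_by_idxs does nothing more than select the two subsets and pass them on.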

ilovescience · February 16, 2019, 9:16pm

.split_by_idx is different from .split_by_idxs. It seems .split_by_idx only requires the validation set indices:

split_by_idx(valid_idx: Collection[int]) → ItemLists
Split the data according to the indexes in valid_idx.
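
Since only the validation indices are given, the training set is presumably everything else. A minimal sketch of that complement logic in plain Python follows; `complement_split` is a hypothetical helper, and the library's actual implementation may differ.

```python
def complement_split(n_items, valid_idx):
    """Hypothetical helper: derive train indices as everything not in valid_idx."""
    valid = set(valid_idx)  # set membership keeps the filter below fast
    train_idx = [i for i in range(n_items) if i not in valid]
    return train_idx, sorted(valid)

# 1200 items, positions 100..999 held out for validation
train_idx, valid_idx = complement_split(1200, range(100, 1000))
print(len(train_idx), len(valid_idx))  # 300 900
```

The resulting pair is the same kind of input that split_by_idxs takes, which is why the single-argument version can be seen as a shortcut for it.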

champs.jaideep · January 9, 2021, 11:38am

@ilovescience how can we do this in fastai2…