Understanding splitter - Lesson 6

Hello everyone, how are you doing?

I’m going through Lesson 6 currently and I’m stuck at the splitter. I understand that this part tells the DataBlock how to split the dataset into training and validation sets.

What I’m confused about is this code:

def splitter(df):
    # indices of rows where is_valid is False -> training set
    train = df.index[~df['is_valid']].tolist()
    # indices of rows where is_valid is True -> validation set
    valid = df.index[df['is_valid']].tolist()
    return train, valid

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=splitter,
                   get_x=get_x,
                   get_y=get_y)

The train and valid variables would hold the lists of indexes at which is_valid is False and True respectively. But I’m not sure how this helps the DataBlock split the dataset?

I hope you could help me with this.

Thank you

Hello,

The splitter argument is expected to receive the input items, e.g., image paths, and return the indices for training and validation. For example, if the dataset consists of 6 images with paths ['image0.jpg', 'image1.jpg', 'image2.jpg', 'image3.jpg', 'image4.jpg', 'image5.jpg'], where the last 2 are for validation, splitter should yield ([0, 1, 2, 3], [4, 5]). Subsequently, fastai assigns items with indices [0, 1, 2, 3], that is, ['image0.jpg', 'image1.jpg', 'image2.jpg', 'image3.jpg'], to the training set, and those with indices [4, 5], i.e., ['image4.jpg', 'image5.jpg'], to the validation set.
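Here is a minimal sketch of that contract, just to make it concrete (the file names and the “last two items are validation” rule are made up for illustration):

def last_two_splitter(items):
    # toy splitter: send the last two items to the validation set
    idxs = list(range(len(items)))
    return idxs[:-2], idxs[-2:]

items = ['image0.jpg', 'image1.jpg', 'image2.jpg',
         'image3.jpg', 'image4.jpg', 'image5.jpg']
print(last_two_splitter(items))  # ([0, 1, 2, 3], [4, 5])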

In your case, df contains a column, is_valid, indicating whether each row/item belongs to the validation set or not. Therefore, the training indices are those where df['is_valid'] is False, and the validation indices are those where df['is_valid'] is True.
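As an aside, if I remember correctly, fastai ships a helper that does exactly this, so you could also write (assuming the same get_x/get_y as in your snippet):

from fastai.data.transforms import ColSplitter

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter('is_valid'),  # split on the boolean column
                   get_x=get_x,
                   get_y=get_y)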


Thank you so much for explaining it.

I just have a quick follow-up question. From what you said, I assume the data split (training vs. validation) has already been decided (given the is_valid column).

Does it mean that we have the choice of either following that “guidance” or splitting the dataset in our own way?

Thank you

Exactly, the dataset suppliers have predetermined the validation set and specified it in the is_valid column.

You don’t have to follow df['is_valid'] and can divide the data in whatever fashion you desire, as long as this is not a benchmark or competition where there must be a central validation set to ensure an apples-to-apples comparison amongst the participants. However, designing a good validation set is not always trivial and requires diligence, so you’d be saving yourself the trouble by using the provided split.
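For example (a quick sketch, again assuming the same blocks and get_x/get_y as above), you could ignore is_valid entirely and let fastai’s RandomSplitter pick a random split:

from fastai.data.transforms import RandomSplitter

dblock = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=RandomSplitter(valid_pct=0.2, seed=42),  # random 20% validation set
                   get_x=get_x,
                   get_y=get_y)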


I see.

I assume the problem with randomly splitting the set is that the validation set might be too close to the training set, or the training set might not be representative of the whole dataset?

With that being said, does it mean that if somehow all the data in the dataset (pictures, for example) are kind of “unique” (no repeated person/object/animal, etc.), then, in the case of image classification, it would be better to randomly split the dataset?

Yes, there might be overlap between the training and validation set, which defeats the purpose of having two splits in the first place. This phenomenon is known as leakage, i.e., the training examples “leak” into the validation set.

Even under the assumption that the samples are independently and identically distributed, a random split is not necessarily a prudent choice. For instance, suppose you are developing a super-resolution model for pictures of people and animals, and your dataset consists of 1% people and 99% animals. Randomly splitting the data would yield a similar distribution for both the training and validation set, but you might particularly care about the model’s performance for pictures of people and should hence have a more balanced validation set, perhaps 10% people and 90% animals. Alternatively, you may be aware beforehand that the data distribution during production is, say, 20% people and 80% animals, so again it might be wise for the validation set to reflect the production distribution.
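To make that concrete, here is a rough sketch of how one might build such a people-heavy validation set by hand (the label column, the class names, and the counts are hypothetical, not from the lesson):

import numpy as np

def people_heavy_splitter(df, n_people_valid=100, n_animals_valid=900, seed=0):
    # Hypothetical sketch: assumes a 'label' column with values 'person'/'animal'.
    # Samples a fixed number of each class for validation so the validation set
    # ends up ~10% people, regardless of the 1%/99% mix in the full dataset.
    rng = np.random.default_rng(seed)
    people  = df.index[df['label'] == 'person'].to_numpy()
    animals = df.index[df['label'] == 'animal'].to_numpy()
    valid = np.concatenate([rng.choice(people,  n_people_valid,  replace=False),
                            rng.choice(animals, n_animals_valid, replace=False)])
    train = df.index.difference(valid)  # everything else goes to training
    return list(train), list(valid)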

The same argument extends to classification, although it can be somewhat mitigated through, for example, picking the proper metric or under/oversampling, albeit these do not totally eliminate the problem. For example, in the scenario above, if people’s images are very diverse, the aforementioned would not be of much help, and the only real solution is to include more new, diverse pictures of people in the validation set rather than assigning greater weight to the ones already included.

This might be a bit overwhelming, so my advice is not to worry about it for now. When you are merely toying around with a problem, you can use the provided validation set or randomly split the dataset; ultimately it wouldn’t make a great difference. It is only in the later stages of development that more rigour is needed.

Thank you so much for the detailed response. I really appreciate it. I have learned a lot from this discussion.
