Why is splitting done before labelling?

I find I often want to split the data based on the labels.

For example:

  • To get an even representation of all classes in the validation set.
  • To hide certain classes from training.

Labelling involves two things: associating inputs with labels to create training examples, and choosing an appropriate encoding for the label. It is the second part, the encoding, that needs to happen after splitting the data into training and validation sets.

In principle, by choosing your encoding based on the whole dataset you could leak information about the validation set to your model. Having said that, I’m struggling to think of an example that isn’t contrived.

The contrived example I can think of goes something like this: if you encode the class label based on the frequency of each class in the dataset, then you’re capturing information about the validation set in your encoding. Does anyone have a less contrived example?
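A minimal sketch of that leak, using made-up labels (the data and the `frequency_encoding` helper are purely illustrative, not part of any library):

```python
from collections import Counter

# Toy labels: the class balance differs between the two splits.
train_labels = ["cat", "cat", "cat", "dog"]
valid_labels = ["dog", "dog", "cat", "dog"]

def frequency_encoding(labels):
    """Map each class to its relative frequency in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Leaky: frequencies computed over the whole dataset reflect the
# validation labels too.
leaky = frequency_encoding(train_labels + valid_labels)

# Clean: frequencies computed from the training set only.
clean = frequency_encoding(train_labels)

print(leaky)  # {'cat': 0.5, 'dog': 0.5}
print(clean)  # {'cat': 0.75, 'dog': 0.25}
```

The two encodings disagree, and the difference is exactly the information about the validation split that shouldn’t be visible at training time.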

This idea of leaking information from the validation set is much clearer in the context of encoding categorical features rather than labels. You might choose to encode categories based on the average value of the target variable; this is known as ‘target encoding’. If you use the whole dataset to create the encoding, you’re effectively using the target values of the validation set to create features in your training set, which is obviously leaking information.
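Here is a hand-rolled sketch of the difference, using a toy DataFrame (the data is invented; real target encoding would usually go through a library such as category_encoders):

```python
import pandas as pd

# Toy dataset: one categorical feature and a numeric target.
df = pd.DataFrame({
    "city":   ["A", "A", "B", "B", "A", "B"],
    "target": [10,  20,  30,  50,  30,  40],
})
train, valid = df.iloc[:4], df.iloc[4:]

# Leaky target encoding: per-category means computed over the *whole*
# dataset include the validation targets.
leaky_means = df.groupby("city")["target"].mean()

# Correct: fit the encoding on the training rows only...
train_means = train.groupby("city")["target"].mean()

# ...then apply that same mapping to both splits.
train_enc = train["city"].map(train_means)
valid_enc = valid["city"].map(train_means)
```

Note that `train_means["A"]` is 15, while the leaky version gives 20, because the leaky version has averaged in a validation target.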

In scikit-learn, when encoding categorical features or targets, you typically ‘fit’ an encoder on the training set and then apply the resulting transform to your entire dataset (including the validation and test sets).
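For instance, with scikit-learn’s `OrdinalEncoder` (the colour data here is made up):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical feature split into training and validation parts.
X_train = np.array([["red"], ["green"], ["red"]])
X_valid = np.array([["green"], ["blue"]])  # "blue" never appears in training

# Fit the encoder on the training data only...
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(X_train)

# ...then apply the same transform to both splits.
print(enc.transform(X_train))  # green -> 0.0, red -> 1.0
print(enc.transform(X_valid))  # the unseen "blue" maps to -1.0
```

Because the encoder was fitted on the training data, a category that only exists in the validation set has no slot of its own; `handle_unknown` decides what happens to it, which is exactly the kind of decision you want surfaced rather than hidden.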

I agree that there are often reasons to split your training and validation sets based on the label values. Thinking in terms of the above, this kind of splitting is based on the actual label values, not the way they are encoded for your model. It might be nice to separate labelling into its two parts: an association step and an encoding step. Then you could associate the labels, split the data (possibly based on the associated labels), and then encode the labels.


You can definitely have more fine-grained control of this relationship using a dataframe: https://docs.fast.ai/data_block.html#ItemList.split_from_df

If you want to split by labels somehow, then you can always use the split_by_* functions to do the same with a bit more work. Personally, I am not completely sure how we could have a generalized split by labels. For example, if we were going to split by labels, it would be something like: split by 20% of each label.
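For what it’s worth, “split by 20% of each label” is what a stratified split does. Outside the data block API you can compute the indices with scikit-learn and then feed them to split_by_idx (the items and labels here are invented):

```python
from sklearn.model_selection import train_test_split

# Toy labelled items: 10 indices, two classes of 5 each.
items  = list(range(10))
labels = ["a"] * 5 + ["b"] * 5

# stratify=labels takes 20% of *each* class for the validation set.
train_items, valid_items, train_y, valid_y = train_test_split(
    items, labels, test_size=0.2, stratify=labels, random_state=0
)

print(sorted(valid_y))  # ['a', 'b'] - one of each class
```

The resulting `valid_items` index list could then be handed to split_by_idx.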

Even representation of classes is not really a concern, because most of the time we will not have even class representation in the training or validation set anyway. It only matters when dealing with very limited data, such as identification challenges (one example of each “class”).

Hiding certain classes from training does not make sense: if you never show the model a class, it will never learn to identify it.

I feel like if you get to the point where you have to worry about splitting your data by label in a specific way, then you have reached the point where you need split_by_df or split_by_idx. The only time you need to do that is in a limited-data situation, which is very specific to your available data.