Why split train/valid in ItemList and not in LabelList?

I’d love to understand a design decision behind the v1 data_block API.

If I understand it correctly, one creates an ItemList for the input values, splits it into an ItemLists object holding a separate training and validation ItemList, and then attaches the target values to each of these, so that ItemLists[ItemList] becomes LabelLists[LabelList].

Why is it done in this order? Why not label the data first, i.e. create a LabelList, and then split that LabelList into a LabelLists object with a separate training and validation LabelList?

In my case, I’m working with audio data as input and MIDI data as target. It is a lot more convenient for me to read input and target data from disk together than to look up the right target for each input after the inputs have been reorganised into training and validation sets. Am I using the API wrong, or is my case (reading input and target data at the same time) just a lot less common and hence not what the API was designed for?

Because very often, the state of the transforms that create the labels needs to be determined on the training set: for instance, the classes are determined on the training set, and if the validation set contains labels that never appear in training, they should be mapped to unknown.
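To make that point concrete, here is a minimal plain-Python sketch of the idea. It is not fastai's actual implementation; `build_classes` and `encode` are hypothetical helper names, and the instrument labels are made up for illustration. The class list is built from the training labels only, and any validation label outside it falls into one extra "unknown" index.

```python
def build_classes(train_labels):
    """Collect the label vocabulary from the training labels only."""
    return sorted(set(train_labels))

def encode(labels, classes):
    """Map labels to indices; labels unseen in training share one extra 'unknown' index."""
    class_to_idx = {c: i for i, c in enumerate(classes)}
    unknown_idx = len(classes)  # hypothetical convention: unknown goes last
    return [class_to_idx.get(label, unknown_idx) for label in labels]

train_labels = ["piano", "violin", "piano", "flute"]
valid_labels = ["violin", "cello"]  # "cello" never appears in training

classes = build_classes(train_labels)   # state fitted on the training set
train_ids = encode(train_labels, classes)
valid_ids = encode(valid_labels, classes)
```

If the data were labelled before splitting, `classes` would also contain "cello", and the model would end up with an output class it never sees during training.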

Fair enough, that sounds plausible. For my use case, does the following code seem okay to you, or would you recommend structuring it differently?

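import numpy as np
from pathlib import Path
from fastai.data_block import ItemList, LabelList, LabelLists

# Assume the following are defined:
# class InputBase(ItemBase): ...
# class TargetBase(ItemBase): ...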

class InputList(ItemList):
    def reconstruct(self, example): return InputBase(example)

class TargetList(ItemList):
    def reconstruct(self, example): return TargetBase(example)

class InputTargetList(LabelList):
    @classmethod
    def from_folder(cls, path):
        xs, ys = [], []
        # in my real-world use case, I'm doing a bit more here, but let's keep this simple
        for filename in Path(path).glob("**/*.x"):
            x = np.load(filename)
            y = np.load(filename.with_suffix(".y"))
            xs.append(x); ys.append(y)
        xs, ys = InputList(xs), TargetList(ys)
        return cls(xs, ys)

    def split_by_percent(self, train, valid, test=0):
        assert train >= 0 and valid >= 0 and test >= 0, "you can't pass negative percentages"
        # compare with a tolerance: sums of float percentages are not always exactly 1
        assert abs(train + valid + test - 1) < 1e-8, f"train ({train}), valid ({valid}) and test ({test}) percentages must add up to 1"

        N = len(self)
        i_val, i_test = int(N*train), int(N*(train+valid))

        training_set   = InputTargetList(self.x[:i_val],       self.y[:i_val])
        validation_set = InputTargetList(self.x[i_val:i_test], self.y[i_val:i_test])
        test_set       = InputTargetList(self.x[i_test:],      self.y[i_test:])
        return LabelListsEx(training_set, validation_set, test_set)

class LabelListsEx(LabelLists):
    def __init__(self, train, valid, test):
        self.path = Path(".")
        self.train, self.valid, self.test = train, valid, test

It does look correct, yes
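For reference, the index arithmetic in `split_by_percent` can be tried out on a plain list without fastai. This is only a sketch under stated assumptions: `split_indices` is a hypothetical standalone helper, and it uses `math.isclose` for the sum check, since adding float percentages does not always give exactly 1.

```python
import math

def split_indices(n, train, valid, test=0.0):
    """Return (i_val, i_test): where the validation and test slices start."""
    if min(train, valid, test) < 0:
        raise ValueError("you can't pass negative percentages")
    if not math.isclose(train + valid + test, 1.0):
        raise ValueError("train, valid and test percentages must add up to 1")
    i_val = int(n * train)             # first index of the validation slice
    i_test = int(n * (train + valid))  # first index of the test slice
    return i_val, i_test

items = list(range(10))
i_val, i_test = split_indices(len(items), 0.5, 0.25, 0.25)
train_part = items[:i_val]
valid_part = items[i_val:i_test]
test_part = items[i_test:]
```

Note that this kind of split takes contiguous slices in whatever order the items were read, so shuffling the items beforehand may be worth considering.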
