Chapter 11: `SiameseTransform` - Contains data leakage?

In Chapter 11’s `SiameseTransform`, each validation set image is randomly paired with an image drawn from `files`. However, 80% of the images in `files` belong to the training set. Isn’t this data leakage?

Ideally, the validation set should consist of images the model hasn’t seen before. So, in my opinion, each image in the validation set should be paired with another image drawn from the validation set (and each image in the training set should be paired with another image drawn from the training set). Perhaps we could achieve this by creating two separate methods, `_draw_from_train` and `_draw_from_valid`? (A sketch of that idea follows the reproduced class below.)

Full class reproduced below for convenience:

```python
import random
from fastai.vision.all import *  # provides Transform, L, PILImage
# SiameseImage is defined earlier in the chapter

class SiameseTransform(Transform):
    def __init__(self, files, label_func, splits):
        self.labels = files.map(label_func).unique()
        self.lbl2files = {l: L(f for f in files if label_func(f) == l) for l in self.labels}
        self.label_func = label_func
        # pairs for the validation files are drawn once and fixed here;
        # training files get a fresh draw on every call to encodes
        self.valid = {f: self._draw(f) for f in files[splits[1]]}

    def encodes(self, f):
        f2,t = self.valid.get(f, self._draw(f))
        img1,img2 = PILImage.create(f),PILImage.create(f2)
        return SiameseImage(img1, img2, t)

    def _draw(self, f):
        # 50/50: pick the second image from the same class or from a different class
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same:
            cls = random.choice(L(l for l in self.labels if l != cls))
        return random.choice(self.lbl2files[cls]), same
```
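
For illustration, here is one way the two-method idea could look. This is only a minimal sketch, not the book’s code: the class name `SiameseTransformSplitAware` and the per-split `splbl2files` mapping are made up for this post, and it assumes `SiameseImage` and the fastai imports from the chapter are available.

```python
import random
from fastai.vision.all import *  # provides Transform, L, PILImage
# SiameseImage is assumed to be defined as earlier in the chapter

class SiameseTransformSplitAware(Transform):
    "Sketch: the second image is always drawn from the same split as the first."
    def __init__(self, files, label_func, splits):
        self.label_func = label_func
        self.labels = files.map(label_func).unique()
        # one label -> files mapping per split, so a pair never crosses the train/val boundary
        self.splbl2files = [{l: L(f for f in files[split] if label_func(f) == l)
                             for l in self.labels} for split in splits]
        # fix the validation pairs once, so validation stays deterministic across epochs
        self.valid = {f: self._draw_from_valid(f) for f in files[splits[1]]}

    def encodes(self, f):
        f2,t = self.valid[f] if f in self.valid else self._draw_from_train(f)
        img1,img2 = PILImage.create(f),PILImage.create(f2)
        return SiameseImage(img1, img2, t)

    def _draw_from_train(self, f): return self._draw(f, split=0)
    def _draw_from_valid(self, f): return self._draw(f, split=1)

    def _draw(self, f, split):
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same:
            cls = random.choice(L(l for l in self.labels if l != cls))
        return random.choice(self.splbl2files[split][cls]), same
```

With this version, training pairs come only from the training split and validation pairs only from the validation split, while the precomputed `self.valid` dictionary still keeps the validation pairs stable from epoch to epoch.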

Agree with your idea. The training pairs and the validation pairs are picked from the same pool of files, so the validation result might not be a fully reliable metric. However, with a large dataset, the probability that a pair picked for the training set is also picked for the validation set is quite small, so I think the leakage is negligible.

Hmm.

Actually, the same pair of images will never be in both the training and validation splits. The `splits` variable ensures a clean train/val split for img1, so there is no problem with img1.

The problem is with img2. What’s bothering me is that it is randomly drawn from `files` (which is 80% train and 20% val).

In my opinion, there should be a clean train-val split for img2 as well. Otherwise, the model may ‘peek’ into the validation set during training.
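
To get a rough sense of how much “peeking” this allows, here is a small, self-contained simulation (hypothetical numbers, no fastai or real images needed) that mimics the `_draw` logic on synthetic file ids and counts how often a training anchor gets paired with a validation image:

```python
import random

random.seed(42)
n_files, n_classes, valid_frac = 7400, 37, 0.2          # roughly the Pets dataset from the chapter
files = [(i, i % n_classes) for i in range(n_files)]     # (file id, class label)
valid = set(random.sample(range(n_files), int(valid_frac * n_files)))
by_class = {c: [f for f in files if f[1] == c] for c in range(n_classes)}

def draw(f):
    # same logic as _draw: 50/50 same class vs. different class, drawn from *all* files
    same = random.random() < 0.5
    cls = f[1] if same else random.choice([c for c in range(n_classes) if c != f[1]])
    return random.choice(by_class[cls])

train_anchors = [f for f in files if f[0] not in valid]
leaky = sum(draw(f)[0] in valid for f in train_anchors)
print(f"{leaky / len(train_anchors):.1%} of training pairs include a validation image")
```

Since the split is random, each class puts about 20% of its images in the validation split, so roughly one in five training pairs contains a validation image, which is exactly the peeking I mean.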

For example, during training, a pair whose img1 comes from the 80% split and whose img2 comes from the 20% split can also turn up during validation: img2 is the anchor drawn from the 20% split, and img1 can then be picked again because the second image is drawn at random from the full 100% of files.

But, as you said, it is cleaner with two completely separate splits.