In Chapter 11’s SiameseTransform, each validation set image is randomly paired with an image drawn from files. However, 80% of the images in files belong to the training set. Isn’t this data leakage?
Ideally, the validation set should consist of images the model hasn’t seen before. So, in my opinion, each image in the validation set should be paired with another image drawn from the validation set (and each image in the training set should be paired with another image drawn from the training set). Perhaps we can achieve this by creating two separate methods, _draw_from_train and _draw_from_valid?
Full class reproduced below for convenience:
class SiameseTransform(Transform):
    def __init__(self, files, label_func, splits):
        self.labels = files.map(label_func).unique()
        self.lbl2files = {l: L(f for f in files if label_func(f) == l) for l in self.labels}
        self.label_func = label_func
        self.valid = {f: self._draw(f) for f in files[splits[1]]}

    def encodes(self, f):
        f2,t = self.valid.get(f, self._draw(f))
        img1,img2 = PILImage.create(f),PILImage.create(f2)
        return SiameseImage(img1, img2, t)

    def _draw(self, f):
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same:
            cls = random.choice(L(l for l in self.labels if l != cls))
        return random.choice(self.lbl2files[cls]), same
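
To make the proposal concrete, here is a rough sketch of split-aware drawing in plain Python, without fastai, so it can be run on its own. The names build_pools and draw_from, the file names, and the splits are all made up for illustration; they are not part of the book's code. It assumes every split contains at least two classes, and, like the original _draw, it can pair an image with itself on a same-class draw.

```python
import random
from collections import defaultdict

def build_pools(files, label_func, splits):
    """Return one {label: [files]} mapping per split (hypothetical helper)."""
    pools = []
    for idxs in splits:
        pool = defaultdict(list)
        for i in idxs:
            pool[label_func(files[i])].append(files[i])
        pools.append(dict(pool))
    return pools

def draw_from(pool, f, label_func, rng=random):
    """Pick a partner for f using only the files in `pool`."""
    same = rng.random() < 0.5
    cls = label_func(f)
    if not same:
        # Choose a different class that exists in this split.
        cls = rng.choice([l for l in pool if l != cls])
    return rng.choice(pool[cls]), same

def label_func(f):
    # Toy labelling rule for the example: "cat_1.jpg" -> "cat".
    return f.split("_")[0]

# Toy data: four images per class, half of each class held out for validation.
files = ["cat_1.jpg", "cat_2.jpg", "cat_3.jpg", "cat_4.jpg",
         "dog_1.jpg", "dog_2.jpg", "dog_3.jpg", "dog_4.jpg"]
splits = ([0, 1, 4, 5], [2, 3, 6, 7])

train_pool, valid_pool = build_pools(files, label_func, splits)
train_files = {files[i] for i in splits[0]}
valid_files = {files[i] for i in splits[1]}

# A validation image is now only ever paired with another validation image.
f2, same = draw_from(valid_pool, "cat_3.jpg", label_func)
```

Inside SiameseTransform, the same idea would presumably mean building one lbl2files dictionary per entry of splits and having _draw pick the dictionary that matches the split f belongs to, which plays the role of _draw_from_train and _draw_from_valid.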