Siamese Example in FastBook

In the siamese example in fastbook pasted below why does the encodes method use this line:

f2,t = self.valid.get(f, self._draw(f))

Wouldn’t this always draw samples from the validation set during training?

from fastai2.vision.all import *
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

class SiameseImage(Tuple):
    def show(self, ctx=None, **kwargs): 
        img1,img2,same_breed = self
        if not isinstance(img1, Tensor):
            if img2.size != img1.size: img2 = img2.resize(img1.size)
            t1,t2 = tensor(img1),tensor(img2)
            t1,t2 = t1.permute(2,0,1),t2.permute(2,0,1)
        else: t1,t2 = img1,img2
        line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)
        return show_image(torch.cat([t1,line,t2], dim=2), 
                          title=same_breed, ctx=ctx)
    
def label_func(fname):
    return re.match(r'^(.*)_\d+\.jpg$', fname.name).groups()[0]

class SiameseTransform(Transform):
    def __init__(self, files, label_func, splits):
        self.labels = files.map(label_func).unique()
        self.lbl2files = {l: L(f for f in files if label_func(f) == l) for l in self.labels}
        self.label_func = label_func
        self.valid = {f: self._draw(f) for f in files[splits[1]]}
        
    def encodes(self, f):
        f2,t = self.valid.get(f, self._draw(f))
        img1,img2 = PILImage.create(f),PILImage.create(f2)
        return SiameseImage(img1, img2, t)
    
    def _draw(self, f):
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same: cls = random.choice(L(l for l in self.labels if l != cls)) 
        return random.choice(self.lbl2files[cls]),same
    
splits = RandomSplitter()(files)
tfm = SiameseTransform(files, label_func, splits)
tls = TfmdLists(files, tfm, splits=splits)
dls = tls.dataloaders(after_item=[Resize(224), ToTensor], 
    after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])

TL;DR: The line is ensuring that we always return the same validation data when doing validation.

Calling self.valid.get(f, self._draw(f)) will try to return the value associated with the key f; if the key is not present, it falls back to the default, self._draw(f).
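Here is a minimal sketch of that lookup pattern with plain Python (the filenames and the draw function are made up, just standing in for self.valid and self._draw). One subtlety worth noting: Python evaluates arguments eagerly, so the default draw(f) runs even when the key is found; its result is simply discarded.

```python
import random

random.seed(0)

# Hypothetical precomputed (partner, same?) pair, like self.valid in __init__
valid = {"cat_1.jpg": ("cat_7.jpg", True)}

def draw(f):
    # stand-in for self._draw: a random (partner, same?) pair
    return (f"partner_of_{f}", random.random() < 0.5)

# For a validation file, .get returns the stored pair...
assert valid.get("cat_1.jpg", draw("cat_1.jpg")) == ("cat_7.jpg", True)

# ...for any other file, the default is returned instead
f2, t = valid.get("dog_3.jpg", draw("dog_3.jpg"))
print(f2, t)
```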

The only keys present in self.valid are the filenames from the validation set, so for all training data the default self._draw(f) is always used, while for validation data we reuse what was already computed in the constructor. This way our validation pairs are always the same.


It appears that the _draw method (defined in this code) doesn’t make that distinction though. Or am I misunderstanding something?

The _draw method does not make that distinction, and this is exactly why we have all of this.

In the constructor we do self.valid = {f: self._draw(f) for f in files[splits[1]]}, so we create all the Siamese pairs for our validation data only once; _draw runs as usual, sometimes returning a same-breed pair and sometimes a different-breed pair.
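A toy simulation of this, with hypothetical filenames and no fastai, shows the effect: across two "epochs" the validation pairs come out identical, while the training pairs are free to change.

```python
import random

files = [f"img_{i}.jpg" for i in range(6)]   # made-up filenames
train_files, valid_files = files[:4], files[4:]

def draw(f):
    # stand-in for _draw: random partner plus a same/different flag
    return (random.choice(files), random.random() < 0.5)

random.seed(42)
valid = {f: draw(f) for f in valid_files}    # built once, as in __init__

def encodes(f):
    return valid.get(f, draw(f))

# Run two "epochs" over all files
epoch1 = {f: encodes(f) for f in files}
epoch2 = {f: encodes(f) for f in files}

assert all(epoch1[f] == epoch2[f] for f in valid_files)  # valid pairs fixed
changed = [f for f in train_files if epoch1[f] != epoch2[f]]
print(changed)  # training pairs typically differ between epochs
```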

The important thing here is to keep the same validation pairs for the whole experiment (so we don’t get misled when computing our validation loss at the end of each epoch).

Why might we be misled? If we didn’t make sure we were using the same validation pairs and instead created new pairs at each epoch, our loss could decrease without our model actually getting better, just because we got lucky and the new pairs were easier to predict than the ones from the previous epoch.
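You can see this noise with a toy metric (all numbers below are made up for illustration): a fixed evaluation set gives a perfectly repeatable score, while resampling the pairs each "epoch" makes the score wobble even though the pretend model never changes.

```python
import random
import statistics

def fake_accuracy(pairs):
    # pretend an unchanging model gets same-breed pairs right 90% of the
    # time and different-breed pairs 60% of the time (made-up numbers)
    return sum(0.9 if same else 0.6 for _, same in pairs) / len(pairs)

def sample_pairs(n):
    # random pairs, each flagged same/different like _draw's coin flip
    return [(i, random.random() < 0.5) for i in range(n)]

random.seed(1)
fixed = sample_pairs(50)
fixed_scores = [fake_accuracy(fixed) for _ in range(5)]              # 5 "epochs"
resampled_scores = [fake_accuracy(sample_pairs(50)) for _ in range(5)]

print(statistics.pstdev(fixed_scores))      # 0.0 -- the metric is stable
print(statistics.pstdev(resampled_scores))  # noise from resampling alone
```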


I’m thinking… maybe changing the split from splits[1] to splits[0] would make it draw from the training set during training?