'Modding' the Siamese example for metric learning

(I can’t imagine this is a new question but my searches haven’t turned up what I’m looking for.)

I love how the Siamese network example shows how to extend the existing DataLoader infrastructure, and then achieves such high accuracy so quickly using “merely” CrossEntropy loss, without any discussion of contrastive/hinge/triplet losses or mining for “hard examples”.

…I’d still like to try some metric learning with fastai (so I can visualize the embeddings), so I’ve been looking at resources such as the humpback whale Kaggle competition and PyTorch Metric Learning, but I’m not seeing any posts where people share working code for doing this with fastai v2.

So below is my “minimal” work-in-progress code so far, and sadly it doesn’t work (yet), which is why I’m posting – looking for help!

In my Colab notebook, everything is exactly the original Siamese network tutorial except that I’ve modified the model a bit: the original head Flattens and calls BatchNorm across both samples at once, and its subsequent Linear layer mixes features from both images before any individual embedding vectors are produced, which doesn’t seem to be what you’d want for metric learning. So in my model, instead of attaching the head to merge the two encoder outputs directly (as in the original tutorial), I name the (ordinary, non-merging) head “mid” and attach it onto the end of each encoder branch, then use a final “head” to merge the vectors produced by the two “mid” sections. Here are the key parts:

class SiameseModel(Module):
    "The encoder and mid are shared by both branches; only the head sees both images."
    def __init__(self, encoder, mid, head):
        self.encoder, self.mid, self.head = encoder, mid, head
    def forward(self, x1, x2):
        ftrs = torch.cat([self.mid(self.encoder(x1)), self.mid(self.encoder(x2))], dim=1)
        return self.head(ftrs)

encoder = create_body(resnet34, cut=-2)
embed_dim = 512   # I'd like to use only 3 dims, but keeping it large for now
mid = create_head(512, embed_dim, ps=0.5)  # not the true head
# head = nn.Sequential(nn.Linear(embed_dim*2, 2))   # that doesn't work well
head = nn.Sequential(   # Ok, try giving it a bit more nonlinearity on the final end:
    nn.Linear(embed_dim*2, 2))
model = SiameseModel(encoder, mid, head)

def siamese_splitter(model):
    return [params(model.encoder), params(model.mid), params(model.head)]

learn.freeze_to(-2)  # freeze just encoder, but train mid and head
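(For context, `learn` is the usual fastai `Learner` built with this splitter, as in the tutorial, e.g. `Learner(dls, model, loss_func=CrossEntropyLossFlat(), splitter=siamese_splitter, metrics=accuracy)`; `freeze_to(-2)` then leaves only the last two parameter groups trainable.) A minimal pure-PyTorch sketch of what that amounts to, using toy stand-in modules rather than the real resnet34 body:

```python
import torch
import torch.nn as nn

class SiameseModel(nn.Module):
    """Stand-in copy of the model above, with toy submodules for illustration."""
    def __init__(self, encoder, mid, head):
        super().__init__()
        self.encoder, self.mid, self.head = encoder, mid, head
    def forward(self, x1, x2):
        ftrs = torch.cat([self.mid(self.encoder(x1)), self.mid(self.encoder(x2))], dim=1)
        return self.head(ftrs)

# Toy stand-ins (the real encoder is a resnet34 body, embed_dim would be 512)
embed_dim = 8
encoder = nn.Sequential(nn.Conv2d(3, 4, 3, padding=1),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
mid  = nn.Linear(4, embed_dim)
head = nn.Linear(embed_dim * 2, 2)
model = SiameseModel(encoder, mid, head)

# Three parameter groups, as siamese_splitter produces
groups = [list(model.encoder.parameters()),
          list(model.mid.parameters()),
          list(model.head.parameters())]

# freeze_to(-2): freeze everything except the last two groups
for p in groups[0]:
    p.requires_grad_(False)

out = model(torch.randn(2, 3, 16, 16), torch.randn(2, 3, 16, 16))
print(out.shape)  # torch.Size([2, 2])
```

With this split, only `mid` and `head` receive gradient updates while the pretrained encoder stays fixed.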

…that’s it. I haven’t even introduced the ContrastiveLoss, metric learning, etc. The code so far should be very similar to the original tutorial…except that mine doesn’t work! :rofl: What I find is that the loss never drops below ~0.7 (≈ ln 2, which is what cross-entropy gives you for random binary guessing), and the accuracy never improves above 50% – i.e., random guessing.

It didn’t help when I tried various sizes of embed_dim (from 512 down to 3) and/or varied the makeup of the final head layer (as you see from the comments above). And unfreezing doesn’t help.

Can anyone offer ideas on why this doesn’t work? (And maybe even how to fix it!)


Idea from @zachmueller: “if it were me, I’d have my “mid” head stop at the first linear layer in fastai’s head, then have the second one be the rest.”

So, trying that (after printing out what create_head typically produces):

mid = nn.Sequential(
    nn.Dropout(p=0.25, inplace=False),
    nn.Linear(in_features=1024, out_features=embed_dim, bias=False))

head = nn.Sequential(
    nn.Dropout(p=0.5, inplace=False),
    nn.Linear(embed_dim*2, 2, bias=False))
…also doesn’t work. e.g., the LRFinder yields: SuggestedLRs(lr_min=5.248074739938602e-06, lr_steep=2.2908675418875646e-06), much lower than usual (3e-3). And the training results are the same.

BTW, one other idea, to keep to only two parameter groups: instead of introducing a mid, append the mid layers directly onto the end of the encoder, so we just have encoder and a (new, shortened) head.

# add more layers (from the typical head) onto the encoder
embed_dim = 512
more_layers = [
    nn.BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),  # 1024 = 512*2 because of the concat in AdaptiveConcatPool2d
    nn.Dropout(p=0.25, inplace=False),
    nn.Linear(in_features=1024, out_features=embed_dim, bias=False),
]

for i, layer in enumerate(more_layers):
    encoder.add_module(f'l+{i}', layer)

head = nn.Sequential(
    nn.BatchNorm1d(embed_dim*2, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True),
    nn.Dropout(p=0.5, inplace=False),
    nn.Linear(embed_dim*2, 2, bias=False))
model = SiameseModel(encoder, head)  # the tutorial's original two-argument SiameseModel

…and keep everything else the same as in the tutorial (i.e. the original splitter, regular freeze, etc.). I started a fresh Colab notebook with these changes, but the results are similarly unsuccessful. …BUT looking back, this is actually what I’d tried originally: I figured it didn’t work because the extra layers added onto the encoder aren’t pre-trained, so their weights are random and need training, yet they’re frozen! …So this doesn’t work either. :frowning:

UPDATE: Ahhh, I think I know why the results are so different from the original tutorial: in the original model, the AdaptiveConcatPool2d in the head operates on the concatenated feature maps from both branches – “stacked”, as it were. When I broke it up so that AdaptiveConcatPool2d only operates on one image’s feature maps at a time, that sort of cross-branch operation no longer occurs. So there is a significant difference between the two types of model.

We could treat the sets of 7x7 feature maps generated by the encoder as “embedding vectors”, define some kind of metric on them, and maybe visualize them via something like PCA or t-SNE… but this isn’t what I was hoping to do: I wanted regular vectors whose dimensionality I could vary at will. Hmmm. :thinking: This should still be possible; my expectation that I could get there “incrementally” by minimally modifying the Siamese example was just naive.
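For what it’s worth, once you do have flat embedding vectors, the visualization part is cheap; a quick sketch with `torch.pca_lowrank` on dummy embeddings (t-SNE via scikit-learn would work similarly):

```python
import torch

# Dummy stand-ins for learned embeddings: 100 samples, 512-dim
emb = torch.randn(100, 512)

# PCA via low-rank approximation: project onto the top 3 principal directions
mean = emb.mean(dim=0, keepdim=True)
U, S, V = torch.pca_lowrank(emb, q=3)       # V: (512, 3) principal directions
emb3 = (emb - mean) @ V[:, :3]              # (100, 3), ready for a 3-D scatter plot
print(emb3.shape)  # torch.Size([100, 3])
```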

Happy to hear suggestions on how to proceed. In the meantime, I’ll go ahead with trying to supply contrastive losses…
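For reference, by “contrastive loss” I mean the classic pairwise formulation (Hadsell et al. style): pull matching pairs together, push non-matching pairs apart until they’re at least `margin` away. A plain-PyTorch sketch (the margin of 1.0 is just a placeholder to tune; PyTorch Metric Learning has battle-tested versions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, is_same, margin=1.0):
    """Pairwise contrastive loss: distance^2 for matching pairs,
    hinged (margin - distance)^2 for non-matching pairs."""
    d = F.pairwise_distance(emb1, emb2)                      # Euclidean distance per pair
    pos = is_same * d.pow(2)                                 # matching pairs
    neg = (1 - is_same) * (margin - d).clamp(min=0).pow(2)   # non-matching pairs
    return 0.5 * (pos + neg).mean()

e1, e2 = torch.randn(8, 64), torch.randn(8, 64)
y = torch.randint(0, 2, (8,)).float()   # 1 = same class, 0 = different
loss = contrastive_loss(e1, e2, y)
```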

UPDATE: I seem to have gotten it working!

This is after more help from Zach Mueller, and I even got an accuracy metric going. (Spoiler: the contrastive-loss model is currently not as accurate as the original version of the tutorial. There are a couple of margin variables that could be tweaked, among other things.)

Here is a link to my (updated) Colab notebook.

What I can’t figure out now is the “Making show_results work” part near the end: if the variable y is supposed to contain the predictions, then why don’t I find TWO tensors in it, instead of only one?

Further update:
I went back and put print statements all throughout the fastai source code. It seems there’s some “corruption” happening in the decode() step called from data.core.TfmdDL._pre_show_batch: going in, my batch (b_out from data.core.TfmdDL.show_results) contains a list that looks like [TensorImage, TensorImage, Tensor, Tensor] – the last two elements are our outputs – but after decode() the last Tensor somehow becomes a TensorImage. Not sure why, or how to fix it. Any ideas?

…I may go back and just package the two outputs together via torch.cat() or something, in order to work around this problem.


ANSWER: GOT IT! I packaged the two model outputs into one long vector, cutting it back in half whenever I want to use them… and mucked about for a couple of days:

This new Colab notebook is a complete working version of the original Siamese tutorial, except it uses a contrastive loss to learn embeddings. :partying_face:
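The packing trick itself is just `torch.cat` along the feature dimension, plus a split wherever the two embeddings are needed; a minimal sketch (shapes illustrative):

```python
import torch

b, embed_dim = 4, 512
emb1, emb2 = torch.randn(b, embed_dim), torch.randn(b, embed_dim)

# Pack both branch outputs into one long vector so fastai sees a single tensor...
packed = torch.cat([emb1, emb2], dim=1)   # (b, 2*embed_dim)

# ...and cut it in half again wherever the two embeddings are needed
out1, out2 = packed.chunk(2, dim=1)       # each (b, embed_dim)
assert torch.equal(out1, emb1) and torch.equal(out2, emb2)
```

Since the model now returns one tensor, show_results and the rest of the fastai plumbing no longer have to deal with a tuple of outputs.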

Spoiler: accuracy is no better than ~89% across the various tweaks I tried.

Next I want to see how low I can make the embedding dimension and still retain accuracy, and look at visualizing the embeddings somehow.