Image similarity with a siamese network?

I would like to implement a model for face-image verification using a siamese network. So I get an image of a person and compare it to all possible images. I think this should work, but I also suspect it is not the smartest possible approach. What do you think? Thanks!

This is a very common use of siamese networks. If you just google ‘siamese network’, there are quite a lot of articles/resources that talk about it, including different variants of the networks and the loss functions used.

Good luck, and have fun exploring this!



Stuck with Siamese Network for Face Verification.

Hi! I have found a tutorial on the site on how to implement a siamese network:

I reworked this tutorial for face verification with the AT&T face data set. In theory it works, and validation accuracy is above 97%, but it completely fails on additional test data. I have created a set of portrait images of two different people. With these test images I achieve an accuracy of about 50%, which means the model is completely useless. I might as well roll a die.

Here is a link to my notebook:

Do you have any ideas on how to improve the quality?

I think of these two paths:

  1. The tutorial doesn’t seem to use any additional transforms such as contrast, resize, etc.
  2. The training and validation images are greyscale, while the test images are in colour.


Edit: I have changed my test images to greyscale, which seems to improve test quality a bit.


Firstly, just want to check, you mentioned 97% validation accuracy, but I can’t seem to see/find that in your notebook? I see ~81% accuracy after 10+20 epochs?

In any case, there are a few things that I can think of and comment on. (Disclaimer: I have no idea/proof if what I say below is correct or not…! =P )

I cannot remember off the top of my head what augmentations were applied in the tutorial within after_batch by just using fastai2’s aug_transforms. Your notebook seems to have defined your own set of augs, namely Rotate, Flip, Warp, Zoom, Contrast, Dihedral, and you can always try out different (additional) augs that you think make sense and might improve performance.

Yes, I think coloured vs. greyscale will definitely affect a lot of things, and it is one of the reasons for the output that you got. I will write down some further thoughts about this below.

  1. The example Siamese network backbone uses resnet34 pretrained on imagenet, so there will be quite some differences between the high-level ‘features’ it recognises (chiefly to do with classifying 1000 imagenet categories of coloured images) and the ‘features’ you need for recognition/classification of greyscale facial images, which require more granularity and subtlety in facial features. When you then gave it a test set of coloured images again, it was kind of ‘triple-confusing’ to the model (pretrained on colours; finetuned on greyscales; tested on colours)…
  2. Due to the differences, I think you will need to train the model for more epochs in its unfrozen state, to learn to differentiate features between greyscale faces. One key benefit of the Siamese network is that from a relatively small set of training data you can already generate a lot more ‘pairs’ as training input, and so I think you can train for a lot more epochs.
  3. Having said the above, at the end of the day it seems like you only have 40 people ×10 images each for training data, and I don’t have sufficient experience to say whether that’s enough for the model to learn the subtleties in facial feature differences or not. (It looks like it’s not enough)
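To put a rough number on the ‘pairs’ point above, here is a minimal sketch that enumerates all pairs for a data set shaped like the AT&T one (40 people, 10 images each). The file names and the `build_pairs` helper are made up for illustration; they are not from the tutorial or the notebook.

```python
from itertools import combinations

def build_pairs(images_by_person):
    """Return (positive_pairs, negative_pairs) as lists of (img_a, img_b)."""
    positives, negatives = [], []
    # Same-person pairs: every unordered pair within one identity.
    for imgs in images_by_person.values():
        positives.extend(combinations(imgs, 2))
    # Different-person pairs: every image of one identity against every image of another.
    for (_, imgs_a), (_, imgs_b) in combinations(images_by_person.items(), 2):
        negatives.extend((a, b) for a in imgs_a for b in imgs_b)
    return positives, negatives

# Hypothetical file names: 40 people, 10 images each.
dataset = {f"s{p}": [f"s{p}/{i}.pgm" for i in range(1, 11)] for p in range(1, 41)}
positives, negatives = build_pairs(dataset)
print(len(positives), len(negatives))  # 1800 78000
```

So 400 images already give 79800 distinct pairs, but only 1800 of them are ‘same person’ pairs, which is why pairs are usually sampled in a balanced way rather than enumerated in full.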

As far as I know, the typical usage for such ‘face verification’ Siamese model would be:

  • Have a list of identities that are intended to be matched against in deployment, e.g. a ‘staff database’ of 50 people, each providing a few headshots to the training set.
  • Generate Siamese-pairs from the data and train the model to learn the feature differences, which is actually still somewhat ‘specific’ to the headshots of the database of identities.
  • In deployment (i.e. the ‘test’ stage), feed it an input image, and use the Siamese network to compare it with each of the 50 identities in the database (batched on the GPU if useful), to say whether the input is from one of the identities. I guess if necessary it is possible to ‘ensemble’ this test by comparing against multiple images of each identity and perhaps get a better output prediction.
  • If there are now new staff, add their headshots into the data set and either retrain altogether (if doable) or just train the saved model more with the new, larger, set of data. In ‘test’ stage, add the new staff into the reference list of images to be compared (now >50 identities).
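The deployment step in the list above could look roughly like the sketch below. It assumes the trained network body is used to turn each image into an embedding vector and that cosine similarity with a fixed threshold is the comparison; the tutorial model instead outputs a pair probability, so treat this as one possible variant. The names `identify` and `gallery`, the toy 3-d embeddings and the threshold of 0.8 are all made up.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def identify(query, gallery, threshold=0.8):
    """Return (best_identity, score), or (None, score) if nothing clears the threshold."""
    best_name, best_score = None, -1.0
    for name, embeddings in gallery.items():
        # 'Ensemble' over several reference shots per identity by averaging the scores.
        score = sum(cosine_similarity(query, e) for e in embeddings) / len(embeddings)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score

# Toy embeddings standing in for the output of a trained network.
gallery = {
    "alice": [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    "bob": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}
match, _ = identify([0.95, 0.05, 0.05], gallery)
stranger, _ = identify([0.0, 0.0, 1.0], gallery)
print(match, stranger)  # alice None
```

Adding a new staff member then only means adding an entry to `gallery`; whether the model also needs retraining depends on how well the embeddings generalise to unseen faces.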

In your notebook, your ‘test’ is to use the model (which might not have been sufficiently trained, to begin with) to differentiate between two new identities that the model had not seen before. I think this is somewhat different to my understanding of the typical process/usage mentioned above. Add to that the differences in the pretrained features and coloured (pretrained) vs. greyscale (your data), and possibly the low epoch count, I think they all combined to give the low test accuracy that you saw.

I think if you are looking to have a Siamese network that can output ‘similar/dissimilar’ for new images/identities, you will likely need to have a lot more training data (in terms of both variety, i.e. number of identities, and volume, i.e. number of headshots per identity) for the network to actually learn, when trained a lot more in unfrozen state, all the subtleties in facial features. You should also look into different types of loss functions (I think there are ‘triplet loss’ etc.) and ‘similarity metric’ (with threshold) as output, instead of just a probability.
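For reference, the triplet loss mentioned above is usually written as max(0, d(anchor, positive) - d(anchor, negative) + margin). A minimal plain-Python sketch, assuming Euclidean distance and a margin of 1.0 (both are choices for illustration, not something taken from the tutorial):

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

anchor, positive = [0.0, 0.0], [0.0, 0.5]
easy_negative, hard_negative = [3.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, positive, easy_negative))  # 0.0
print(triplet_loss(anchor, positive, hard_negative))  # 0.5
```

The loss is zero once the negative is at least `margin` further from the anchor than the positive, so only the ‘hard’ triplets contribute to training. In a real model the same formula would be applied to batches of embedding tensors.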

As an aside: I previously mentioned to Sylvain that the tutorial you linked to does not actually implement a common/typical Siamese network model, e.g. it does not do a ‘difference in activations’, but instead concatenates the two sets of activations together for the ‘head’. In a way, I think it shows the power of fastai, where a quick and easy custom architecture (Sylvain did not actually read too much into Siamese network, he kind of just winged it!) can still give pretty good results! One downside is that the model is not ‘symmetrical’, i.e. the input of [img A, img B] will give a different output from [img B, img A], even though we are nominally looking for a single output of ‘how similar is this pair’.

On the other hand, if you are looking for the more typical application of ‘is this image one of my staff’ that I mentioned above, then your ‘test’ stage will be comparing against images of identities that the model had already seen and been trained on, and should give decent accuracy, plus all the benefits of Siamese network (not needing large data set, easier to extend to more identities, etc.).

Hope this helps. Apologies for the ramblings - wasn’t planning to write so much…!



Hi! Thanks for this very detailed answer! I have definitely got some points to work on.

What’s your advice on getting this “A - B” is like “B - A” difference in activations into the model?

Would you suggest using another model than resnet34? (I have to admit, by now I am not that familiar with using different models with


I guess just change the custom model and take the (absolute) difference, instead of concatenating? I think it needs to be abs difference so that it is indeed symmetrical for A-B and B-A.
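A tiny plain-Python illustration of why the absolute difference makes the output symmetrical while concatenation does not. The single linear unit stands in for the model head, and the embeddings and weights are made-up numbers; in the actual model the same idea would be applied to the two activation tensors before they go into the head.

```python
def linear(weights, features):
    # A single linear unit standing in for the model head.
    return sum(w * x for w, x in zip(weights, features))

def concat_head(emb_a, emb_b, weights):
    # Tutorial-style head: the two embeddings are concatenated, so order matters.
    return linear(weights, emb_a + emb_b)

def abs_diff_head(emb_a, emb_b, weights):
    # Symmetric head: |a - b| == |b - a| element-wise, so order cannot matter.
    return linear(weights, [abs(a - b) for a, b in zip(emb_a, emb_b)])

emb_a, emb_b = [1.0, 2.0], [3.0, 5.0]
concat_weights = [0.5, -1.0, 2.0, 0.25]  # head input is twice the embedding size
diff_weights = [0.5, -1.0]               # head input is the embedding size

print(concat_head(emb_a, emb_b, concat_weights), concat_head(emb_b, emb_a, concat_weights))  # 5.75 -1.0
print(abs_diff_head(emb_a, emb_b, diff_weights), abs_diff_head(emb_b, emb_a, diff_weights))  # -2.0 -2.0
```

Note that with the abs-diff variant the head takes the width of one embedding as input instead of two, so the head's input size would need to change accordingly.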

I wasn’t suggesting that you change to a different pretrained model here – resnet34 is normally a good one to start playing with. I was just saying that an imagenet-pretrained model (e.g. resnet34) will not, by default, suit the application that you have in mind, and you will need to train for more epochs and with more relevant data (e.g. greyscale facial images) first. You can also search around for more relevant pretrained models that might be useful for you to experiment with – a quick search returned something about ‘OpenFace’ pretrained model.

Good luck!


Hi @utkb,

I’m currently working with the siamese tutorial on my own data. I want to replace the current head+loss of the tutorial (concat embs + linear layers + cross entropy) with other types of metric learning approaches. I’m new to metric learning. What are the common approaches (and SOTA) for dealing with the pairs of embeddings in a siamese model?

  • Absolute difference of embs -> linear layers -> binary classification
  • Cosine similarity of embs -> regression
  • ArcFace loss

Hi Javier,

My quick (read: simple =P ) tests showed that the abs-diff approach gave similar performance as the concat approach that Sylvain took in the tutorial, except the concat approach is not ‘symmetrical’ as I mentioned above.

Unfortunately I haven’t actually had time to look into more details for these. I previously bookmarked this link, which seems to give pretty good coverage of different loss function approaches for metric learning, including some links to implementations in different frameworks (e.g. PyTorch). As far as I could tell, one key thing would be the selection/tuning of the ‘margin’ value, which seems somewhat analogous to tuning ‘threshold’ value, depending on what you value more (precision, recall, FP, FN, etc.).
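To make the threshold trade-off concrete, here is a small sketch that sweeps a distance threshold over a handful of made-up pair distances and reports precision and recall. The distances, labels and the `precision_recall` helper are invented for illustration only.

```python
def precision_recall(distances, labels, threshold):
    """Predict 'same person' when distance <= threshold; labels are True for same-person pairs."""
    tp = sum(1 for d, y in zip(distances, labels) if d <= threshold and y)
    fp = sum(1 for d, y in zip(distances, labels) if d <= threshold and not y)
    fn = sum(1 for d, y in zip(distances, labels) if d > threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Made-up distances between embedding pairs, with ground-truth 'same person' flags.
distances = [0.2, 0.4, 0.6, 0.7, 0.9, 1.3]
labels = [True, True, False, True, False, False]
for threshold in (0.5, 0.8, 1.0):
    print(threshold, precision_recall(distances, labels, threshold))
```

A stricter (lower) threshold gives fewer false positives at the cost of missed matches, and a looser one does the opposite, so the value has to be picked on a validation set according to which error is more costly.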

It would be super interesting to read more about what you find from your work.



I know there is a package called pytorch-metric-learning with a lot of losses (potentially useful for the siamese tutorial, I think) and miners (mining is the process of finding the best pairs or triplets to train on).

I want to bring some losses to the fastai2 siamese tutorial
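To illustrate what ‘mining’ means, here is a plain-Python sketch of one common strategy, often called batch-hard mining: for each anchor, take the farthest sample with the same label and the closest sample with a different label. This deliberately does not use the pytorch-metric-learning API; the function name and the toy data are made up.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplets(embeddings, labels):
    """For each anchor index, pick the farthest same-label and the closest different-label index."""
    triplets = []
    for a, (emb_a, lbl_a) in enumerate(zip(embeddings, labels)):
        positives = [i for i, l in enumerate(labels) if l == lbl_a and i != a]
        negatives = [i for i, l in enumerate(labels) if l != lbl_a]
        if not positives or not negatives:
            continue  # no valid triplet for this anchor
        hardest_pos = max(positives, key=lambda i: euclidean(emb_a, embeddings[i]))
        hardest_neg = min(negatives, key=lambda i: euclidean(emb_a, embeddings[i]))
        triplets.append((a, hardest_pos, hardest_neg))
    return triplets

# Toy 2-d embeddings for one batch.
embeddings = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [0.4, 0.0], [2.0, 0.0]]
labels = ["anna", "anna", "anna", "ben", "ben"]
triplets = batch_hard_triplets(embeddings, labels)
print(triplets)  # [(0, 2, 3), (1, 2, 3), (2, 0, 3), (3, 4, 1), (4, 3, 2)]
```

Each tuple is (anchor, positive, negative) as indices into the batch; these would then be fed to a triplet-style loss.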


Great to see that there are more people interested in this topic!
I am currently experimenting with a bigger dataset.

I have extracted the faces from this dataset from google:

cleaned to have each face only once in the dataset and augmented each image by setting different contrast and/or rotations.

This created a dataset with 86281 different face images and a total count of 2674176 images for training and validation.

Yet the code from the tutorial to extract the image-names as keys of a dict was extremely slow:

labels = list(set(
lbl2files = {l: [f for f in files if label_func(f) == l] for l in labels}
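The comprehension above is slow because it rescans the whole file list once per label. If each of the 86281 faces is its own label, that is roughly 86281 × 2674176 ≈ 2.3 × 10^11 calls to label_func. A single pass that groups files as it goes avoids this. The sketch below uses only the standard library and assumes the label is the name of the parent folder; `label_of`, `group_by_label` and the example paths are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path

def label_of(path):
    # Assumption: the label is the name of the folder that contains the image.
    return Path(path).parent.name

def group_by_label(files):
    """Build {label: [files]} in one pass instead of rescanning all files per label."""
    lbl2files = defaultdict(list)
    for f in files:
        lbl2files[label_of(f)].append(f)
    return dict(lbl2files)

# Hypothetical paths standing in for the real file list.
files = [Path("faces/anna/1.jpg"), Path("faces/ben/1.jpg"), Path("faces/anna/2.jpg")]
lbl2files = group_by_label(files)
labels = list(lbl2files)
print(labels)  # ['anna', 'ben']
```

This calls the label function once per file, so the cost grows with the number of files rather than with files × labels.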

My “new” dataset is organized into different folders. There is a distinct folder for each category and I found that scandir works a lot faster (see #

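import os
from fastai.vision.all import *  # provides Path, parent_label, get_image_files (fastai2.vision.all in older installs)

def label_func(fname):
    # The label is the name of the folder that contains the image.
    return parent_label(fname)

path = Path()
img_path = path/"google_face_images_dataset/"
files = get_image_files(img_path)

# One scandir pass per label folder instead of filtering the full file list for every label.
list_subfolders = [f.path for f in os.scandir(img_path) if f.is_dir()]
# Path(l).name is the folder name, i.e. the label (replaces the regex, which assumed "/" separators).
lbl2files2 = {Path(l).name: [Path(f.path) for f in os.scandir(l)] for l in list_subfolders}
labels = list(lbl2files2.keys())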

Just in case this helps anyone.
