Help needed in DataBunch construction

Hello all. I am currently attempting the Recognizing Faces in the Wild challenge on Kaggle. The task is to determine if two people are blood-related based solely on images of their faces.

The labels look like:


F0002/MID1 and F0002/MID3 mean that in family F0002, MID1 is related to MID3 (the first row in the above figure).

More information about the dataset:

the training set is divided into families (F0123), then individuals (MIDx). Images in the same MIDx folder belong to the same person. Images in the same F0123 folder belong to the same family.

The training images are contained in a folder with images from 470 families, and its structure looks like this (a small snapshot):


Now if you zoom into a folder of a particular family, you get:

Given all this data, you are to build a system that takes image pairs as given in the test set and predicts whether they are blood-related. The images from the test set look like:


The prediction file contains image-pairs like:


And accordingly, we have to predict if they are related.

Given this problem statement, I am struggling to understand how to construct the DataBunch to feed to a model. Any clues or pointers would be really helpful.
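Before worrying about the DataBunch itself, one step that helps is turning the relationship list into explicit labeled image pairs (positives from related people, negatives sampled across families). Here is a minimal sketch; `build_pairs` and its inputs are my own naming, not part of the competition or fastai, and the family-id check assumes person ids of the form `F0002/MID1`:

```python
import random

def build_pairs(relationships, members_by_person, seed=0):
    """Build labeled (image, image, label) pairs from kinship relations.

    relationships: list of (p1, p2) person ids, e.g. ("F0002/MID1", "F0002/MID3")
    members_by_person: dict mapping person id -> list of image paths
    Returns a list of (img_a, img_b, label) with label 1 = related, 0 = not.
    """
    rng = random.Random(seed)
    pairs = []

    # Positive pairs: every image of p1 against every image of p2.
    for p1, p2 in relationships:
        for a in members_by_person.get(p1, []):
            for b in members_by_person.get(p2, []):
                pairs.append((a, b, 1))

    # Negative pairs: people from different families, sampled so that
    # the dataset has as many negatives as positives.
    n_pos = len(pairs)
    people = list(members_by_person)
    n_neg = 0
    while n_neg < n_pos:
        p1, p2 = rng.sample(people, 2)
        if p1.split("/")[0] == p2.split("/")[0]:
            continue  # same family -> might actually be related, skip
        pairs.append((rng.choice(members_by_person[p1]),
                      rng.choice(members_by_person[p2]), 0))
        n_neg += 1
    rng.shuffle(pairs)
    return pairs
```

A list like this can then be wrapped in whatever dataset abstraction you prefer, with each item holding two images and a binary label.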


Hi Sayak,
I had a look into the problem. I'm not sure this is a traditional, straightforward classification problem.

This is a find-the-distance-between-face-embeddings problem.

An easier way to do this would be:

  1. Find the face embeddings of the pair of faces (use a library like dlib for this).
  2. Each embedding is a 128-element vector.
  3. Find the cosine distance / L1 distance between the two embeddings.
  4. Fix a threshold for the distance.
  5. If the distance is less than the threshold, predict that the two images are blood-related.

This would give you a decent result on the leaderboard. (But this is not the only way.)
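The distance-and-threshold steps above can be sketched in a few lines once you have the 128-element embeddings (e.g. from dlib). The function names and the 0.4 threshold here are placeholders of my own, not values from the competition:

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means identical direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_related(emb_a, emb_b, threshold=0.4):
    """Predict 'related' when the two face embeddings are closer than
    the threshold. The threshold value is a placeholder; in practice
    you would tune it on a held-out validation set."""
    return cosine_distance(emb_a, emb_b) < threshold
```

The main knob is the threshold: sweep it over the validation pairs and pick the value that maximizes your metric.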

References: (library in Python built on top of dlib) (dlib package - an amazing C++ library with Python bindings)



Thank you very much for your suggestion. I had initially constructed the dataset (basically an Image -> Image mapping) from the .csv file. Here’s the Kaggle Kernel of that: After that, I could not figure out how I should proceed. You will see a U-Net in the kernel, but that was kind of a no-brainer.

You can use a Siamese-network approach, training the network to recognize whether two pictures are blood-related rather than whether they show the same person, as is usual.

Take a look at: Siamese Networks
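For a concrete picture of the Siamese idea: one encoder with shared weights embeds both images, and a small head classifies the pair from the difference of the embeddings. This is a minimal PyTorch sketch under my own assumptions (tiny placeholder encoder, absolute-difference pair representation); a real entry would use a pretrained backbone:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Minimal Siamese sketch: a shared encoder embeds both images, and a
    linear head scores the pair (related vs. not) from |emb_a - emb_b|."""

    def __init__(self, emb_dim=128):
        super().__init__()
        # Shared convolutional encoder (placeholder; in practice you would
        # swap in a pretrained backbone such as a ResNet).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, emb_dim),
        )
        self.head = nn.Linear(emb_dim, 1)

    def forward(self, img_a, img_b):
        # The SAME encoder (same weights) processes both branches.
        emb_a = self.encoder(img_a)
        emb_b = self.encoder(img_b)
        # Absolute difference is a simple symmetric pair representation.
        return self.head(torch.abs(emb_a - emb_b)).squeeze(1)
```

Training this with a binary cross-entropy loss on the pair labels (related / not related) replaces the fixed distance threshold with a learned decision.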
