Image matching - what's the intuition?

Forgive me if this is a common question, but I was wondering about the concept of 'matching images'.

For example, say I have a dataset consisting of images labelled 'shirts' and 'pants', as well as some kind of matrix that says whether each shirt matches each pair of pants.

Ultimately, I need a classifier trained to 'match' shirts and pants, i.e., given a shirt image, get an output vector over all the 'pants', with a 0.0-1.0 value indicating the strength of each match.

What would the network architecture look like?

I can imagine getting a vector from a CNN trained to classify images as 'shirts' and 'pants', but then what? How do I use the vectors to solve the matching problem? I've put a rough sketch of what I'm picturing below.
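To make that concrete, here is roughly the kind of two-tower setup I have in mind. This is only a sketch: the layer sizes and names are made up, and in practice the two encoders would be pretrained CNN backbones rather than the tiny conv stacks shown here.

```python
import torch
import torch.nn as nn

class MatchNet(nn.Module):
    """Two-tower sketch: embed a shirt and a pair of pants,
    then score how well the pair matches. Purely illustrative."""

    def __init__(self, embed_dim=128):
        super().__init__()
        # Placeholder encoders; realistically these would be
        # pretrained CNNs (e.g. a ResNet minus its classifier head).
        self.shirt_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.pants_encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        # Small head that turns the pair of embeddings into one score.
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, shirt, pants):
        s = self.shirt_encoder(shirt)
        p = self.pants_encoder(pants)
        logit = self.head(torch.cat([s, p], dim=1))
        return torch.sigmoid(logit)  # 0.0-1.0 match strength

# Train with binary cross-entropy against the match matrix:
# a (shirt, pants) pair is labelled 1 if they match, 0 otherwise.
model = MatchNet()
shirts = torch.randn(4, 3, 64, 64)  # dummy batch of shirt images
pants = torch.randn(4, 3, 64, 64)   # dummy batch of pants images
scores = model(shirts, pants)       # shape (4, 1)
loss = nn.BCELoss()(scores, torch.ones(4, 1))
```

The idea would be to train on (shirt, pants) pairs from the match matrix, then at inference time score one shirt against every pair of pants to get the 0.0-1.0 output vector. No idea if this is the standard approach, though, which is partly what I'm asking.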

My other thought was that it's a multi-class classifier, where you input the shirts and use a one-hot encoding to represent the index of the matching pants. Then use PCA/cosine distance to find pants that are similar, based on the vector from the previous CNN (sketch of that step below). My concern here is that because the network never 'sees' the pants images, it doesn't really learn the concept of a 'match', e.g. this stripy shirt matches these plain pants…
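For reference, this is what I mean by the cosine-distance step. The embeddings here are made up; they'd really come from the CNN above.

```python
import torch
import torch.nn.functional as F

# Made-up embeddings: one query shirt vs. a catalogue of 50 pants.
shirt_vec = torch.randn(128)       # embedding of the query shirt
pants_vecs = torch.randn(50, 128)  # embeddings of the pants

# Cosine similarity between the shirt and every pair of pants.
sims = F.cosine_similarity(shirt_vec.unsqueeze(0), pants_vecs, dim=1)

# Rescale from [-1, 1] to [0, 1] to read it as a match strength,
# then rank the pants by score. Note this measures visual similarity
# of the embeddings, not learned compatibility, which is my worry.
scores = (sims + 1) / 2
best = scores.argsort(descending=True)
print(best[:5], scores[best[:5]])
```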

I’m confusing myself, so if anyone can help that would be awesome.