I came across something very interesting and recent:
DiscoGAN is a new approach for mapping between two domains. The domains can be anything: bag <-> shoe; left side of face <-> right side; blond <-> dark hair, etc.
Here is the paper:
Here is an implementation (with lots of pictures):
This sounds similar to what Jeremy said: that you can stick a CNN between any two things and it should work. But here it works without needing supervised data.
I plan to play around with it and report whatever I find.