Is it straightforward to train a neural net that takes a style image as input and applies that style to a (fixed) content image?

Apologies for the somewhat vague title. It is easier to summarize in a brief paragraph what I’m getting at:

In fast neural style transfer (using perceptual loss), the idea is to pick a given style image (say Picasso’s The Muse) and spend some upfront time training a transformer network which, in its forward pass, transforms any input content image into one with the style of The Muse applied to it.
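To make that concrete, here’s roughly what I understand the training objective to look like (a sketch assuming PyTorch/torchvision; the layer indices, loss weights, and names like `transformer` and `the_muse` are just placeholders, not any particular implementation):

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Pre-trained VGG16, used only as a fixed feature extractor.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def vgg_features(x, layers=(3, 8, 15, 22)):  # relu1_2, relu2_2, relu3_3, relu4_3
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats.append(x)
    return feats

def gram(f):
    # Gram matrix of a (batch, channels, h, w) feature map.
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_loss(output, content_target, style_target,
                    content_weight=1.0, style_weight=1e5):
    out_f = vgg_features(output)
    content_f = vgg_features(content_target)
    style_f = vgg_features(style_target)
    # Content: match mid-level features; style: match Gram matrices at all layers.
    content_loss = F.mse_loss(out_f[1], content_f[1])
    style_loss = sum(F.mse_loss(gram(o), gram(s)) for o, s in zip(out_f, style_f))
    return content_weight * content_loss + style_weight * style_loss

# Standard training step (fixed style, variable content): the batch from COCO is
# both the transformer's input and the content target, e.g.
# loss = perceptual_loss(transformer(batch), content_target=batch,
#                        style_target=the_muse)   # the_muse broadcast over the batch
```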

That setup is attractive for apps and websites that let you upload your own picture and quickly apply one of the offered styles to it: lots of potential content images, only a few available style images.

Now I have a different scenario in mind: what if I have one particular content image, and want to quickly apply any number of styles to it? Say I take a selfie and want to quickly check how it’d look in the style of Picasso, of Munch’s The Scream, of Van Gogh’s self-portraits, and of a large number of other styles.

So now I’d want a transformer network whose input is a style image, and whose output is an image in that style but with my selfie as the content.

So in short: the standard fast neural style transfer network has variable content and fixed style. What I’m after is a network with variable style and fixed content.

Looking at the architecture used for fast neural style transfer, it seems I’d just need to rewrite some of the plumbing. The image transformation network we’re training takes the training image as input; its output is then fed into something like a pre-trained VGG16, along with the content image and the style image. In the “standard” architecture, the content image is the same as the training image (the input), whereas the style image is always the same. My thought is that we could simply swap this around: use the training image as the style image, while the content image stays fixed.
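Concretely, reusing `perceptual_loss` from the sketch above, I think the training step would only change along these lines (again a sketch; `style_loader`, `my_selfie`, `transformer`, and `optimizer` are placeholder names):

```python
# Swapped training step (fixed content, variable style).
# `style_loader` is assumed to yield batches of style images, and `my_selfie`
# is a single (1, 3, H, W) content image matching the output resolution.
for style_batch in style_loader:
    output = transformer(style_batch)        # the style image is now the network input
    loss = perceptual_loss(
        output,
        content_target=my_selfie.expand_as(output),  # fixed content, broadcast over the batch
        style_target=style_batch,                    # style target = the input itself
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```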

Naively, that should work. What I’m wondering is whether we’d need to drastically change the architecture of the network we’re training, and whether we should use a different set of training images: the standard dataset used to train the fast style transfer network is COCO. Maybe a large dataset of potential style images would make more sense? Does such a dataset exist?
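If such a dataset exists, I imagine the data-pipeline change is just pointing a standard loader at a directory of style images instead of COCO, e.g. (hypothetical path):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

style_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])
# ImageFolder expects class subfolders and yields (image, label) pairs; the labels
# would just be ignored (i.e. the loop above becomes `for style_batch, _ in style_loader:`).
style_dataset = datasets.ImageFolder("path/to/style_images", transform=style_transform)
style_loader = DataLoader(style_dataset, batch_size=4, shuffle=True)
```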

Has anyone else played with an idea like this? Other things I’m missing?