Fast Style Transfer in fastai v1

Hello fastai enthusiasts!
I would like to implement Fast Neural Style Transfer (FNST) using fastai v1.
I already did it in plain PyTorch, inspired, for the data part, by the DataBlock API as explained by @jeremy in v3 part 2 (here and here for my previous work).

Given this notebook covered in the awesome Lesson 7 v3 part 1, I thought it would be a good idea to build on it and give FNST a new try.

In theory, to make this happen, I just have to customize which data gets fed into the learner and the loss function. The unet_learner will take care of training without further edits.

To do that, I have 2 choices:

  1. Edit src.label_from_func(lambda x: path_hr/ in order to label each input image with both a content and a style image, with something like src.label_from_func(lambda x: [path_to_content, path_to_style]), i.e. label every image with a list of 2 images. I cannot get this to work.
  2. Technically, option 1 is redundant, since I could use my input image as the content image, so nothing needs to change in the labeling logic (src.label_from_func(lambda x: path_to_style) would work). In this case, of course, I would have to change the forward pass of FeatureLoss to accept 3 arguments: the input to the unet (i.e. the content image), the target (i.e. the style image) and pred (i.e. the output of the unet). I cannot find any documentation online on how to hack a PyTorch loss function to do that. For obvious reasons, it normally accepts preds and targets only.

Would anyone be so kind as to put me on the right track here, please?
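
For reference, a minimal plain-PyTorch sketch of what option 2 could look like. Nothing stops an nn.Module loss from taking three tensors in forward; the constraint is only that the training loop has to pass them. Everything here is an assumption for illustration: ThreeInputLoss, the toy extractor and the toy model are hypothetical stand-ins (not FeatureLoss, VGG, or the unet), and the loop is a hand-written step rather than the fastai Learner.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gram(feat):
    # (B, C, H, W) -> (B, C, C), normalised by the size of the feature map
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

class ThreeInputLoss(nn.Module):
    """Hypothetical loss whose forward takes pred, content and style."""
    def __init__(self, extractor, style_weight=1.0):
        super().__init__()
        self.extractor = extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)  # the feature extractor stays frozen
        self.style_weight = style_weight

    def forward(self, pred, content, style):
        fp, fc, fs = (self.extractor(t) for t in (pred, content, style))
        content_loss = F.mse_loss(fp, fc)
        style_loss = F.mse_loss(gram(fp), gram(fs))
        return content_loss + self.style_weight * style_loss

# Toy stand-ins (assumptions): a one-layer "VGG" and a one-layer "unet".
torch.manual_seed(0)
extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
model = nn.Conv2d(3, 3, 3, padding=1)
loss_fn = ThreeInputLoss(extractor, style_weight=10.0)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

content = torch.rand(2, 3, 32, 32)                       # batch of inputs
style = torch.rand(1, 3, 32, 32).expand(2, -1, -1, -1)   # same style for all

# One manual training step: the loop, not the loss, supplies the third tensor.
pred = model(content)
loss = loss_fn(pred, content, style)
opt.zero_grad()
loss.backward()
opt.step()
```

The point of the sketch is only the argument plumbing; the layer choice and the weighting would need the usual tuning.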

Thanks so much!


Hi Francesco,

I’m trying to learn from your style transfer notebooks, I really appreciate your work!
Have you tried to subclass the SegmentationLabelList class and make a new open() function that returns a list of two images?
Or, if you have solved your problem already, can I see how? ;-D

Hi Chris,
no, I haven’t tried that. What I implemented is plain PyTorch code and, not being very familiar with the DataBlock API, I don’t really know where to start to get a custom solution up and running.
I will try what you suggest!
It seems like a good starting point!
Thanks for the suggestion

Hello @FraPochetti, I hope you are still interested in this.

I’m also trying to implement this in v1 (and v2 as well). To solve this problem I went in an orthogonal direction from what you proposed.

Since we only train the network on one specific style, we only need to calculate the style features once, before we start training. We can then use those pre-calculated values in our loss function, so we don’t have to modify the DataBlock API. In this case both x and y are going to be the same image!
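
A minimal plain-PyTorch sketch of that idea, under stated assumptions: PrecomputedStyleLoss and the one-layer extractor are hypothetical stand-ins (the linked code may differ), and the Gram matrix is used as the style statistic. The style Gram matrix is computed once in the constructor and stored as a buffer, so the loss keeps the usual (pred, target) signature, with target being the content image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gram(feat):
    # (B, C, H, W) -> (B, C, C), normalised by the size of the feature map
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

class PrecomputedStyleLoss(nn.Module):
    """Hypothetical loss: style statistics are computed once, up front."""
    def __init__(self, extractor, style_img, style_weight=1.0):
        super().__init__()
        self.extractor = extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)
        with torch.no_grad():
            style_gram = gram(self.extractor(style_img))  # (1, C, C)
        # A buffer follows the module across .to(device) calls.
        self.register_buffer("style_gram", style_gram)
        self.style_weight = style_weight

    def forward(self, pred, target):
        # target is the content image, i.e. the same image fed to the model
        fp = self.extractor(pred)
        fc = self.extractor(target)
        content_loss = F.mse_loss(fp, fc)
        style_target = self.style_gram.expand(fp.size(0), -1, -1)
        style_loss = F.mse_loss(gram(fp), style_target)
        return content_loss + self.style_weight * style_loss

# Toy usage (assumptions): random tensors instead of real images.
torch.manual_seed(0)
extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
style_img = torch.rand(1, 3, 32, 32)
loss_fn = PrecomputedStyleLoss(extractor, style_img, style_weight=10.0)

x = torch.rand(4, 3, 32, 32)            # content batch, used as both x and y
pred = x.clone().requires_grad_(True)   # stand-in for the model output
loss = loss_fn(pred, x)
loss.backward()
```

Because the signature is still (pred, target), a loss shaped like this should be usable wherever a two-argument loss function is expected.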

I don’t think I was able to properly explain what I did, please take a look at the code here for more details.

Style transfer is a very new field for me, so I still haven’t managed to get good results. At this point I’m basically trying to debug my code by following your blog post :grin:

If you’re interested we can even collaborate on this project :partying_face:

@lgvaz this is a great approach! Thanks for sharing! Admittedly, I had not thought about it :wink:.
I quickly looked at your code and it makes complete sense.
For now, I am tied up with another side project, so I think I will be able to look into it again in the second half of December. Unless you figure it out first, of course!
In any case, I will keep you posted.

It is pretty weird that your approach is not working as everything looks good.
It seems there is not enough style.
Have you tried increasing stl_loss_mult to something like 1e10 and using cnt_lsw=[1,1,1] (just to check the impact of an unweighted content loss)?
Also, FYI, in my code I incorporate the total variation loss into the mix, but it is almost useless.
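
For readers unfamiliar with the term, a minimal sketch of a total variation loss in PyTorch. This is one common variant (mean absolute difference between neighbouring pixels), written as an assumption; it is not necessarily the exact formula used in the blog post.

```python
import torch

def total_variation_loss(img):
    # img: (B, C, H, W). Penalises differences between adjacent pixels,
    # which pushes the output towards spatial smoothness.
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()  # vertical neighbours
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()  # horizontal neighbours
    return dh + dw

torch.manual_seed(0)
flat = torch.full((1, 3, 8, 8), 0.5)   # constant image: no variation at all
noisy = torch.rand(1, 3, 8, 8)         # random image: high variation

tv_flat = total_variation_loss(flat)
tv_noisy = total_variation_loss(noisy)
```

It would be added to the content and style terms with its own small weight.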

First of all thank you for the help =)

I tried increasing stl_loss_mult, but then I ended up with a big orange mess. I got the impression that this term is very sensitive: either I get the exact input image or a big orange mess.

I also saw the tv loss in your blog post, but since you said it was not making any significant difference I decided to not implement it. :laughing:

Another difference is that I’m using resnet as a backbone instead of the transformer net. In your blog post you said you were not getting good results with resnet18, but how bad was it?

I’m also using SpectralNorm instead of InstanceNorm, for the sole reason that InstanceNorm is not built into fastai, but I do expect results to get better with InstanceNorm.

So just to list the differences (the ones I can spot at least):

  • Resnet backbone instead of transformer net
  • SpectralNorm instead of InstanceNorm
  • In the content loss I’m not only using relu3_3 but instead a weighted combination of relu2_2, relu3_3 and relu4_3
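
A minimal sketch of the last difference, a content loss taken as a weighted sum over several layers. The three-block extractor is a toy stand-in for the VGG activations (relu2_2, relu3_3, relu4_3), and weighted_content_loss and layer_weights are hypothetical names; a weight list of [0, 1, 0] falls back to a single-layer content loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerFeatures(nn.Module):
    """Toy stand-in for a VGG trunk returning activations at three depths."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return feats

def weighted_content_loss(extractor, pred, content, layer_weights):
    fp = extractor(pred)
    with torch.no_grad():          # no gradient needed for the content target
        fc = extractor(content)
    # One MSE term per layer, each scaled by its own weight.
    return sum(w * F.mse_loss(p, c) for w, p, c in zip(layer_weights, fp, fc))

torch.manual_seed(0)
extractor = MultiLayerFeatures().eval()
pred = torch.rand(2, 3, 32, 32)
content = torch.rand(2, 3, 32, 32)

multi = weighted_content_loss(extractor, pred, content, [1.0, 1.0, 1.0])
single = weighted_content_loss(extractor, pred, content, [0.0, 1.0, 0.0])
```

Switching between the two settings is a cheap way to check whether the multi-layer weighting is what breaks the balance between content and style.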

Today I was thinking about switching from resnet to the transformer net to see the results. The problem is that I currently don’t know whether this bad result is caused by the differences described above or by a bug.

I agree with your analysis.
Resnet was worse in the sense that, even when tuning the content and style weights, I ended up with either all style or all content. It was quite hard to find the sweet spot. Very surprising.
So indeed the architecture could be the real deal breaker for you too.
Overall what I noticed is that style transfer is massively affected by the content and style weights in the loss function, as quite easily one overtakes the other. Really sensitive stuff.
I also agree with you that it is hard to know if your issue is driven by a bug rather than something else.
Probably the easiest is to switch to a different arch and see how it goes.

I will for sure get back to this problem myself by the end of the year, but please do not hesitate to keep me posted on your experiments. I find this super interesting!

Alright! So I’m going to start by trying TN instead of ResNet, I’m for sure going to keep you updated.

Thank you again for the help Pochetti =)

Hey @FraPochetti, so I got this to work pretty well (I think) and it only takes about an hour to train (one epoch on COCO 256x256). These are the results I’m getting:

There are some “bugs” happening in some parts of the image (take a look at the legs), any idea on what that might be?


Really cool man!
Which style is that?
How did you fix it?
As for the bugs, I had noticed that they sometimes occurred for me too. No clue where they might come from. I had tried clipping the pixel values to check if that would remove the artifacts, with no luck.

You can find the style and the code in this notebook.

The code is a complete mess, I had to change some fastai building blocks to incorporate InstanceNorm (hopefully I can do a PR with these changes).

Two things helped fix the issue:

  1. I changed the architecture to one that more closely resembles Transformer Net. I still wanted to use DynamicUnet from fastai, because it has PixelShuffle, blur, self-attention, and all that fun stuff, so I built an encoder that resembles TN and let fastai figure out the decoder.

  2. Before, I was using a combination of layers in the content loss; now I’m using the single layer described in the paper. I still think using multiple layers should give a better result somehow… Still have to figure that out (imagine using different styles on different layers :scream: :scream: at the same time)


Another fun fact: when training a new style we don’t have to start from scratch. Starting from the weights of a previous style can significantly speed up learning.
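
A minimal plain-PyTorch sketch of that warm start, assuming both networks share the same architecture. make_net is a hypothetical toy stand-in for the real generator, and the state dict is copied in memory here; in practice it would come from a checkpoint saved after training the previous style.

```python
import copy
import torch
import torch.nn as nn

def make_net():
    # Toy stand-in (assumption) for the style transfer network.
    return nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
        nn.Conv2d(8, 3, 3, padding=1),
    )

torch.manual_seed(0)
prev_style_net = make_net()   # pretend this was trained on an earlier style
state = copy.deepcopy(prev_style_net.state_dict())

new_style_net = make_net()            # freshly initialised, different weights
new_style_net.load_state_dict(state)  # warm start from the previous style

# Training on the new style then continues from these weights
# instead of from a random initialisation.
```

With a fastai Learner, saving after the first style and loading before the second would presumably play the same role.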


Cool stuff man. Great work!

Hey @lgvaz you are using fastaiV2, the one currently under development, right?

Correct sir!

Your results are quite neat indeed.
I like them a lot better than what I got in my experiments (here for Picasso).
Kandinsky looked nicer though :slight_smile: .
And it was super hard to get there!

This is why I tried to turn towards fastai after a while, as I was not doing anything special on the training side (fit_one_cycle etc) and I was missing all the goodies (the ones from DynamicUnet you already mentioned above).

Overall, it always blows my mind how hard it is to get stuff right.
I mean, you actually MUST use the transformer net architecture to get something decent.
You get a detail wrong, everything is completely screwed up :smiley:


What blows my mind is the potential of this thing. The transformer net is a very simple arch that doesn’t get even close to U-net + resnet backbone.

I still want to make this work with a big resnet and I think some small tweaks might be what’s left.

A very good question is: “Can results be even better than what we’re seeing right now?”

I think the answer is yes. If you look closely, the results tend to be very repetitive at some points, especially in the background. And there are still those “break points” in the image. I don’t have any idea why that happens, but it’s related to the U-net optimizations described above.

And there is the possibility of merging multiple styles as well!!! That would be bomb!

Another thing I would like to do is to make the hyper-parameters less sensitive. I saw that you implemented auto scaling of the losses in your implementation; how much did that improve your experiments?


Yeah, I agree results can and should be better.
I also think, as you already noticed, that hyper-parameters were super sensitive.
In my case, I did not really scale content and style losses in a truly automated fashion.
I just ran a couple of random batches through the unscaled model and calculated the content2style ratio.
It helped in the sense that, after multiplying by this ratio, I had content and style on the same scale and so, if I wanted to give more weight to one versus the other, I had to multiply by human-sized numbers. Like 1.5, 2, or something like that. Not 1e10. I was more in control.

From a practical standpoint, I am not sure how much of a real difference that made on the end result.
At the end of the day I was still pre-multiplying by the huge (or tiny) content2style ratio.
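
A minimal sketch of that rescaling in plain Python. The loss values are made-up numbers for illustration only, and content_to_style_ratio is a hypothetical helper, not the function from the original implementation.

```python
def content_to_style_ratio(content_losses, style_losses):
    # Average each unscaled loss over a few batches, then take the ratio.
    mean_content = sum(content_losses) / len(content_losses)
    mean_style = sum(style_losses) / len(style_losses)
    return mean_content / mean_style

# Made-up unscaled losses from two random batches (illustrative numbers only).
content_losses = [2.0, 4.0]
style_losses = [2e-6, 4e-6]

ratio = content_to_style_ratio(content_losses, style_losses)

# After multiplying by the ratio, style is on the same scale as content,
# so the remaining knob can be a human-sized number.
style_weight = 1.5
raw_style_loss = 3e-6
scaled_style_loss = style_weight * ratio * raw_style_loss
```

The huge factor has not disappeared; it is simply folded into ratio, which matches the caveat above about still pre-multiplying by a huge (or tiny) number.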
