Fast Style Transfer in fastai v1

FraPochetti · November 2, 2019, 4:59pm

Hello fastai enthusiasts!
I would like to implement Fast Neural Style Transfer (FNST) using fastai v1.
I already did it in plain PyTorch, inspired, for the data part, by the DataBlock API as explained by @jeremy in v3 part 2 (here and here for my previous work).

Given this notebook covered in the awesome Lesson 7 v3 part 1, I thought it would be a good idea to build on it and give FNST a new try.

In theory, to make this happen, I just have to customize which data gets fed into the learner and the loss function. The unet_learner will take care of training without further edits.

To do that, I have 2 choices:

1. Edit src.label_from_func(lambda x: path_hr/x.name) to label each input image with both a content and a style image, with something like src.label_from_func(lambda x: [path_to_content, path_to_style]), i.e. label every image with a list of 2 images. I cannot get this to work.
2. Technically, option 1 is redundant, as I could use my input image as the content image, so nothing needs to change in the labeling logic (src.label_from_func(lambda x: path_to_style) would work). In this case, of course, I would have to change the forward pass of FeatureLoss to accept 3 arguments: the input to the unet (i.e. the content image), the target (i.e. the style image) and the pred (i.e. the output of the unet). I cannot find any documentation online on how to hack a PyTorch loss function to do that. For obvious reasons, it normally accepts preds and targets only.

Would anyone be so kind as to put me on the right track here, please?

Thanks so much!

chrisdinant · November 9, 2019, 3:52pm

Hi Francesco,

I’m trying to learn from your style transfer notebooks, I really appreciate your work!
Have you tried to subclass the SegmentationLabelList class and make a new open() function that returns a list of two images?
Or if you solved your problem already, can I see how? ;-D

FraPochetti · November 10, 2019, 3:52pm

Hi Chris,
no, I haven’t tried that. What I implemented is plain PyTorch code and, not being very familiar with the fast.ai DataBlock API, I don’t really know where to start to get a custom solution up and running.
I will try what you suggest!
It seems like a good starting point!
Thanks for the suggestion

lgvaz · November 23, 2019, 12:23am

Hello @FraPochetti, I hope you are still interested in this.

I’m also trying to implement this in v1 (and v2 as well). To solve this problem I went in an orthogonal direction from what you proposed.

Since we only train the network on one specific style, we only need to calculate the style features once, before training starts. We can then use that pre-calculated value in our loss function, so we don’t have to modify the DataBlock API. In this case both x and y are going to be the same image!
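A minimal sketch of this idea in plain PyTorch (not the code linked in this post). It assumes a single feature layer and uses a tiny random conv layer as a stand-in for a pretrained feature extractor such as VGG; the names gram_matrix, PrecomputedStyleLoss and style_mult are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gram_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Channel correlations of a (B, C, H, W) feature map, shape (B, C, C)."""
    b, c, h, w = feats.shape
    flat = feats.reshape(b, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)


class PrecomputedStyleLoss(nn.Module):
    """Content + style loss where the style Gram matrix is computed only once."""

    def __init__(self, extractor: nn.Module, style_img: torch.Tensor,
                 style_mult: float = 1.0):
        super().__init__()
        self.extractor = extractor.eval()
        for p in self.extractor.parameters():
            p.requires_grad_(False)  # the feature extractor is never trained
        self.style_mult = style_mult
        with torch.no_grad():
            gram = gram_matrix(self.extractor(style_img))
        # Stored as a buffer so it moves with the module (.to(device), etc.).
        self.register_buffer("style_gram", gram)

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # target is the content image, i.e. the same image fed to the network.
        pred_feats = self.extractor(pred)
        with torch.no_grad():
            content_feats = self.extractor(target)
        content_loss = F.mse_loss(pred_feats, content_feats)
        style_target = self.style_gram.expand(pred.size(0), -1, -1)
        style_loss = F.mse_loss(gram_matrix(pred_feats), style_target)
        return content_loss + self.style_mult * style_loss


# Stand-in extractor and random images, only to show the call pattern.
torch.manual_seed(0)
extractor = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
style_img = torch.rand(1, 3, 32, 32)
loss_fn = PrecomputedStyleLoss(extractor, style_img, style_mult=10.0)

x = torch.rand(4, 3, 32, 32)             # batch of content images
pred = x.clone().requires_grad_(True)    # pretend network output
loss = loss_fn(pred, x)                  # same (pred, target) signature as usual
loss.backward()
```

Because the style target lives inside the loss object, the loss keeps the usual two-argument (pred, target) call, which is what lets x and y be the same image.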

I don’t think I was able to properly explain what I did, please take a look at the code here for more details.

Style transfer is a very new field for me, so I still haven’t managed to get good results. At this point I’m basically trying to debug my code by following your blog post.

If you’re interested we can even collaborate on this project.

FraPochetti · November 23, 2019, 10:18am

@lgvaz this is a great approach! Thanks for sharing! Admittedly, I had not thought about it.
I quickly looked at your code and it makes complete sense.
For now, I am busy with another side project, so I think I will be able to look into it again in the second half of December. Unless you figure it out first, of course!
In any case, I will keep you posted.

FraPochetti · November 23, 2019, 10:35am

It is pretty weird that your approach is not working as everything looks good.
It seems there is not enough style.
Have you tried increasing stl_loss_mult to something like 1e10 and using cnt_lsw=[1,1,1] (just to check the impact of an unweighted content loss)?
Also, FYI, in my code I incorporate the total variation loss into the mix, but it is almost useless.
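For reference, a minimal sketch of one common total variation formulation (mean absolute difference between neighbouring pixels). This is an assumption about the general technique, not necessarily the exact variant used in the blog post code.

```python
import torch


def total_variation_loss(img: torch.Tensor) -> torch.Tensor:
    """Mean absolute difference between neighbouring pixels of a (B, C, H, W) batch."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()  # vertical neighbours
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()  # horizontal neighbours
    return dh + dw


flat = torch.ones(1, 3, 8, 8)                     # constant image: no variation
step = torch.tensor([[[[0.0, 1.0], [0.0, 1.0]]]])  # one horizontal edge
tv_flat = total_variation_loss(flat)
tv_step = total_variation_loss(step)
```

It would typically be added to the content and style terms with its own small multiplier, which is one more weight to tune.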

lgvaz · November 23, 2019, 2:52pm

First of all thank you for the help =)

I tried increasing stl_loss_mult, but then I ended up with a big orange mess. I got the impression that this term is very sensitive: either I get the exact input image or a big orange mess.

I also saw the tv loss in your blog post, but since you said it was not making any significant difference I decided to not implement it.

Another difference is that I’m using resnet as a backbone instead of the transformer net. In your blog post you said you were not getting good results with resnet18, but how bad was it?

I’m also using SpectralNorm instead of InstanceNorm, for the sole reason that InstanceNorm is not built into fastai, but I do expect results to get better with InstanceNorm.
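To make the two options concrete, here is a small plain-PyTorch sketch of a conv block with either normalization. It does not touch fastai's own layer builders, and conv_block with its norm argument is a hypothetical helper, not an existing API.

```python
import torch
import torch.nn as nn


def conv_block(ni: int, nf: int, norm: str = "instance") -> nn.Sequential:
    """3x3 conv followed by ReLU, with instance norm or spectral norm."""
    conv = nn.Conv2d(ni, nf, kernel_size=3, padding=1)
    if norm == "instance":
        # Normalizes every channel of every image independently.
        layers = [conv, nn.InstanceNorm2d(nf, affine=True)]
    elif norm == "spectral":
        # Rescales the conv weight by its largest singular value instead.
        layers = [nn.utils.spectral_norm(conv)]
    else:
        raise ValueError(f"unknown norm: {norm}")
    layers.append(nn.ReLU())
    return nn.Sequential(*layers)


torch.manual_seed(0)
x = torch.rand(2, 3, 16, 16)
inst_block = conv_block(3, 8, norm="instance")
spec_block = conv_block(3, 8, norm="spectral")
inst_out = inst_block(x)
spec_out = spec_block(x)
```

The difference is where the normalization acts: InstanceNorm works on the activations of each image, while spectral norm only constrains the weights.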

So just to list the differences (that I can spot at least):

Resnet backbone instead of transformer net
SpectralNorm instead of InstanceNorm
In the content loss I’m not using relu3_3 alone, but a weighted combination of relu2_2, relu3_3 and relu4_3
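The last difference can be sketched as follows, assuming the feature maps for each layer have already been extracted. The function name and the random tensors standing in for relu2_2, relu3_3 and relu4_3 activations are illustrative only.

```python
import torch
import torch.nn.functional as F


def weighted_content_loss(pred_feats, target_feats, layer_weights):
    """Sum of per-layer MSE losses, each scaled by its own weight."""
    if not (len(pred_feats) == len(target_feats) == len(layer_weights)):
        raise ValueError("need one weight per feature layer")
    return sum(w * F.mse_loss(p, t)
               for p, t, w in zip(pred_feats, target_feats, layer_weights))


torch.manual_seed(0)
# Random stand-ins for three feature layers of decreasing resolution.
shapes = [(2, 8, 16, 16), (2, 16, 8, 8), (2, 32, 4, 4)]
pred_feats = [torch.rand(s) for s in shapes]
target_feats = [torch.rand(s) for s in shapes]

unweighted = weighted_content_loss(pred_feats, target_feats, [1, 1, 1])
single = weighted_content_loss(pred_feats, target_feats, [0, 1, 0])
```

Weights of [0, 1, 0] reduce this to a single-layer content loss, so switching between the two setups is only a change of weights.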

Today I was thinking about switching from resnet to the transformer net to see the results. The problem is that currently I don’t know if this bad result is caused by the differences described above or by a bug.

FraPochetti · November 23, 2019, 5:14pm

I agree with your analysis.
Resnet was worse in the sense that even tuning the content and style weights I either ended up with all style or with all content. It was quite hard to find the sweet spot. Very surprising.
So indeed the architecture could be the real deal breaker for you too.
Overall what I noticed is that style transfer is massively affected by the content and style weights in the loss function, as quite easily one overtakes the other. Really sensitive stuff.
I also agree with you that it is hard to know if your issue is driven by a bug rather than something else.
Probably the easiest is to switch to a different arch and see how it goes.

FraPochetti · November 23, 2019, 5:16pm

I will for sure get back to this problem myself by the end of the year, but please do not hesitate to keep me posted on your experiments. I find this super interesting!

lgvaz · November 23, 2019, 5:18pm

Alright! So I’m going to start by trying TN instead of ResNet, I’m for sure going to keep you updated.

Thank you again for the help Pochetti =)

lgvaz · November 27, 2019, 6:00pm

Hey @FraPochetti, so I got this to work pretty well (I think) and it only takes about an hour to train (one epoch on COCO 256x256). These are the results I’m getting:

There are some “bugs” happening in some parts of the image (take a look at the legs), any idea on what that might be?

FraPochetti · November 27, 2019, 6:14pm

Really cool man!
Which style is that?
How did you fix it?
As for the bugs, I had noticed them occasionally in my results too. No clue where they might come from. I had tried clipping the pixel values to check if that would remove the artifacts, with no luck.

lgvaz · November 27, 2019, 6:21pm

You can find the style and the code in this notebook.

The code is a complete mess, I had to change some fastai building blocks to incorporate InstanceNorm (hopefully I can do a PR with these changes).

Two things helped fix the issue:

I changed the architecture to one that more closely resembles Transformer Net. I still wanted to use DynamicUnet from fastai, because it has PixelShuffle, blur, self-attention, and all that fun stuff, so I built an encoder that resembles TN and let fastai figure out the decoder.

Before, I was using a combination of layers in the content loss; now I’m using the single layer described in the paper. I still think using multiple layers should give a better result somehow… Still have to figure that out (imagine using different styles on different layers at the same time).

lgvaz · November 27, 2019, 6:24pm

Another fun fact: when training a new style we don’t have to start from scratch. Starting from the weights of a previous style can significantly speed up learning.
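In plain PyTorch terms, a warm start like this comes down to copying a state dict. A minimal sketch, where make_net is a hypothetical tiny stand-in for the real style network:

```python
import torch
import torch.nn as nn


def make_net() -> nn.Sequential:
    """Tiny image-to-image network standing in for the style transfer model."""
    return nn.Sequential(
        nn.Conv2d(3, 8, 3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 3, 3, padding=1),
    )


prev_style_net = make_net()  # pretend this was already trained on style A
new_style_net = make_net()   # about to be trained on style B

# Initialize the new model from the previous style's weights.
new_style_net.load_state_dict(prev_style_net.state_dict())
```

With weights stored on disk, the same thing is usually done with torch.save(model.state_dict(), path) after the first run and load_state_dict(torch.load(path)) before the second.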

FraPochetti · November 27, 2019, 6:37pm

Cool stuff man. Great work!

FraPochetti · November 28, 2019, 7:06am

Hey @lgvaz you are using fastaiV2, the one currently under development, right?

lgvaz · November 28, 2019, 12:43pm

Correct sir!

FraPochetti · November 28, 2019, 3:37pm

Your results are quite neat indeed.
I like them a lot better than what I got in my experiments (here for Picasso).
Kandinsky looked nicer though.
And it was super hard to get there!

This is why I tried to turn towards fast.ai after a while, as I was not doing anything special on the training side (fit_one_cycle etc) and I was missing all the fast.ai goodies (the ones you mentioned above already from DynamicUnet).

Overall, it always blows my mind how hard it is to get stuff right.
I mean, you actually MUST use the transformer net architecture to get something decent.
You get one detail wrong and everything is completely screwed up.

lgvaz · November 28, 2019, 4:09pm

What blows my mind is the potential of this thing. The transformer net is a very simple arch that doesn’t get even close to U-net + resnet backbone.

I still want to make this work with a big resnet and I think some small tweaks might be what’s left.

A very good question is: “Can results be even better than what we’re seeing right now?”

I think the answer is yes. If you look closely, the results tend to be very repetitive at some points, especially in the background. And there are still those “break points” in the image. I don’t have any idea why that happens, but it’s related to the U-net optimizations described above.

And there is the possibility of merging multiple styles as well!!! That would be bomb!

Another thing I would like to do is make the hyper-parameters less sensitive. I saw that you implemented auto scaling of the losses in your implementation; how much did that improve your experiments?

FraPochetti · November 28, 2019, 4:34pm

Yeah, I agree results can and should be better.
I also think, as you already noticed, that hyper-parameters were super sensitive.
In my case, I did not really scale content and style losses in a truly automated fashion.
I just ran a couple of random batches through the unscaled model and calculated the content2style ratio.
It helped in the sense that, after multiplying by this ratio, I had content and style on the same scale and so, if I wanted to give more weight to one versus the other, I had to multiply by human-sized numbers. Like 1.5, 2, or something like that. Not 1e10. I was more in control.
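A rough sketch of that scaling, assuming the ratio is simply the mean unscaled content loss divided by the mean unscaled style loss. The loss values below are made-up numbers for illustration, not measurements, and the function names are hypothetical.

```python
import torch


def content_to_style_ratio(content_losses, style_losses) -> float:
    """Mean unscaled content loss divided by mean unscaled style loss."""
    c = torch.tensor(content_losses, dtype=torch.float64).mean()
    s = torch.tensor(style_losses, dtype=torch.float64).mean()
    return (c / s).item()


# Made-up unscaled losses from a couple of batches.
content_losses = [2.0, 4.0]
style_losses = [2e-9, 4e-9]
ratio = content_to_style_ratio(content_losses, style_losses)


def total_loss(content: float, style: float, style_weight: float = 1.5) -> float:
    """Style is first brought to the content scale, then weighted."""
    return content + style_weight * ratio * style


balanced = total_loss(3.0, 3e-9, style_weight=1.0)  # both terms contribute equally
styled = total_loss(3.0, 3e-9, style_weight=2.0)    # style counts twice as much
```

The style_weight knob then works on a human scale, while the large constant stays hidden inside ratio.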

From a practical standpoint, I am not sure how much of a real difference that made on the end result.
At the end of the day I was still pre-multiplying by the huge (or tiny) content2style ratio.