Fast Style Transfer in fastai v1

Still, the content-to-style ratio is super important.
Even in regular style transfer (not fast), the weight you give to one loss vs. the other has a huge impact, at least in my experiments.
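For reference, the ratio being discussed is just the pair of scalars in the combined objective; a minimal sketch, with placeholder names and values rather than those of any particular implementation:

```python
def total_loss(content_loss, style_loss, content_weight=1.0, style_weight=1e5):
    # The content:style balance discussed above lives entirely in these two scalars.
    # The values here are placeholders; in practice they differ by orders of
    # magnitude because the Gram-based style term is on a very different scale.
    return content_weight * content_loss + style_weight * style_loss
```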


Yes, getting this number out of the sweet spot completely destroys the result. It's not a smooth factor, and that's what I don't like about it. If you are having good results and increase this number just a bit for "a little bit more style", it can completely go all in on style and destroy everything :sweat_smile:


Totally. But I don't know how to fix that.

On top of that, it is not even something you can easily optimize for mathematically (well, probably you can; I have to think about it).
It is very tied to personal taste about the end result (up to a point, of course! A colored mess does not please anybody.)


Hm… Thinking about that, is it possible to put bounds on the content loss?

Can we somehow say "I don't accept a content loss greater than X"? (How to find X? Maybe empirically?)

We could also add some kind of "elastic" aspect to the content loss, meaning that the further we deviate, the more weight we give to it.

We could maybe first train the network only with the content loss and then incrementally add the style loss…
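A minimal sketch of what that "elastic" idea could look like, assuming both losses are PyTorch scalar tensors; the bound and the penalty shape are made up purely for illustration:

```python
import torch

def elastic_total_loss(content_loss, style_loss,
                       content_bound=1.5, style_weight=1e5):
    """Grow the content weight the further content_loss drifts past a bound.

    `content_bound` would have to be found empirically, as discussed above.
    """
    # Quadratic penalty on the excess: negligible inside the bound,
    # dominant once the content starts to degrade.
    excess = torch.relu(content_loss - content_bound)
    content_weight = 1.0 + excess ** 2
    return content_weight * content_loss + style_weight * style_loss
```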

Good questions. It could be an idea.
All in all, the system looked pretty unstable to me, so applying guardrails during training to push things back on track could be reasonable.
Another interesting sign of this inherent instability is that I have never trained it for even one full epoch (on full COCO). I actually trained it on a super small subset, for roughly 12 minutes. Both losses plateaued super quickly, and training for longer would destroy the content-to-style balance. Quite insane.


I am planning on working on deep painterly harmonization soon (https://sgugger.github.io/deep-painterly-harmonization.html#deep-painterly-harmonization), so I will get back to this one!


Oh yes, that was what got me started!! I do want to work on that as well!!


So, after experimenting a lot, I've made some discoveries.

The single most important factor was batch norm in VGG: using batch norm simply destroyed the results. I don't understand why yet, but maybe batch norm obfuscates the importance of each feature?
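For anyone who wants to reproduce the comparison, the two loss-network variants come straight from torchvision; a sketch of the frozen feature extractor:

```python
from torchvision.models import vgg16, vgg16_bn

# Plain VGG16 (no batch norm) worked; the _bn variant destroyed the results.
loss_net = vgg16(pretrained=True).features.eval()
# loss_net = vgg16_bn(pretrained=True).features.eval()  # the problematic variant

for p in loss_net.parameters():
    p.requires_grad = False  # the loss network is never trained
```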

The transformer net is still the best (I do have to do more experiments with ResNets, though), but it seems that ResNets tend to be more repetitive.

The Adam optimizer with fit_one_cycle works better than Ranger (which is confusing?)

I followed the tips from here and used shallower VGG layers; it helps create bigger features.
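In fastai v1 the usual way to grab activations from chosen VGG layers is hook_outputs; a sketch, where the layer indices are only illustrative and not necessarily the ones used in the notebook:

```python
import torch
from torchvision.models import vgg16
from fastai.callbacks.hooks import hook_outputs

# Pre-trained VGG16 without batch norm, frozen, used purely as a feature extractor.
vgg = vgg16(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad = False

# Indices of ReLU layers to tap (relu1_2, relu2_2, relu3_3 in torchvision's vgg16).
# Shallower indices give the larger, lower-level features mentioned above.
layer_ids = [3, 8, 15]
hooks = hook_outputs([vgg[i] for i in layer_ids], detach=False)

_ = vgg(torch.randn(1, 3, 256, 256))  # forward pass populates hooks.stored
feats = hooks.stored                  # one feature map per hooked layer
```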

I still have to experiment with combining multiple layers; it's very tricky though… While the original paper uses relu3 for the content loss, the PyTorch examples use relu2.
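For context, combining layers usually means a weighted sum of Gram-matrix losses plus a single content layer; a sketch with placeholder names and weights:

```python
import torch
import torch.nn.functional as F

def gram(x):
    "Gram matrix of a batch of feature maps, normalized by their size."
    b, c, h, w = x.shape
    x = x.view(b, c, h * w)
    return x @ x.transpose(1, 2) / (c * h * w)

def perceptual_loss(out_feats, content_feats, style_grams,
                    content_layer=1, style_weights=(1., 1., 1.)):
    """out_feats / content_feats: hooked activations for the generated and the
    content image; style_grams: precomputed Gram matrices of the style image at
    the same layers. `content_layer` picks which layer carries the content loss
    (the relu2 vs relu3 choice discussed above)."""
    c_loss = F.mse_loss(out_feats[content_layer], content_feats[content_layer])
    s_loss = 0.
    for w, f, g in zip(style_weights, out_feats, style_grams):
        gf = gram(f)
        s_loss = s_loss + w * F.mse_loss(gf, g.expand_as(gf))
    return c_loss, s_loss
```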

I'll post the code here tomorrow after I clean it up. Maybe I'll also write a blog post about it to describe this in more detail; is that a good idea?

Meanwhile, here's a spoiler of the new implementation:


Congrats man! Your new results look really good! The batch norm thing is a great finding and you should totally write a blog post about all of this.


I will try to apply your findings on my code base and see if I get anything similar!


The Adam optimizer with fit_one_cycle works better than Ranger (which is confusing?)

In the Mish thread it is asserted:

I'm pretty much of the opinion now, after a lot of testing, that OneCycle destroys most of the momentum that smarter optimizers and Mish build up. I continue to see much better results avoiding it, unless it's with vanilla Adam and ReLU. Then it seems to work well.

I've noticed the same thing. I stick with Adam/AdamW with 1cycle.


Here is the link :smile:

Changes and PRs are much appreciated!


I tried training with RAdam once more. I had to use smaller learning rates compared to one_cycle, but the results were actually a little better…
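In fastai v1 the optimizer swap is just a different opt_func; a sketch assuming `data`, `transformer_net` and a perceptual loss already exist as earlier in the thread, and an RAdam implementation such as the one from the torch_optimizer package (learning rates are placeholders):

```python
from functools import partial
from fastai.vision import *
from torch_optimizer import RAdam  # assumed third-party RAdam implementation

# `data`, `transformer_net` and `perceptual_loss` are placeholders for the
# objects built earlier in the thread's notebook.
learn = Learner(data, transformer_net, loss_func=perceptual_loss)
learn.fit_one_cycle(1, max_lr=1e-3)   # default AdamW-style Adam + 1cycle

# RAdam variant: smaller, flat learning rate instead of the 1cycle schedule.
learn_radam = Learner(data, transformer_net, loss_func=perceptual_loss,
                      opt_func=partial(RAdam, betas=(0.9, 0.99)))
learn_radam.fit(1, lr=1e-4)
```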

Clearly I still have a lot of room to experiment here :sweat_smile:


Hey @lgvaz, out of curiosity, how big is your training set?
The 21837 images that path.ls() returns in your notebook?

Correct. That's only a sample of COCO.

I'm currently looking for a dataset with higher-resolution images (like 1024x1024); any advice?

Thanks!
No idea for the high-resolution dataset, unfortunately. Sorry.

@lgvaz here's one that is 2K: https://data.vision.ee.ethz.ch/cvl/DIV2K/


Currently testing it out now. I have a bs of 1 and an image size of 1024, which uses ~9.28 GB on Colab, so we could probably scale it up higher from there. Epoch time isn't very different either (~9 minutes per epoch). Also note, though, that we only have 1,000 images, and I just used the training split, so 800!
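For reference, a sketch of what the data side of that run could look like in fastai v1; the local path, the 80/20 split and the transforms are assumptions, not taken from the actual notebook:

```python
from fastai.vision import *

path = Path('data/DIV2K_train_HR')  # assumed local path after downloading DIV2K

# Style transfer uses the input image itself as the target, hence ImageImageList.
data = (ImageImageList.from_folder(path)
        .split_by_rand_pct(0.2, seed=42)          # ~800 train / ~200 valid
        .label_from_func(lambda x: x)
        .transform(get_transforms(), size=1024, tfm_y=True)
        .databunch(bs=1)                          # ~9 GB of GPU memory at this size
        .normalize(imagenet_stats, do_y=False))
```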

results


This is awesome stuff!!

Do you guys have any idea how the Copista app may be handling HD images:

The app shows a progress bar while working, so it might have split the TensorFlow model into parts (there is an earlier post by the app author which says it uses TensorFlow Mobile).