Still, the content-to-style ratio is super important.
Even in the regular style transfer (not fast), the weight you give to one loss vs the other has a huge impact. At least in my experiments.
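For what it's worth, the ratio in question is just the relative weighting inside the combined objective. A schematic version, with placeholder weights (the 1e5 default is arbitrary, not anyone's tuned value):

```python
def total_loss(content_loss, style_loss, content_weight=1.0, style_weight=1e5):
    """Weighted sum of the two losses.

    A common convention is to fix the content weight at 1 and sweep only
    the style weight, so the content-to-style ratio becomes a single knob.
    """
    return content_weight * content_loss + style_weight * style_loss
```

The instability discussed above is exactly that small changes to `style_weight` can flip which term dominates the gradient.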
Yes, pushing this number out of the sweet spot completely destroys the result. It's not a smooth factor, and that's what I don't like about it. If you are getting good results and increase this number just a bit for "a little bit more style", it can go all in on style and destroy everything.
Totally. But I don't know how to fix that.
On top of that, it is not even something you can easily optimize for mathematically (or maybe you can; I have to think about it).
It is very much tied to personal taste in the end result (up to a point, of course! A colored mess pleases nobody).
Hm… Thinking about that, is it possible to have bounds on the content loss?
Can we somehow say "I don't accept the content loss being greater than X"? (How to find X? Maybe empirically?)
We could also add some kind of "elastic" aspect to the content loss, meaning that the further we deviate, the more weight we give to it.
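A minimal sketch of both ideas together (the empirical bound X and the "elastic" weighting); the names and the quadratic penalty are made up for illustration:

```python
def content_weight(content_loss, bound, base=1.0, elasticity=4.0):
    """Grow the content-loss weight the further we drift past a bound.

    `bound` plays the role of the empirically chosen "X I don't accept
    exceeding": below it the weight stays at `base`, above it the weight
    grows quadratically with the relative overshoot (the elastic pull-back).
    """
    overshoot = max(0.0, content_loss - bound) / bound
    return base + elasticity * overshoot ** 2

# Within bounds: normal weight; past the bound: pulled back hard.
print(content_weight(0.8, bound=1.0))  # 1.0
print(content_weight(2.0, bound=1.0))  # 5.0
```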
We could maybe first train the network with only the content loss and then incrementally add the style loss…
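That last idea could be a simple style-weight schedule; a hypothetical helper (not from anyone's actual training code):

```python
def style_weight(step, warmup=2000, max_weight=1e5):
    """Content-only phase first, then ramp the style weight in linearly.

    Hypothetical schedule: for the first `warmup` steps the style weight
    is 0 (the network trains on content loss alone), then it grows
    linearly over another `warmup` steps until it reaches `max_weight`.
    """
    if step < warmup:
        return 0.0
    return min(1.0, (step - warmup) / warmup) * max_weight
```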
Good questions. It could be an idea.
All in all, the system looked pretty unstable to me, so applying guardrails during training to push things back on track could be reasonable.
Another interesting proof of the model's inherent instability is that I have never trained it for even one full epoch (on full COCO). I actually trained it on a tiny subset, for roughly 12 minutes. Both losses plateaued super quickly, and training for longer would destroy the content-to-style balance. Quite insane.
I am planning on working on deep painterly harmonization soon https://sgugger.github.io/deep-painterly-harmonization.html#deep-painterly-harmonization so I will get back to this one!
Oh yes, that was what got me started!! I do want to work on that as well!!
So after experimenting a lot, I've made some discoveries.
The single most important factor was batch norm in VGG: using batch norm simply destroyed the results. I don't understand why yet, but maybe batch norm obscures the relative importance of each feature?
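One way to see why that could happen (a toy illustration I made up, not anything from the actual training code): Gram-matrix style statistics depend on raw channel magnitudes, and per-channel normalization throws that information away.

```python
def gram_entry(a, b):
    """Inner product of two flattened channels (one Gram-matrix entry)."""
    return sum(x * y for x, y in zip(a, b))

def batch_norm(ch, eps=1e-5):
    """Normalize a channel to zero mean, unit variance (no affine params)."""
    mean = sum(ch) / len(ch)
    var = sum((x - mean) ** 2 for x in ch) / len(ch)
    return [(x - mean) / (var + eps) ** 0.5 for x in ch]

strong = [4.0, 8.0, 6.0, 2.0]   # high-magnitude channel
weak   = [0.4, 0.8, 0.6, 0.2]   # same pattern, 10x smaller

# Raw Gram entries keep the 100x energy gap between the channels...
print(gram_entry(strong, strong))  # 120.0
print(gram_entry(weak, weak))

# ...but after per-channel normalization both channels look identical,
# so the style statistics can no longer tell them apart.
print(gram_entry(batch_norm(strong), batch_norm(strong)))
print(gram_entry(batch_norm(weak), batch_norm(weak)))
```

This is only a guess at the mechanism, but it would fit the "obscures the importance of each feature" intuition.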
The transformer net is still the best (I do have to do more experiments with resnets, though); resnets seem to produce more repetitive patterns.
The Adam optimizer with fit_one_cycle works better than Ranger (which is confusing?).
I followed the tips from here and used shallower VGG layers; it helps create bigger features.
I still have to experiment with combining multiple layers; it's very tricky though… While the original paper uses relu3 for the content loss, the PyTorch examples use relu2.
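For anyone mapping those names onto torchvision's vgg16 (the non-batch-norm variant): these indices are my reading of the layer ordering, so double-check against your own model:

```python
# Indices into torchvision.models.vgg16().features for the ReLU layers
# usually picked for style/content losses (the max pools sit at indices
# 4, 9, 16, 23, 30).  Verify with print(model.features) before relying
# on these.
VGG16_RELU = {
    "relu1_2": 3,
    "relu2_2": 8,   # the "relu2" content layer the PyTorch examples use
    "relu3_3": 15,  # the "relu3" content layer from the original paper
    "relu4_3": 22,
}
```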
I'll post the code here tomorrow after I clean it up. Maybe I'll also write a blog post to describe this in more detail; is that a good idea?
Meanwhile, here's a spoiler of the new implementation:
Congrats man! Your new results look really good! The batch norm thing is a great finding and you should totally write a blog post about all of this.
I will try to apply your findings to my code base and see if I get anything similar!
Adam optimizer with fit_one_cycle works better than Ranger (which is confusing?)
In the Mish thread it is asserted:
I'm pretty much of the opinion now, after a lot of testing, that OneCycle destroys most of the momentum that smarter optimizers and Mish build up. I continue to see much better results avoiding it, unless it's with vanilla Adam and ReLU; then it seems to work well.
I've noticed the same thing. I stick with Adam/AdamW with 1cycle.
I tried training with RAdam once more. I had to use smaller learning rates compared to one_cycle, but the results were actually a little better…
Clearly I still have a lot of room to experiment here.
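For context, the one-cycle policy being debated here is just a warmup-then-anneal learning-rate schedule. A rough pure-Python sketch of its shape (parameter names loosely follow fastai's fit_one_cycle, but this is not fastai code):

```python
import math

def one_cycle_lr(step, total, lr_max=1e-3, pct_start=0.25, div=25.0, div_final=1e4):
    """Cosine warmup from lr_max/div up to lr_max, then anneal down
    to lr_max/div_final over the remaining steps."""
    warm = int(total * pct_start)
    if step < warm:
        t, lo, hi = step / max(1, warm), lr_max / div, lr_max
    else:
        t, lo, hi = (step - warm) / max(1, total - warm), lr_max, lr_max / div_final
    # Cosine interpolation from lo to hi as t goes 0 -> 1.
    return lo + (hi - lo) * (1 - math.cos(math.pi * t)) / 2
```

The large swing from lr_max/div up to lr_max and back down may be exactly what disrupts the slower-moving statistics that optimizers like Ranger maintain, which would be consistent with the quote above.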
Hey @lgvaz, out of curiosity, how big is your training set?
21837 images, as path.ls() returns in your nb?
Correct. That's only a sample of COCO.
I'm currently looking for a dataset with higher resolution images (like 1024x1024); any advice?
Thanks!
No idea for the high resolution dataset, unfortunately. Sorry.
Currently testing it out now: I have a bs of 1 and an image size of 1024, which uses ~9.28 GB on Colab, so we could probably scale it up higher from there. Epoch time isn't very different either (~9 minutes per epoch). Also note, though, that we only have 1,000 images, and I trained on just the training split, so 800!
This is awesome stuff!!
Do you guys have any idea how the Copista app may be handling HD images?
The app shows progress bar while working, so it might have split the TensorFlow model into parts (there is an earlier post by the app author which says it uses TensorFlow Mobile).