Hello!

I did an implementation of the style transfer paper a few weeks ago: https://github.com/slavivanov/Style-Tranfer

It might be interesting to some, because I relied more on TensorFlow and optimized with Adam and (manually applied) SGD rather than L-BFGS.

Let me know what you think and if you have any questions.

Very nice. A suggestion - that first example with the smoke texture might look better if you used an approach that maintained the color of the content. There’s a link to the paper here on the forum.

I’d be fascinated to see a speed comparison of Adam vs BFGS - that would make for a really interesting blog post, I think…

@jeremy, I did that comparison between different optimization methods for Style transfer!

You can take a look here: https://medium.com/slavv/picking-an-optimizer-for-style-transfer-86e7b8cba84b#.d6vgtyjie

Please let me know what you think.

@slavivanov wow, you are an absolutely exceptional technical writer. Really nice job. And very surprising results - I think they’re worth digging into some more, because this is totally the opposite of what traditional thinking says (i.e. for deterministic functions, line search / Hessian approaches should destroy SGD approaches!)

So my question is: can you get Adam to perform better? It’s already learning quickly, but it’s not learning well enough. I think you might be able to do better with SGD by using some learning rate strategies. Here’s a paper that I’ve never seen discussed, but is IMO really interesting: https://arxiv.org/abs/1506.01186 . Want to try using some learning rate annealing and even cyclical learning rates and see if you can get an SGD approach to beat BFGS? I’d suggest making it a new post, not just editing the one you have, since I think there’s going to be loads of new material - and just introducing the world to cyclical learning rates would be a great step…
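For anyone who hasn’t read that paper, here is a minimal sketch of its "triangular" cyclical schedule. The function name and the default values are illustrative assumptions, not something used in the experiments discussed in this thread.

```python
import math

def triangular_lr(iteration, base_lr=0.001, max_lr=0.006, step_size=2000):
    """Triangular cyclical learning rate.

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, then falls back to base_lr over the next step_size.
    All default values are placeholders chosen for illustration.
    """
    # Which cycle we are in (1-based); one cycle is 2 * step_size iterations.
    cycle = math.floor(1 + iteration / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)
```

The returned value would be fed to whatever optimizer is in use at each iteration; annealing is the simpler case where the rate only ever decreases.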

I tweeted a link to this post and it’s gotten a lot of likes already!

(I also submitted it to reddit ML - hopefully some folks will add their own comments…)

This is an awesome read. Thanks!

Nice. Thanks for the lucid write-up of your experiments.

Thank you for the kind words and the encouragement @jeremy, @brendan, @Surya501. And thanks for the tweet!

My intuition as to why the optimizers behave like this is still weak, although reading seems to help. I did extend the post with 3 more experiments using lower learning rates for GD, RMSprop and Adadelta. I also tried Adam with a lower learning rate, which beat the higher-learning-rate Adam given enough time.

Learning rate cycling and annealing sound like a great follow-up. Any other papers that you might recommend for it?

@jeremy Could you clarify something for me, related to the reddit discussion about stochastic methods.

As I understand it, SGD operates on a part of the training data (a minibatch), or “online” on a single example. Hence the “stochastic”-ness.

Here we have a combination image, which is a parameter to be changed to minimize the loss function. Is this stochastic or not? Or does the “stochastic” property come from the iterative approach to the optimization?

If so, can this problem be approached with non-stochastic Gradient Descent? Is this feasible?

Thank you!

When each minibatch is providing a different set of target data (or a varying loss function) to the optimizer, it’s stochastic. In this case, the optimizer is getting the same target each time, so it’s not stochastic.
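A toy illustration of that distinction, as a rough sketch: the data and function names below are made up, and the one-dimensional squared-error loss only stands in for the real style transfer loss.

```python
import random

# Made-up targets; in style transfer the "target" is fixed (one content
# image, one style image), which corresponds to the full-data case here.
DATA = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

def full_gradient(x):
    """Gradient of mean((x - t)^2) over all targets: identical on every call."""
    return sum(2 * (x - t) for t in DATA) / len(DATA)

def minibatch_gradient(x, rng, batch_size=2):
    """Gradient over a random subset of targets: changes from call to call."""
    batch = rng.sample(DATA, batch_size)
    return sum(2 * (x - t) for t in batch) / batch_size

rng = random.Random(0)
# Collect the distinct gradient values seen at the same point x = 0.
deterministic = {full_gradient(0.0) for _ in range(5)}
stochastic = {minibatch_gradient(0.0, rng) for _ in range(20)}

print(len(deterministic))  # a single value: the objective is deterministic
print(len(stochastic))     # several values: the optimizer sees a noisy gradient
```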

BFGS is a non-stochastic (deterministic) gradient-based method. There’s generally no point using a line search approach for SGD, since you’re wasting a lot of time optimizing a direction that’s only approximate. But for this problem it makes sense to do the line search, since it’s not stochastic.

Does that make more sense? Sorry if I’m not explaining this clearly…

I think I understand. So here I’m applying non-stochastic gradient descent methods. From your comments in reddit, it seems I could have done it in a stochastic way. How would I approach that?

No, I’ve still not explained it properly… Sorry!

So the approaches called “stochastic” gradient descent are approaches that are designed to work well in a stochastic setting (i.e. minibatch / online). They’re not actually stochastic themselves - what it really means is that they don’t do the 2 expensive things:

- Calculate or approximate the Hessian
- Do a line search

…since there’s no point doing something so expensive when you only have a guess as to the correct direction anyway.
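To make the line search part concrete, here is a small sketch comparing a fixed-step update with a backtracking (Armijo) line search on a deterministic toy loss. The loss, the constants and the function names are all assumptions for illustration; BFGS additionally maintains a Hessian approximation, which is not shown here.

```python
def loss(x):
    # Toy deterministic objective with its minimum at x = 3.
    return (x - 3.0) ** 2

def grad(x):
    return 2.0 * (x - 3.0)

def fixed_step(x, lr=0.1):
    """SGD-style update: one gradient evaluation, no extra loss evaluations."""
    return x - lr * grad(x)

def backtracking_step(x, step=1.0, shrink=0.5, c=1e-4):
    """Gradient step whose size is chosen by a backtracking line search.

    The step is halved until the Armijo sufficient-decrease condition
    holds. Every trial costs an extra loss evaluation, which only pays
    off when the gradient direction can be trusted.
    """
    g = grad(x)
    while loss(x - step * g) > loss(x) - c * step * g * g and step > 1e-12:
        step *= shrink
    return x - step * g

x_ls = x_fixed = 0.0
for _ in range(5):
    x_ls = backtracking_step(x_ls)
    x_fixed = fixed_step(x_fixed)

print(x_ls, x_fixed)
```

On this one-dimensional quadratic the line search lands on the minimum almost immediately, which flatters it; the point is only to show where the extra work goes.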

OTOH, when you have a deterministic problem (like the original artistic style paper), it’s worth doing these 2 things, and in general you’d expect such an approach to easily beat any approach that doesn’t do these things.

What your results showed, curiously, is that Adam, which isn’t designed for this kind of problem, was faster to learn than BFGS, which is designed for this kind of problem. So what you really want to be able to show next is either:

- …that actually it was just because you didn’t tune the BFGS params well, and actually it does learn faster than Adam when tuned properly (which would be a somewhat dull result, but nonetheless instructive re how to tune BFGS properly), or
- …that Adam really is faster to learn than BFGS even for (at least some) deterministic problems, which should make us rethink deterministic optimization approaches in general.

In the latter case, can you use cyclical LR or annealing to have Adam beat BFGS in later epochs too? And if you can’t, it seems the best approach would be to do early iterations with Adam, and later ones with BFGS - which would be a nice thing to try out!
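A rough sketch of that switch-over idea, under loud assumptions: the loss is a made-up two-parameter quadratic, the step counts are arbitrary, and gradient descent with a backtracking line search is used as a simple stand-in for BFGS (real BFGS would also build a Hessian approximation).

```python
import math

def loss(p):
    x, y = p
    return x * x + 10.0 * y * y  # toy objective, minimum at (0, 0)

def grad(p):
    x, y = p
    return [2.0 * x, 20.0 * y]

def adam_steps(p, n_steps, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Plain Adam updates with bias correction."""
    m = [0.0] * len(p)
    v = [0.0] * len(p)
    for t in range(1, n_steps + 1):
        g = grad(p)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        p = [pi - lr * mh / (math.sqrt(vh) + eps)
             for pi, mh, vh in zip(p, m_hat, v_hat)]
    return p

def line_search_steps(p, n_steps, shrink=0.5, c=1e-4):
    """Stand-in for the BFGS phase: gradient descent with Armijo backtracking."""
    for _ in range(n_steps):
        g = grad(p)
        g2 = sum(gi * gi for gi in g)
        step = 1.0
        while (loss([pi - step * gi for pi, gi in zip(p, g)])
               > loss(p) - c * step * g2) and step > 1e-12:
            step *= shrink
        p = [pi - step * gi for pi, gi in zip(p, g)]
    return p

start = [5.0, 5.0]
after_adam = adam_steps(start, 30)            # early iterations with Adam
final = line_search_steps(after_adam, 30)     # later iterations with line search

print(loss(start), loss(after_adam), loss(final))
```

Where to make the switch is the interesting experimental question; the fixed count of 30 here is only a placeholder.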

So there are a lot of interesting directions in which to take these initial experiments, IMO.

Got it! Thank you for the detailed writeup.

It’s always nice to have someone explain to me the experiments I’ve done.

I do plan to expand this and test whether Adam with proper LR annealing/cycling can outperform L-BFGS. Your directions are most helpful!

I look forward to it - perhaps when you do, you can add some more background along the same lines as my last post; I’m sure others would find the context useful too…