Style Transfer with Adam and SGD

@slavivanov wow, you are an absolutely exceptional technical writer. Really nice job. And very surprising results - I think they're worth digging into some more, because this is totally the opposite of what traditional thinking says (i.e. for deterministic functions, line search / Hessian approaches should destroy SGD approaches!)

So my question is: can you get Adam to perform better? It's already learning quickly, but it's not reaching as good a final result. I think you might be able to do better with an SGD approach by using some learning rate strategies. Here's a paper that I've never seen discussed, but is IMO really interesting: https://arxiv.org/abs/1506.01186 . Want to try some learning rate annealing, and even cyclical learning rates, and see if you can get an SGD approach to beat BFGS? I'd suggest making it a new post, not just editing the one you have, since I think there's going to be loads of new material - and just introducing the world to cyclical learning rates would be a great step…
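
In case it helps anyone wanting to try this, here's a minimal sketch of the triangular cyclical schedule from that paper, in plain Python/NumPy. The step_size, base_lr, and max_lr values below are placeholders I picked purely for illustration - the paper describes an "LR range test" for choosing sensible bounds:

```python
import numpy as np

def triangular_clr(iteration, step_size=2000, base_lr=1e-4, max_lr=1e-2):
    """Triangular cyclical learning rate (Smith, arXiv:1506.01186).

    The rate climbs linearly from base_lr to max_lr over step_size
    iterations, then descends back, repeating every 2 * step_size steps.
    """
    # Which cycle we're in: each full cycle spans 2 * step_size iterations.
    cycle = np.floor(1 + iteration / (2 * step_size))
    # x sweeps 1 -> 0 -> 1 within a cycle, tracing out the triangle.
    x = np.abs(iteration / step_size - 2 * cycle + 1)
    # Interpolate linearly between the two bounds.
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# Quick sanity check: the rate should rise to max_lr, then fall back.
for it in (0, 1000, 2000, 3000, 4000):
    print(it, triangular_clr(it))
```

To actually use it, you'd compute the rate each iteration and assign it to the optimizer (e.g. via a custom Keras callback that sets the optimizer's learning rate every batch). You get annealing almost for free too: the paper's "triangular2" variant just halves max_lr after each cycle.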