You can pass any symbolic tensors to mse; it's what Keras uses behind the scenes whenever you use 'mse' as your loss function. Like nearly all Keras functions, it works on symbolic tensors, so it doesn't compute anything right away (predict is an example of an exception to this rule).
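A minimal sketch of what that means in practice, using the Keras backend API (the placeholder shapes here are made up):

```python
from keras import metrics, backend as K

# Two hypothetical symbolic tensors, e.g. VGG activations for two images
a = K.placeholder(shape=(None, 14, 14, 512))
b = K.placeholder(shape=(None, 14, 14, 512))

# This returns another symbolic tensor; nothing is computed yet
loss = metrics.mse(a, b)

# Computation only happens when the graph is evaluated, e.g. by
# wrapping it in a backend function and calling it on real data
evaluate = K.function([a, b], [loss])
```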
Why would you expect that? The input to VGG is zero-centered (its preprocessing subtracts the ImageNet channel means), so it can't be in that range.
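For reference, a sketch of the usual VGG preprocessing, assuming a channels-last RGB array; the means are the standard published ImageNet values:

```python
import numpy as np

# Mean pixel values of the ImageNet training set, in RGB order
rn_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)

def preproc(img):
    # Subtract the channel means (zero-centering the input) and
    # reverse RGB -> BGR, the order the original VGG weights expect
    return (img - rn_mean)[..., ::-1]
```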
Not necessary, but it's nice for the content loss and style loss to be the same order of magnitude, so we take the mean. In the paper they divide by 4*n^2*m^2 later, which has the same effect up to a constant factor.
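As a sketch, assuming channels-last activations without a batch dimension, the mean-based version looks something like this:

```python
from keras import backend as K

def gram_matrix(x):
    # Rows are channels, columns are flattened spatial locations
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    return K.dot(features, K.transpose(features))

def style_loss(x, targ):
    # Taking the mean keeps the magnitude comparable to the content loss;
    # the paper's 1/(4*n^2*m^2) factor differs only by a constant scale
    return K.mean(K.square(gram_matrix(x) - gram_matrix(targ)))
```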
See the docs for that function: if you don't pass a gradient function, it has to approximate the gradient by finite differencing, which is terribly slow.
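Here's the difference on a toy quadratic, using scipy's fmin_l_bfgs_b (the loss and gradient functions are just illustrative):

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def loss_fn(x): return np.sum((x - 3.0) ** 2)
def grad_fn(x): return 2.0 * (x - 3.0)

x0 = np.zeros(5)

# With fprime: exact gradients, one function call per step
x, min_val, info = fmin_l_bfgs_b(loss_fn, x0, fprime=grad_fn)

# Without fprime: approx_grad=True falls back to finite differencing,
# costing roughly one extra function call per dimension per gradient
x, min_val, info = fmin_l_bfgs_b(loss_fn, x0, approx_grad=True)
```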
Since this is a generic optimizer, it doesn't know how to deal with anything other than flat vectors, so we have to flatten the arrays we pass to the function (and reshape them back inside it).
The function expects float64 arrays to be passed to it.
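A sketch of the wrapper this implies; shp and f_outputs are hypothetical names standing in for the image shape and for a Keras backend function that returns the loss and gradients:

```python
import numpy as np

shp = (1, 288, 288, 3)  # hypothetical image shape

def eval_loss_and_grads(x_flat):
    # The optimizer only ever sees a flat float64 vector, so reshape it
    # back into an image before evaluating the graph...
    x = x_flat.reshape(shp)
    loss_val, grad_val = f_outputs([x])  # hypothetical: returns [loss, grads]
    # ...then flatten the gradients and cast to float64 on the way out
    return float(loss_val), grad_val.flatten().astype(np.float64)
```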
Try playing around with the standard scipy optimizers on non-deep-learning problems to get a feel for how they work. E.g. http://www.scipy-lectures.org/advanced/mathematical_optimization/
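For instance, a classic warm-up is minimizing the Rosenbrock function; this sketch uses scipy.optimize.minimize, which wraps the same family of optimizers:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock: a classic non-convex test function with its minimum at (1, 1)
def rosen(x):
    return (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2

# Nelder-Mead needs no gradients at all; try swapping in 'BFGS' or
# 'L-BFGS-B' and comparing the number of function evaluations (res.nfev)
res = minimize(rosen, x0=np.array([-1.2, 1.0]), method='Nelder-Mead')
print(res.x, res.nfev)
```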