Derivatives and custom loss functions in Keras

I remember in a few of the lessons @jeremy talked about how we don’t need to worry about the derivative side of our loss function because keras could automatically calculate it for us.

I just want to make sure my understanding there is correct, particularly in the case of custom loss functions. Can we combine arbitrary functions in our loss function and then expect keras to solve for the derivative in the underlying loss function space? Is keras doing so by solving for the local gradient?

I’m currently working on a style transfer problem with a loss function that combines a series of histograms and the SSE of two images as the style component, and I’m wondering whether I need to do anything more than just rewrite the loss function.

The style transfer examples from lessons 8-10 make it seem very straightforward, but I want to be sure that this concept generalizes and that I’m not making incorrect assumptions.

If you construct the loss using the built-in operations then it’s automatic. Otherwise, see:

Given a graph of ops, TensorFlow uses automatic differentiation (backpropagation) to add new ops representing gradients with respect to the existing ops (see Gradient Computation). To make automatic differentiation work for new ops, you must register a gradient function which computes gradients with respect to the ops’ inputs given gradients with respect to the ops’ outputs.
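To make the two cases concrete, here is a minimal sketch assuming TensorFlow 2 with eager execution (the thread itself predates it). The names `combined_loss` and `my_square` are made up for illustration. The first part builds a loss purely from built-in ops and lets TensorFlow derive the gradient; the second supplies a hand-written gradient through `tf.custom_gradient`, which is roughly what "registering a gradient function" looks like from Python.

```python
import tensorflow as tf

# Case 1: a loss built only from built-in ops. No derivative is written by hand.
def combined_loss(generated, target):
    # Sum of squared errors plus a (hypothetical) penalty on the mean intensity gap.
    sse = tf.reduce_sum(tf.square(generated - target))
    mean_gap = tf.square(tf.reduce_mean(generated) - tf.reduce_mean(target))
    return sse + 10.0 * mean_gap

target = tf.constant([[0.0, 1.0], [2.0, 3.0]])
image = tf.Variable(tf.zeros_like(target))

with tf.GradientTape() as tape:
    loss = combined_loss(image, target)
auto_grad = tape.gradient(loss, image)  # same shape as `image`

# Case 2: an op with a hand-written gradient. The inner function maps the
# gradient w.r.t. the op's output to the gradient w.r.t. its input.
@tf.custom_gradient
def my_square(x):
    y = x * x
    def grad(upstream):
        return upstream * 2.0 * x  # d(x^2)/dx = 2x, chained with upstream
    return y, grad

x = tf.constant([1.0, -2.0, 3.0])
with tf.GradientTape() as tape:
    tape.watch(x)  # constants are not tracked unless watched
    out = tf.reduce_sum(my_square(x))
custom_grad = tape.gradient(out, x)
```

In a Keras model you would not normally call `GradientTape` yourself; passing a function like `combined_loss` as the `loss` argument to `model.compile` lets Keras do the same thing during training.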


Thanks @kelvin, that helps.

To further clarify, if I am working in a space that I’ve transformed an image into, for example histograms, and the loss is calculated within that space using ops then I should be fine?

In other words, I could use MSE or an ops-based earthmover’s distance on the histograms and keras will automatically generate the gradient? Or does the transform from image->histogram also factor in?

In the initial style transfer example the loss function is MSE difference of neural net layers, which is a mapping into another space, but they’re also a part of the nn and the nn is doing that mapping.

If the image->histogram mapping isn’t a part of the nn, can we still rely on the ops-based approach to automatically compute the gradient? It makes sense to me that we can, because the gradient we’re solving for is the one within the histogram space, which is the space we want to move in. But I want to be sure that having the mapping to histogram space happen outside of the nn isn’t violating some assumption.

TensorFlow creates a graph of operations to backprop through. If one part of the graph is disconnected (image->histogram), it doesn’t know how to flow the gradients across it.
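A small sketch of what "disconnected" means in practice, assuming TensorFlow 2. A hard histogram (`tf.histogram_fixed_width`) returns integer bin counts, so no gradient reaches the image. The `soft_histogram` function below is a hypothetical stand-in, not something from the course: it assigns each pixel to bins with smooth Gaussian-style weights so the whole image->histogram mapping stays differentiable. The bin count, kernel width and target histogram are arbitrary choices for illustration.

```python
import tensorflow as tf

NUM_BINS = 8
# Deterministic 16x16 "image" with values in [0, 1], skewed towards dark pixels.
image = tf.Variable(tf.reshape(tf.linspace(0.0, 1.0, 256) ** 2, [16, 16]))

# Hard histogram: integer counts, so the path back to the image is cut.
with tf.GradientTape() as tape:
    hard = tf.histogram_fixed_width(image, [0.0, 1.0], nbins=NUM_BINS)
    hard_loss = tf.reduce_sum(tf.square(tf.cast(hard, tf.float32)))
hard_grad = tape.gradient(hard_loss, image)  # None: nothing flows back

def soft_histogram(values, num_bins=NUM_BINS, lo=0.0, hi=1.0):
    """Differentiable approximation: each pixel spreads its mass over nearby bins."""
    width = (hi - lo) / num_bins
    centers = lo + width * (tf.range(num_bins, dtype=tf.float32) + 0.5)
    flat = tf.reshape(values, [-1, 1])                      # [N, 1]
    logits = -tf.square(flat - centers) / (2.0 * width ** 2)  # [N, num_bins]
    weights = tf.nn.softmax(logits, axis=-1)                # each row sums to 1
    return tf.reduce_sum(weights, axis=0)                   # [num_bins]

# Hypothetical target: a flat histogram with the same total mass (256 pixels).
target_hist = tf.fill([NUM_BINS], 32.0)

with tf.GradientTape() as tape:
    soft_loss = tf.reduce_mean(tf.square(soft_histogram(image) - target_hist))
soft_grad = tape.gradient(soft_loss, image)  # a 16x16 tensor of gradients
```

The trade-off is that the soft version only approximates the true counts, and how well it works for style transfer would need to be checked experimentally.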

There is a histogram operation:

You might need to write a custom op for your use case though:


Note that if you do want to wrap a tf function with a keras layer, you can use my ReflectionPadding example (which basically just wraps tf.pad) as an example to work from.
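For readers without the notebook to hand, here is a rough sketch of the idea, assuming `tf.keras` in TensorFlow 2 and channels-last input. This is not the original ReflectionPadding code from the course; the class name `ReflectionPad2D` and its `padding` argument are made up here. The point is only that a layer whose `call` uses a TensorFlow op such as `tf.pad` stays part of the graph, so gradients flow through it.

```python
import tensorflow as tf

class ReflectionPad2D(tf.keras.layers.Layer):
    """Hypothetical layer: reflect-pads height and width of [batch, H, W, C] input."""

    def __init__(self, padding=1, **kwargs):
        super().__init__(**kwargs)
        self.padding = padding

    def call(self, inputs):
        p = self.padding
        # Pad only the two spatial axes; batch and channel axes are untouched.
        return tf.pad(inputs, [[0, 0], [p, p], [p, p], [0, 0]], mode="REFLECT")

x = tf.reshape(tf.range(16, dtype=tf.float32), [1, 4, 4, 1])
layer = ReflectionPad2D(padding=2)

with tf.GradientTape() as tape:
    tape.watch(x)
    padded = layer(x)              # shape [1, 8, 8, 1]
    loss = tf.reduce_sum(padded)
grad = tape.gradient(loss, x)      # defined, because tf.pad has a gradient
```

A `Lambda` layer wrapping the same `tf.pad` call should behave the same way for a one-off; a subclass is just easier to reuse and configure.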


Presumably if I could wrap it in a lambda function then it would be a part of the graph? (That should be doable for histograms, but maybe not for some of the other operations I want to do.)

Does a lambda function then need to be made up entirely of ops? I didn’t realize that was a requirement.

It looks like I’m going to have to bone up on my calculus in order to figure out what the derivatives are for the loss function I’m working on.

Thanks for the detailed answers here, it’s really helped me understand loss functions as they relate to keras and tensorflow much better.

Thanks. That’s a helpful starting point.

I’m confused about how to get a loss function to work properly. For instance, if I have an image segmentation task and the loss is 1 - dice coef, then I think my custom function should be 1/2*(1 - dice coef)^2. Is that correct? I’m using keras.
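As a point of reference for the Dice question, here is a minimal sketch of a soft Dice loss, assuming `tf.keras` in TensorFlow 2 and predictions in [0, 1]. The `smooth` constant is an assumption added to avoid division by zero on empty masks. Since the loss is built from built-in ops, the gradient of plain `1 - dice` is computed automatically; squaring it is not needed for that, though whether a squared variant trains better would be an empirical question.

```python
import tensorflow as tf

def dice_coef(y_true, y_pred, smooth=1.0):
    # Flatten so the overlap is computed over all pixels at once.
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true * y_pred)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + smooth
    )

def dice_loss(y_true, y_pred):
    # 1 - Dice: zero for a perfect match, larger as the overlap shrinks.
    return 1.0 - dice_coef(y_true, y_pred)

# Tiny made-up mask and prediction, just to show the gradient exists.
y_true = tf.constant([[1.0, 0.0], [1.0, 1.0]])
y_pred = tf.Variable([[0.8, 0.2], [0.6, 0.4]])

with tf.GradientTape() as tape:
    loss = dice_loss(y_true, y_pred)
grad = tape.gradient(loss, y_pred)  # same shape as y_pred
```

With a compiled model this would be used as `model.compile(optimizer="adam", loss=dice_loss)`.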