Differentiable loss function

Ekami · September 6, 2017, 3:32pm

In Part 2, lesson 9, Jeremy mentions:

We can optimize a loss function if we know that this loss function is differentiable

Here I ran into this intuitive image:

But I still don’t understand how to picture the shape of losses such as MSE, BCE, DICE… So I was wondering: how can we know (at an intuitive level) whether the loss I’m trying to optimize is, in fact, differentiable?
Thanks

vinvinvin · September 6, 2017, 6:12pm

At the risk of over-simplifying, maybe think of the word ‘differentiable’ as ‘smooth’.

MSE would be ‘smooth’ because it is something squared, i.e. a quadratic.

Losses with ABS, MAX, CLIP, etc. would not be ‘smooth’.
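One rough way to see the difference is to compare the slope just to the left and just to the right of a point. This is only a sketch: the helper name `one_sided_slopes` and the step size `h` are my own choices for illustration. For a squared error the two slopes agree at 0, while for an absolute error they disagree, which is the "kink".

```python
def one_sided_slopes(f, x, h=1e-6):
    """Return (left slope, right slope) of f at x from finite differences."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

def squared_error(e):
    return e ** 2

def absolute_error(e):
    return abs(e)

sq_left, sq_right = one_sided_slopes(squared_error, 0.0)
abs_left, abs_right = one_sided_slopes(absolute_error, 0.0)

# Squared error: both slopes are close to 0, so there is no kink.
print("squared error at 0: ", sq_left, sq_right)
# Absolute error: the slopes are close to -1 and +1, so there is a kink.
print("absolute error at 0:", abs_left, abs_right)
```

Away from 0 the absolute error is a straight line on each side, so the kink is confined to a single point.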

Ekami · September 7, 2017, 12:10pm

Thanks a lot for your answer! I found a golden piece of information about the “smoothness” of a function here.

But another thing which comes to my mind is: are there any rules we can rely on to make sure our function is smooth? For example, for MSE you said:

MSE would be ‘smooth’ because it is something squared, i.e. a quadratic.

Which I assume means that a quadratic/squared loss function is smooth by nature?
Are there any more rules in that direction?
For example, taking the BCE or DICE loss functions, how can I make sure they are “smooth” functions?
Thanks

marcemile · September 7, 2017, 5:13pm

I’m a bit confused about the idea of smoothness. ABS is not smooth but it is a valid loss function (not being differentiable at 0 is not a problem).
https://www.tensorflow.org/api_docs/python/tf/losses/absolute_difference

machinethink · September 7, 2017, 5:15pm

During training, the loss function compares the output of the neural network (the prediction) with the target label (the ground truth). The output of the neural network consists of all the calculations done by each of the neural network’s layers. So if each of these calculations is differentiable, then the loss function will also be differentiable.

Technically speaking, if your neural network uses a ReLU activation, it contains calculations that are not differentiable. However, for something like a ReLU it’s possible to use an approximate derivative.
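As a minimal sketch of what that can look like: pick a value for the slope at the kink and carry on with the chain rule as usual. The choice of 0 at x == 0 below is an assumed convention for this example, and the toy values for `w`, `x`, `y` and `lr` are hypothetical.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Assumed convention: treat the slope at the kink (x == 0) as 0.
    return 1.0 if x > 0 else 0.0

def loss_and_grad(w, x, y):
    """Squared error of relu(w * x) against y, and its gradient w.r.t. w."""
    pre = w * x
    err = relu(pre) - y
    grad = 2.0 * err * relu_grad(pre) * x  # chain rule through the ReLU
    return err ** 2, grad

w, x, y, lr = 0.5, 2.0, 3.0, 0.05  # hypothetical toy values
for step in range(50):
    loss, grad = loss_and_grad(w, x, y)
    w -= lr * grad  # plain gradient descent step

final_loss, _ = loss_and_grad(w, x, y)
print("w:", w, "final loss:", final_loss)
```

Gradient descent still drives the loss down here, because the input only sits exactly on the kink in rare cases.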

Basically, if you just use the activation functions and loss functions that come with Keras, TensorFlow, etc, you’re good. But if you make your own loss function or activation function (or a custom layer type), then you’ll have to make sure that each of the parts of these calculations is differentiable.

Ekami · September 7, 2017, 5:30pm

I’m even more confused now lol. On one hand @marcemile shows a loss function which is not smooth (but which is a valid loss function from tensorflow) and on the other hand we can, in fact use functions such as ReLU which are not differentiable.
So 2 questions:

Can we mix up “smooth” and “differentiable”, does it mean the same thing/target the same kind of function?
Given that non-differentiable “calculations” such as ReLU can have an approximate derivative (and thus be used as loss functions), what other criterion should we rely on to be 100% sure our custom function can be used as a loss function?
Thanks.

vinvinvin · September 7, 2017, 10:39pm

I think you’re getting caught up in nuance.

If you have both:

a function that takes some input and produces some result
a loss measure which evaluates that result in the context of the input

then you should be alright.

A function like softmax of (a,b,c), which is ‘smooth’, would be preferred over a function like max(a,b,c). This is because the partial derivatives of softmax(a,b,c) are non-zero with respect to all inputs, while for max(a,b,c) the derivative is non-zero only for whichever of a, b, or c is the largest and zero for the others. Where the derivative is zero, back-propagation passes no gradient, so those inputs get no update.
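To make the comparison concrete, here is a small sketch. One assumption to flag: since softmax returns a vector while max returns a single number, I use log-sum-exp as the smooth stand-in for max. Its gradient is exactly the softmax vector. The function names are my own.

```python
import math

def max_grad(xs):
    """Gradient of max(xs): 1 for the largest entry, 0 for the rest (ties go to the first)."""
    winner = xs.index(max(xs))
    return [1.0 if i == winner else 0.0 for i in range(len(xs))]

def softmax(xs):
    """Softmax, which is also the gradient of log-sum-exp."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(xs):
    """Smooth stand-in for max(xs)."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

xs = [1.0, 2.0, 3.0]
hard = max_grad(xs)  # only the largest input receives gradient
soft = softmax(xs)   # every input receives some gradient
print("grad of max:      ", hard)
print("grad of logsumexp:", [round(g, 3) for g in soft])
```

With the hard max, two of the three inputs get a gradient of exactly zero, while the smooth version spreads the gradient over all of them.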

What is the specific loss function you are considering?

Ekami · September 9, 2017, 10:33am

Ok I see, yeah I probably mixed up a few things in my head. It’s clearer now, thanks.
I don’t have any particular loss function in mind, I just wanted to know how to make sure your loss function will “work” if you were to write a custom one.

gnak · September 10, 2017, 1:38pm

To show that a function is differentiable on an interval, you need to show that the limit in the definition of the derivative exists at every point of the interval.

A function is often a combination of other functions. Example: L(x) = f(x) + g(x)
where f(x) = (x-1)^2 and g(x) = x/2, so L(x) = (x-1)^2 + x/2. There are theorems in mathematics that state that if f(x) and g(x) are differentiable, then so is L(x); the same holds for products and compositions of differentiable functions.

So one can analytically show that a given function is differentiable by using the above facts. This is a common task in a calculus / analysis class.
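For the example above, the sum rule gives L'(x) = f'(x) + g'(x) = 2(x-1) + 1/2. A quick numerical sanity check, as a sketch (the helper `numeric_derivative` and the sample points are my own choices):

```python
def f(x):
    return (x - 1) ** 2

def g(x):
    return x / 2

def L(x):
    return f(x) + g(x)

def L_prime(x):
    # Sum rule: derivative of f plus derivative of g.
    return 2 * (x - 1) + 0.5

def numeric_derivative(func, x, h=1e-5):
    """Central finite-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

for x in [-2.0, 0.0, 1.0, 3.5]:
    print(x, L_prime(x), numeric_derivative(L, x))
```

The analytic and numeric columns should agree to several decimal places at every sample point.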

To get a good sense of whether a function is differentiable, I think it is a good idea to plot the function and study the shape of the plot. If the function is a multivariable function that maps to the real line, try finding its single variable version (or just two variables) and plot that instead.
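As a sketch of the single-variable idea applied to BCE and Dice, the snippet below varies one predicted probability p and holds everything else fixed. Assumptions to flag: the Dice version is one common "soft Dice" form with a smoothing term, the two-pixel setup and the fixed value 0.2 are hypothetical, and it prints a table instead of drawing a plot so that it needs no plotting library.

```python
import math

def bce(p, y=1.0):
    """Binary cross-entropy for one prediction p in (0, 1) and target y."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def soft_dice(p, other_pred=0.2, targets=(1.0, 0.0), smooth=1.0):
    """Soft Dice loss on two pixels, varying only the first prediction p."""
    preds = (p, other_pred)
    intersection = sum(pi * ti for pi, ti in zip(preds, targets))
    return 1.0 - (2.0 * intersection + smooth) / (sum(preds) + sum(targets) + smooth)

def slope_jump(f, x, h=1e-5):
    """Difference between the right and left finite-difference slopes at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return abs(right - left)

grid = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
for p in grid:
    print(f"p={p:.1f}  bce={bce(p):.4f}  dice={soft_dice(p):.4f}")

# A large jump between left and right slopes would indicate a kink.
max_jump_bce = max(slope_jump(bce, p) for p in grid)
max_jump_dice = max(slope_jump(soft_dice, p) for p in grid)
print("largest slope jump:", max_jump_bce, max_jump_dice)
```

Both curves fall steadily as p approaches the target, and the slope jumps stay tiny on this grid, which is what a kink-free curve looks like numerically. The same check on abs at 0 gives a jump of about 2.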

Please note that differentiability alone does not guarantee that you reach a global min/max. Properties like convexity/concavity matter. You can also get a good idea of these properties from plots.
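A small sketch of why this matters: the function below is differentiable everywhere but not convex, so plain gradient descent ends up in different minima depending on where it starts. The function, the starting points and the learning rate are hypothetical choices made for illustration.

```python
def f(x):
    # Differentiable everywhere, but with two separate valleys.
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=500):
    """Plain gradient descent on f starting from x."""
    for _ in range(steps):
        x -= lr * f_prime(x)
    return x

x_right = gradient_descent(2.0)   # starts in the right-hand valley
x_left = gradient_descent(-2.0)   # starts in the left-hand valley

print("from  2.0:", x_right, f(x_right))
print("from -2.0:", x_left, f(x_left))
```

Both runs stop where the derivative is zero, but only the run from the left reaches the lower of the two minima.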

Plotting functions without a computer is also a skill; consult a calculus textbook if you want to plot without a computer.