Why does softmax use e?

I’m going through fastbook chapter 5 (pet breeds). I understand the purpose of the softmax function, but I don’t understand why the number e is used:

from torch import exp  # imported here so the snippet runs on its own

def softmax(x):
    # exponentiate each activation, then divide by the per-row sum so each row sums to 1
    return exp(x) / exp(x).sum(dim=1, keepdim=True)

I know e has some special mathematical properties, but I’m curious: do those come into play here, or was it just chosen because it tends to work well in practice?

I rewrote the function using an arbitrary number (5) instead of e and it still normalizes the activations so they sum up to 1:

import torch

def softmax(x):
    # same normalization, but with 5 as the base instead of e
    return torch.pow(5, x) / torch.pow(5, x).sum(dim=1, keepdim=True)
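
A quick check in a notebook (the activations here are just made up for illustration):

import torch
from torch import exp

x = torch.tensor([[1.0, 2.0, 3.0]])  # one sample, three made-up activations

probs_e = exp(x) / exp(x).sum(dim=1, keepdim=True)
probs_5 = torch.pow(5, x) / torch.pow(5, x).sum(dim=1, keepdim=True)

print(probs_e, probs_e.sum())  # tensor([[0.0900, 0.2447, 0.6652]]), row sums to 1
print(probs_5, probs_5.sum())  # tensor([[0.0323, 0.1613, 0.8065]]), also sums to 1, but different values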

exp(x) always outputs a strictly positive number, no matter the sign of x. That’s a useful math trick you see used all over the place.

(Softmax wouldn’t work if some of the exponentiated values could be negative or zero: the sum in the denominator could end up as zero, and the outputs wouldn’t be valid probabilities.)

You can use an exponential with a different base, as long as it’s greater than 1. No reason why that wouldn’t work. Since 5**x is the same as e**(x * ln 5), it amounts to scaling the activations by ln 5 before the usual softmax, so it does change the predicted probabilities a little (the distribution becomes a bit more peaked).
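
Here’s a rough sketch of that equivalence (x is just an arbitrary batch of activations, and F.softmax is PyTorch’s built-in softmax):

import math
import torch
import torch.nn.functional as F

x = torch.tensor([[1.0, 2.0, 3.0]])  # arbitrary activations

base5 = torch.pow(5, x) / torch.pow(5, x).sum(dim=1, keepdim=True)
scaled = F.softmax(x * math.log(5), dim=1)  # standard softmax on activations scaled by ln 5

print(torch.allclose(base5, scaled))  # True: a different base is just a rescaling of the activations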

Grant Sanderson has an awesome explanation that makes sense of why e is so revered: https://youtu.be/m2MIpDrF7Es

His whole series on Calculus is worth checking out – along with basically everything he makes :slight_smile:


Additionally, to train your model you will be computing gradients (derivatives), and exp(x) has the handy property that it is its own derivative, which makes those calculations somewhat easier.

∂ exp(x) / ∂x = exp(x)
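
You can check that property with autograd; a minimal example (the values of x are arbitrary):

import torch

x = torch.tensor([0.5, 1.0, 2.0], requires_grad=True)  # arbitrary inputs
y = torch.exp(x).sum()
y.backward()

print(torch.allclose(x.grad, torch.exp(x)))  # True: the gradient of exp(x) is exp(x) itself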

Thanks for the replies!

I had a feeling that it might have something to do with calculus. I’ll check out the 3blue1brown video on e to try and get a better understanding of this magical number (I’ve watched many of Grant’s videos and I agree, they’re amazing!)
