I know e has some special mathematical properties, but I’m curious if those relate here, or was it just chosen because it tends to work well in practice?

I rewrote the function using an arbitrary number (5) instead of e and it still normalizes the activations so they sum up to 1:

exp(x) always outputs a number strictly greater than 0, whatever the sign of x. That’s a useful math trick you see used all over the place.

(Softmax wouldn’t work if some of the terms were negative: the sum in the denominator could end up zero or negative, and the outputs would no longer be valid probabilities.)
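To make that concrete, here is a small sketch comparing plain "divide by the sum" with softmax on activations that include a negative value. The function names `naive_normalize` and `softmax` and the sample activations are made up for illustration, not taken from any library.

```python
import math

def naive_normalize(xs):
    # Hypothetical helper: divide each activation by the raw sum, no exponential.
    total = sum(xs)
    return [x / total for x in xs]

def softmax(xs):
    # Exponentiate first so every term is strictly positive, then normalize.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

activations = [2.0, -1.0, 0.0]           # made-up example values
naive = naive_normalize(activations)     # one entry comes out negative
probs = softmax(activations)             # all entries positive, summing to 1

print(naive)
print(probs)
```

With activations like `[1.0, -1.0]` the raw sum is 0, so `naive_normalize` raises a `ZeroDivisionError`, while `softmax` handles them fine.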

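You can use an exponential with a different base, as long as it’s positive and not equal to 1 (a base of 1 would turn every term into 1). With a base greater than 1 the ordering is preserved, so the largest activation still gets the highest probability; a base between 0 and 1 would flip the ordering. Changing the base does change the predicted probabilities, though: using base b is the same as multiplying every activation by ln(b) before the usual softmax, so a larger base gives a more peaked distribution.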
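Here is a rough sketch of a softmax with a configurable base, to show how the probabilities shift. The helper `softmax_base` and the sample activations are hypothetical, and this version skips the usual numerical-stability tricks to keep it short.

```python
import math

def softmax_base(xs, base=math.e):
    # Hypothetical helper: softmax using `base` in place of e.
    if base <= 0 or base == 1:
        raise ValueError("base must be positive and not equal to 1")
    powers = [base ** x for x in xs]
    total = sum(powers)
    return [p / total for p in powers]

activations = [1.0, 2.0, 3.0]            # made-up example values

probs_e = softmax_base(activations)            # ordinary softmax (base e)
probs_5 = softmax_base(activations, base=5)    # base 5 instead of e

# Base 5 is equivalent to scaling the inputs by ln(5) and using base e.
probs_scaled = softmax_base([x * math.log(5) for x in activations])

print(probs_e)
print(probs_5)
print(probs_scaled)
```

Both versions sum to 1, but base 5 puts more weight on the largest activation than base e does, and `probs_5` matches `probs_scaled` up to floating-point error.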

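Additionally, to train your model, you will be calculating the gradient (or derivative), where exp(x) has the handy property that its derivative is exp(x) itself. For any other base b, the derivative of b^x is b^x * ln(b), so an extra constant factor shows up. Using e makes the calculations somewhat easier.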
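If you want to check the derivative property numerically, a quick sketch like this works. The `numerical_derivative` helper, the step size, and the test point are arbitrary choices for illustration.

```python
import math

def numerical_derivative(f, x, h=1e-6):
    # Central-difference approximation of f'(x); h is an arbitrary small step.
    return (f(x + h) - f(x - h)) / (2 * h)

x = 1.5  # arbitrary test point

d_exp = numerical_derivative(math.exp, x)            # approximates exp(x)
d_pow5 = numerical_derivative(lambda t: 5 ** t, x)   # approximates 5**x * ln(5)

print(d_exp, math.exp(x))
print(d_pow5, 5 ** x * math.log(5))
```

The first pair of numbers should agree closely, showing that exp is its own derivative, while base 5 picks up the extra ln(5) factor.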

I had a feeling that it might have something to do with calculus. I’ll check out the 3Blue1Brown video on e to try and get a better understanding of this magical number. (I’ve watched many of Grant’s videos and I agree, they’re amazing!)