Gradients for softmax are tiny [Solved]

I don’t know what I was thinking.

I stared at it long enough, and now I see it:

\left[\begin{matrix}\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{0}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\end{matrix}\right]
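For anyone following along, here is a minimal SymPy sketch (my own, not from the original post) that reproduces that first row of the softmax Jacobian symbolically:

```python
import sympy as sp

# four inputs x0..x3; softmax of the first component
x = sp.symbols('x0:4')
denom = sum(sp.exp(xi) for xi in x)
s0 = sp.exp(x[0]) / denom

# partial derivatives of softmax_0 with respect to each input,
# i.e. the first row of the softmax Jacobian
row = sp.Matrix([[sp.diff(s0, xi) for xi in x]])
sp.pprint(row)
```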

All the terms except the first one collapse to

- \sum_{i \neq 0} softmax_{x_0} * softmax_{x_i} = - softmax_{x_0} * (1 - softmax_{x_0})

and the first term is simply

softmax_{x_0} - softmax_{x_0}^2 = softmax_{x_0} * (1 - softmax_{x_0})

so the final derivative (after grouping the partials) simplifies to:

softmax - softmax^2
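To double-check the simplification, here is a small PyTorch snippet (my own sanity check, assuming PyTorch is available) comparing it against the full Jacobian from autograd:

```python
import torch

def softmax(x):
    e = (x - x.max()).exp()   # subtract the max for numerical stability
    return e / e.sum()

x = torch.randn(4, dtype=torch.double)
s = softmax(x)

# full 4x4 Jacobian of softmax computed by autograd
J = torch.autograd.functional.jacobian(softmax, x)

# the diagonal entries are softmax - softmax^2
print(torch.allclose(J.diagonal(), s - s**2))                 # True
# the full Jacobian is diag(softmax) - outer(softmax, softmax)
print(torch.allclose(J, torch.diag(s) - torch.outer(s, s)))   # True
```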

This expression is numerically stable and gives gradient values that look correct.
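For context, this is roughly how that expression could slot into a from-scratch backward pass. The class layout and the `.g` convention below are my own sketch in the style of manual-backprop exercises, not the author's actual implementation; note that it keeps only the diagonal of the Jacobian:

```python
import torch

class Softmax:
    def __call__(self, inp):
        self.inp = inp
        # stable forward pass: subtract the row-wise max before exponentiating
        e = (inp - inp.max(dim=-1, keepdim=True).values).exp()
        self.out = e / e.sum(dim=-1, keepdim=True)
        return self.out

    def backward(self):
        # element-wise (diagonal-only) gradient: softmax - softmax^2,
        # reused from the already-computed forward output
        self.inp.g = self.out.g * (self.out - self.out**2)

x = torch.randn(2, 4)
sm = Softmax()
out = sm(x)
out.g = torch.ones_like(out)   # pretend upstream gradient
sm.backward()
print(x.g.shape)               # torch.Size([2, 4])
```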

Thanks for your guidance. The video and your implementation helped me grasp the things I needed.
