I don’t know what I was thinking. I stared at it long enough, and now I see:
\left[\begin{matrix}\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{0}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{1}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{2}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} & - \frac{e^{x_{0}} e^{x_{3}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}}\end{matrix}\right]
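For reference, a minimal SymPy sketch that reproduces this row symbolically (the symbol names x0..x3 and the call structure are my own assumptions, not necessarily how the matrix above was originally generated):

```python
import sympy as sp

# Four input symbols and the softmax of the first one
x = sp.symbols('x0 x1 x2 x3')
denom = sum(sp.exp(xi) for xi in x)
s0 = sp.exp(x[0]) / denom

# Row of the Jacobian: partial derivatives of s0 w.r.t. each input
row = sp.Matrix([[sp.diff(s0, xi) for xi in x]])
print(row)  # matches the row written out above
```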
Every entry except the first one is of the form
-softmax_{x_0} * softmax_{x_i}
and the two partial terms in the first entry combine to
softmax_{x_0} - softmax_{x_0}^2
so the final derivative of each output with respect to its own input (after summing up the partials from the numerator and the denominator) is:
softmax - softmax^2
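Spelling out that simplification of the first entry (writing s_0 as my shorthand for softmax_{x_0}):
\frac{e^{x_{0}}}{e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}} - \frac{e^{2 x_{0}}}{\left(e^{x_{0}} + e^{x_{1}} + e^{x_{2}} + e^{x_{3}}\right)^{2}} = s_{0} - s_{0}^{2} = s_{0}\left(1 - s_{0}\right)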
This form is numerically stable and gives gradient values that seem correct.
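A small NumPy sketch of the kind of check I mean (the max-subtraction trick for stability plus a finite-difference comparison of the diagonal term; the function name and test vector are mine):

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0, 4.0])
s = softmax(x)

# Analytic diagonal term: d softmax(x_0) / d x_0 = s_0 - s_0^2
analytic = s[0] - s[0] ** 2

# Finite-difference estimate of the same partial derivative
eps = 1e-6
x_plus, x_minus = x.copy(), x.copy()
x_plus[0] += eps
x_minus[0] -= eps
numeric = (softmax(x_plus)[0] - softmax(x_minus)[0]) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```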
Thanks for your guide. The video and your implementation helped me grasp what I needed.