Wow, this is just terrific, once again! You have a real knack for technical writing. I noticed a couple of little issues, FYI:
“The second way to go about it, and, in fact, the easiest to implement, is to approximate the derivative with the following formula we know from calculus:”
I don’t see a formula here. I assume you meant the finite-difference approximation, f′(x) ≈ (f(x + h) − f(x)) / h, for some small h?
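If it helps, here’s a minimal sketch of that idea in plain Python (the function name, the step size h, and the example function are just placeholders I made up):

```python
def numerical_derivative(f, x, h=1e-5):
    """Approximate f'(x) with the finite-difference formula (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# Example: the derivative of x**2 at x = 3 is 6.
print(numerical_derivative(lambda x: x ** 2, 3.0))  # ~6.00001
```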
“The most fast method for calculating would be analytically find the derivative for each neural network architecture.”
This is what backprop does, right? Otherwise, I’m not sure what distinction you’re making: backprop simply calculates the derivative using the chain rule. I think the more important thing to mention here is that libraries like PyTorch can calculate the analytical derivative for arbitrary Python code, so you don’t have to worry about doing it yourself.
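For example, a minimal sketch using PyTorch’s autograd (the function here is an arbitrary one I picked for illustration):

```python
import torch

# PyTorch records the operations on x in a computation graph...
x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x   # an arbitrary example function
y.backward()         # ...and differentiates through it with the chain rule
print(x.grad)        # tensor(8.) -- matches the analytical derivative 2x + 2 at x = 3
```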
“Some of them are described in my other post ‘Improving the way we work with learning rate’.”
You should link to that post here. (In general, I think your article could use a few more hyperlinks, BTW.)
Finally, I spotted at least one spelling mistake, so perhaps run a spell-checker over it?