That’s the magical incantation we use to tell PyTorch that we want to calculate gradients with respect to that variable at that value. It essentially tags the variable, so PyTorch will remember to keep track of how to compute gradients for the subsequent calculations you perform on it.

However, I am unclear on what the function does exactly. For instance, why do we need to tell PyTorch to track gradients before we actually calculate them (or, put another way, why do we need to call this function before we call backward())?

My interpretation from further research is that we want to calculate the gradient at every weight in the network. I understand this is needed for the backpropagation algorithm to compute gradients automatically, and that it relies on a computational graph. However, it’s unclear to me how exactly this works and what the computational graph is.

I’m not sure I can do a better job than the book at explaining this, but I’ll give it a shot. A model is made up of tensors. Inputs are passed through the model to generate an output; “passed through” means that mathematical operations are applied to the input and the model weights to produce an output. During training, the model’s output, along with the target (or answer), is passed to a loss function to calculate how close the output was to the target. The model then calculates the gradients of its weights during the backward pass. The gradients indicate whether each weight should be increased or decreased, and roughly by how much, to bring the model’s output closer to the target.

requires_grad_ tells PyTorch whether or not these particular tensors should be included in the backprop calculation. Generally all of the model weights would have requires_grad set to True, but if this were an image model, for example, the input tensor holding the raw pixel values would have requires_grad set to False, since it doesn’t make sense to change your input in that case.
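The loop described above can be sketched in a few lines. This is a toy example, not any particular model: a single tracked weight, an untracked input, and one forward/backward pass.

```python
import torch

# One tracked "model weight" and one untracked input (hypothetical values).
w = torch.tensor([3.0], requires_grad=True)   # weight: included in backprop
x = torch.tensor([2.0])                       # input: not tracked
target = torch.tensor([10.0])

output = w * x                    # forward pass
loss = (output - target) ** 2     # loss: how far the output is from the target
loss.backward()                   # backward pass: fills in w.grad

# d(loss)/dw = 2 * (w*x - target) * x = 2 * (6 - 10) * 2 = -16
print(w.grad)    # tensor([-16.])
print(x.grad)    # None: x does not require grad, so no gradient is stored
```

The sign and magnitude of w.grad are exactly the “which direction, and roughly how much” signal described above.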

If you really want to dive into this from the bottom up rather than the top down, then you might want to check out Andrej Karpathy’s deep-dive video on this topic, where he creates his own autograd engine from scratch for educational purposes.

Change if autograd should record operations on this tensor: sets this tensor’s requires_grad attribute in-place. Returns this tensor.

Conceptually, autograd computes Δy/Δw: how much the output changes when a particular weight is changed by a small amount. This could be done in a naive manner by finite differencing: change a weight by a small amount, then run the full forward pass on the network to see how much the output has changed. Doing this over thousands or millions or billions of weights is very inefficient. Instead, it is possible to avoid a lot of duplicated work by remembering the gradients at each layer and then using the chain rule to calculate the gradient in the previous layer, and so forth. To do this, PyTorch has a mechanism (a computation graph) which tracks what operations were done at each step on the tensors. This tracking is optional, since it has some overhead; requires_grad tells PyTorch “yes, I want to track the gradients on this tensor”.
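You can see the two approaches agree on a tiny example. Below is a sketch (with made-up values) comparing the naive finite-difference estimate described above against autograd’s chain-rule result for f(w) = w**3 at w = 2, where the exact derivative is 3 * w**2 = 12.

```python
import torch

# Autograd: build the graph, then backpropagate through it.
w = torch.tensor(2.0, requires_grad=True)
y = w ** 3
y.backward()
autograd_grad = w.grad.item()    # exact: 3 * 2**2 = 12.0

# Naive finite differencing: nudge the weight, rerun the forward pass.
eps = 1e-4
with torch.no_grad():
    fd_grad = ((w + eps) ** 3 - (w - eps) ** 3) / (2 * eps)

print(autograd_grad)     # 12.0
print(fd_grad.item())    # approximately 12.0, up to floating-point error
```

With one weight the costs are similar, but finite differencing needs one extra forward pass per weight, while backpropagation gets every gradient from a single backward pass over the graph.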

import torch

class ChainRuleDifferentiator:
    def __init__(self, *funcs):
        """
        Initialize the ChainRuleDifferentiator class with a sequence of functions.
        :param funcs: A sequence of functions representing the chain.
        """
        self.funcs = funcs

    def derivative(self, x, order=1):
        """
        Compute the derivative of the chain of functions at a given point.
        :param x: The point at which the derivative is calculated.
        :param order: The order of the derivative (1 for first derivative, 2 for second, etc.).
        :return: The derivative value at the given point.
        """
        x = torch.tensor([x], requires_grad=True)
        result = x
        for func in self.funcs:
            result = func(result)
        # Compute derivatives by repeatedly differentiating with respect to x
        for _ in range(order):
            result = torch.autograd.grad(result, x, create_graph=True)[0]
        return result.item()

# Define the functions for the chain
# (try with simpler functions, because the outputs are really high)
f1 = lambda x: 24 * x ** 2  # y = 24*x**2
f2 = lambda x: 8 * x ** 3   # z = 8*y**3
f3 = lambda x: 2 * x ** 4   # func = 2*z**4

# Create an instance of the ChainRuleDifferentiator with the functions
differentiator = ChainRuleDifferentiator(f1, f2, f3)

# Calculate the first and second derivatives at x = 2
x_value = 2.0  # differentiation requires float values, hence '2.0'
first_derivative = differentiator.derivative(x_value, order=1)
second_derivative = differentiator.derivative(x_value, order=2)
print("First Derivative at x =", x_value, ":", first_derivative)
print("Second Derivative at x =", x_value, ":", second_derivative)

As you’ve already mentioned, requires_grad_() is an in-place operation. It’s convenient when you already have a tensor created, but realize during programming that this tensor needs gradient tracking. In other words, it’s about setting requires_grad to True on an existing tensor.
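A minimal sketch of that in-place use, with made-up values: the tensor is created without tracking, then switched on later with requires_grad_().

```python
import torch

t = torch.ones(3)            # created without gradient tracking
print(t.requires_grad)       # False

t.requires_grad_()           # in-place: same tensor, now tracked
print(t.requires_grad)       # True

loss = (t * 2).sum()         # operations on t are now recorded
loss.backward()
print(t.grad)                # tensor([2., 2., 2.])
```

Note the trailing underscore: following PyTorch’s naming convention, requires_grad_() mutates the tensor in place (and returns it), whereas requires_grad is the attribute you read.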

Karpathy is awesome. Chunk, chunk and make more and so on …