
Chapter 9: Autograd and Differentiation

“If tensors are the muscles, autograd is the nervous system.”


9.1 What Is Autograd?

Autograd is PyTorch’s automatic differentiation engine.
It builds a computation graph behind the scenes as you operate on tensors with requires_grad=True. When you call .backward(), it traces back through that graph to compute gradients.

This is what powers training in PyTorch — from basic logistic regression to massive transformers.


9.2 Enabling Gradient Tracking

To start tracking gradients:

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)

Now, any operation on x will be recorded:

y = x ** 2 + 3
z = y.sum()
z.backward()
print(x.grad)  # Output: tensor([4., 6.])

Here, ∂z/∂x = 2x, which gives [2×2, 2×3] = [4., 6.].
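
As a side note, torch.autograd.grad computes the same derivatives functionally, returning them instead of writing into .grad. A minimal sketch reproducing the example above:

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
z = (x ** 2 + 3).sum()

# Returns a tuple of gradients, one per input tensor, without touching x.grad
(grad_x,) = torch.autograd.grad(z, x)
print(grad_x)  # tensor([4., 6.])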

9.3 Calling .backward()

Once you have a scalar result (like loss), call:

loss = model(input).sum()
loss.backward()

PyTorch will then (see the sketch after this list):

  • Walk backward through the computation graph

  • Calculate gradients for every tensor with requires_grad=True

  • Store gradients in the .grad attribute
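
To make this concrete, here is a minimal sketch with a toy nn.Linear model (the model and input shapes are illustrative assumptions, not fixed by the text above):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)         # toy model; its weight and bias have requires_grad=True
input = torch.randn(3, 4)       # hypothetical batch of 3 samples

loss = model(input).sum()       # reduce to a scalar
loss.backward()                 # walk backward through the graph

print(model.weight.grad.shape)  # torch.Size([2, 4]), stored in .grad
print(model.bias.grad.shape)    # torch.Size([2])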

9.4 Stopping Gradient Tracking

When you want to freeze parts of the model (e.g., during evaluation or feature extraction), use:

Method 1: with torch.no_grad()

with torch.no_grad():
    y = model(x)

Method 2: detach()

x = torch.tensor([1.0], requires_grad=True)
y = x * 2
z = y.detach()  # z does not track gradients
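
When the goal is to freeze part of a model (rather than a single tensor), a common pattern is to switch off requires_grad on those parameters directly. A minimal sketch, using a stand-in nn.Linear as a hypothetical pretrained backbone:

import torch
import torch.nn as nn

backbone = nn.Linear(8, 4)   # stand-in for a pretrained feature extractor
head = nn.Linear(4, 2)       # new layer we actually want to train

# Freeze the backbone: its parameters will receive no gradients
for p in backbone.parameters():
    p.requires_grad_(False)

x = torch.randn(1, 8)
loss = head(backbone(x)).sum()
loss.backward()

print(backbone.weight.grad)    # None, because the backbone is frozen
print(head.weight.grad.shape)  # torch.Size([2, 4])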

9.5 Checking the Computation Graph

You can inspect how PyTorch built the graph:

x = torch.tensor([2.0], requires_grad=True)
y = x * x
print(y.grad_fn)  # Output: <MulBackward0>

Every operation creates a Function object like AddBackward0, MulBackward0, etc. This is how PyTorch knows how to differentiate each step.
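
You can also walk further back through the graph via grad_fn.next_functions, which links each backward node to the nodes that produced its inputs. A small sketch (the exact printed representation varies by PyTorch version):

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x * x
z = y + 1

print(z.grad_fn)                 # an AddBackward0 node
print(z.grad_fn.next_functions)  # references the MulBackward0 node that produced y
print(y.grad_fn.next_functions)  # references AccumulateGrad nodes for the leaf tensor x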

9.6 torch.autograd.Function: Custom Gradients

If you want to write your own forward/backward logic (like building a custom layer or operator):

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input ** 2

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        return grad_output * 2 * input

Use it like:

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.backward()
print(x.grad)  # Should be 6.0

📌 Advanced, but useful for low-level ops and optimization research.
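
One way to sanity-check a custom Function is torch.autograd.gradcheck, which compares your backward against numerical finite differences; it expects double-precision inputs. A minimal sketch reusing the Square class above:

import torch

x = torch.tensor([3.0], dtype=torch.float64, requires_grad=True)

# Perturbs x numerically and compares against Square.backward; prints True if they match
print(torch.autograd.gradcheck(Square.apply, (x,)))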

9.7 Common Mistakes

  • Calling .backward() on a non-scalar output. Fix: pass a gradient argument, e.g. z.backward(torch.ones_like(z)).

  • Using .data instead of .detach(). Fix: use .detach(); .data bypasses autograd and is risky.

  • In-place operations corrupting the graph. Fix: avoid x += ... on tensors that autograd is tracking.

  • Forgetting to zero out .grad. Fix: always zero gradients before the next .backward():

optimizer.zero_grad()  # Or model.zero_grad()
loss.backward()
optimizer.step()
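
The first mistake in the list deserves a concrete example: .backward() on a non-scalar tensor needs an explicit gradient argument of the same shape (the vector in the vector-Jacobian product). A minimal sketch:

import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = x * 2                       # z is a vector, not a scalar

# z.backward() alone would raise an error; pass an all-ones vector instead
z.backward(torch.ones_like(z))
print(x.grad)  # tensor([2., 2., 2.])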

9.8 Gradient Accumulation

By default, gradients accumulate:

x = torch.tensor([2.0], requires_grad=True)
y = x ** 2
y.backward(retain_graph=True)  # x.grad is now tensor([4.])
y.backward()                   # the second call adds another 4
print(x.grad)  # tensor([8.]), i.e. 4 + 4

Use x.grad.zero_() or optimizer.zero_grad() to prevent this.
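
Accumulation is also used deliberately to simulate a larger batch: call .backward() on several mini-batches and only step the optimizer afterwards. A minimal sketch, assuming a hypothetical model, optimizer, loss function, and an accumulation window of 4:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
accum_steps = 4                               # mini-batches to accumulate per update

optimizer.zero_grad()
for step in range(8):                         # stand-in for a data loader loop
    x = torch.randn(2, 10)
    target = torch.randn(2, 1)
    loss = loss_fn(model(x), target) / accum_steps  # scale so accumulated grads match a big batch
    loss.backward()                           # gradients add up in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                      # update with the accumulated gradients
        optimizer.zero_grad()                 # reset for the next window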


9.9 Summary

  • requires_grad=True enables gradient tracking for a tensor.

  • .backward() triggers backpropagation from a scalar output.

  • Gradients are stored in .grad.

  • Use no_grad() or detach() to stop tracking.

  • Autograd builds a dynamic graph as you run — no static declarations.