
Chapter 19: Debugging, Profiling, and Best Practices

“Where code either gets smarter… or gets you fired.”

Let’s wrap up Part IV with the good stuff: not the fancy models or sexy math, but the tools that make sure your code doesn’t silently ruin your entire experiment while you’re staring at a loss curve wondering what went wrong.

This chapter is your battle-tested field guide for debugging, profiling, and writing PyTorch that doesn’t betray you.


19.1 Debugging Tensor Values

First rule of PyTorch debugging: Check your tensors early and often.

print(tensor.shape)
print(torch.isnan(tensor).any())
print(torch.isinf(tensor).any())
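
If you find yourself repeating these checks, a small helper makes them one-liners. This is a minimal sketch; the check_tensor name and message format are illustrative, not a PyTorch API:

import torch

def check_tensor(t, name="tensor"):
    # Quick sanity report: shape, dtype, device, and NaN/Inf flags.
    print(f"{name}: shape={tuple(t.shape)}, dtype={t.dtype}, device={t.device}")
    if torch.isnan(t).any():
        print(f"  WARNING: {name} contains NaN values")
    if torch.isinf(t).any():
        print(f"  WARNING: {name} contains Inf values")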

➤ Check for exploding/vanishing gradients:

for name, param in model.named_parameters():
    if param.grad is not None:
        print(f"{name}: grad norm = {param.grad.norm()}")

19.2 Common Silent Killers

Bug | Symptom | Fix
Using .data | Silently bypasses autograd tracking | Use .detach()
Mixing CPU and CUDA tensors | RuntimeError or silent slowdown | Use .to(device) consistently
Forgetting model.train() | Dropout/BatchNorm behaves incorrectly | Always pair .train() / .eval()
Wrong input shapes | Model runs but outputs garbage | Print input/output shapes before layers
In-place ops | Loss gets stuck / None gradients | Avoid x += y; use x = x + y
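
For the "wrong input shapes" row, forward hooks let you print shapes at every layer without editing the model. A minimal sketch using register_forward_hook (the print format is just an illustration):

import torch

def print_shapes(model):
    # Attach a forward hook to every submodule to report input/output shapes.
    handles = []
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            in_shapes = [tuple(i.shape) for i in inputs if torch.is_tensor(i)]
            out_shape = tuple(output.shape) if torch.is_tensor(output) else type(output)
            print(f"{name}: in={in_shapes} out={out_shape}")
        handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when you are done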

19.3 Debug Mode with torch.autograd.set_detect_anomaly

Use this to catch:

  • In-place ops that break gradients

  • NaNs in backward pass

  • Invalid computation graph paths

with torch.autograd.set_detect_anomaly(True):
    loss.backward()

⚠️ Anomaly detection makes the backward pass noticeably slower, so enable it only while debugging.


19.4 Profiler for Performance Tuning

Basic usage:

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True
) as prof:
    run_training_step()  # placeholder for your forward/backward/step logic

print(prof.key_averages().table(sort_by="cuda_time_total"))

Shows CPU and GPU time per op — useful for finding bottlenecks.
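
If the table is not enough, the same prof object from the block above can export a full trace for visual inspection:

prof.export_chrome_trace("trace.json")  # open in chrome://tracing or Perfetto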


19.5 Tracking GPU Memory

print(torch.cuda.memory_summary(device=None, abbreviated=False))

Common VRAM hogs:

  • Large models without gradient checkpointing (torch.utils.checkpoint; see the sketch after this list)

  • Storing intermediate results (forgetting to .detach())

  • Retaining computation graphs across batches
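
The first item refers to activation (gradient) checkpointing, which recomputes intermediate activations during backward() instead of storing them. A minimal sketch; the two-block model is hypothetical, and use_reentrant=False assumes a recent PyTorch version:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
        self.block2 = nn.Linear(512, 10)

    def forward(self, x):
        # block1's activations are recomputed during backward(), saving VRAM
        x = checkpoint(self.block1, x, use_reentrant=False)
        return self.block2(x)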


19.6 Tips for Clean, Modular Code

Use a consistent device management strategy:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
model = model.to(device)

Structure your project:

  • model.py — architectures

  • train.py — training loop

  • utils.py — reusable functions

  • config.py — hyperparameters

  • debug.py — sanity checkers, asserts
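
As an example of what debug.py might hold, here is a minimal sketch of a shape-assertion helper; the assert_shape name is hypothetical, not part of PyTorch:

def assert_shape(t, expected, name="tensor"):
    # Compare against expected dims; use None as a wildcard (e.g. batch size).
    actual = tuple(t.shape)
    ok = len(actual) == len(expected) and all(
        e is None or a == e for a, e in zip(actual, expected)
    )
    assert ok, f"{name}: expected shape {expected}, got {actual}"

# usage: assert_shape(logits, (None, 10), name="logits")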


19.7 Sanity Check Checklist

✔ Do your model inputs/outputs have expected shapes?
✔ Are .requires_grad flags correctly set?
✔ Is your loss decreasing over time?
✔ Do .grad values explode or vanish?
✔ Did you call .train() and .eval() properly?
✔ Are you detaching everything you log or store?
✔ Are any tensors stuck on CPU while the model is on GPU?
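
Several of these checks can be automated before the first training step. A minimal sketch (the run_sanity_checks name is illustrative, and sample_batch is assumed to be a single input tensor) covering the device, requires_grad, and shape items:

import torch

def run_sanity_checks(model, sample_batch, device):
    # Device consistency: every parameter should live on the same device.
    devices = {p.device for p in model.parameters()}
    assert len(devices) == 1, f"Parameters spread across devices: {devices}"

    # requires_grad: list parameters that will never be updated.
    frozen = [n for n, p in model.named_parameters() if not p.requires_grad]
    if frozen:
        print(f"Frozen parameters (intentional?): {frozen}")

    # Shape check: one forward pass on a sample batch.
    with torch.no_grad():
        out = model(sample_batch.to(device))
    print(f"Sample output shape: {tuple(out.shape)}")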


19.8 Best Practices at a Glance

Practice | Why it Matters
Always zero gradients | Avoid accumulation across batches
Use .detach() for logging | Avoid unwanted graph retention
Profile early | Find slow layers before deployment
Use mixed precision | Save memory, speed up training
Assert shapes regularly | Prevent silent failures
Avoid silent overfitting | Validate early, not just at the end
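
For the "Use mixed precision" row above, PyTorch's automatic mixed precision (AMP) combines autocast with a gradient scaler. A minimal training-step sketch, assuming a CUDA device and an existing model, optimizer, loss_fn, and loader:

import torch

scaler = torch.cuda.amp.GradScaler()

for x, y in loader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # forward pass runs in mixed precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()      # scale loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscales grads, then optimizer.step()
    scaler.update()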

19.9 Summary

Tool / Tip | Use Case
set_detect_anomaly(True) | Catch bad gradients / in-place ops
torch.profiler | Pinpoint slow layers
.grad.norm() monitoring | Detect exploding/vanishing gradients
memory_summary() | See where your VRAM is going
Code modularization | Keeps training and model logic clean

PyTorch is flexible — but that flexibility means you have to be responsible for sanity.
Trust nothing. Print everything. Profile often.