Chapter 17: Using torch with CUDA

“If your tensors aren’t on the GPU, are they even lifting?”


17.1 What is CUDA?

CUDA stands for Compute Unified Device Architecture — NVIDIA’s parallel computing platform.

In PyTorch, it means:

  • Massive speedups via GPU acceleration
  • Easy-to-use APIs to move computation to CUDA
  • Seamless switching between CPU and GPU

PyTorch abstracts CUDA beautifully. If you can use .to('cuda'), you can GPU.
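
A rough timing sketch to see the difference for yourself (the 4096×4096 size is arbitrary, and the numbers depend entirely on your hardware):

import time
import torch

x = torch.randn(4096, 4096)

t0 = time.perf_counter()
x @ x
cpu_time = time.perf_counter() - t0

if torch.cuda.is_available():
    x_gpu = x.to('cuda')
    x_gpu @ x_gpu                 # warm-up: the first call pays one-time setup costs
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    x_gpu @ x_gpu
    torch.cuda.synchronize()      # kernels run asynchronously; wait before stopping the clock
    gpu_time = time.perf_counter() - t0
    print(f"CPU: {cpu_time:.3f}s, GPU: {gpu_time:.3f}s")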


17.2 Check CUDA Availability

Before using CUDA, always check:

import torch
torch.cuda.is_available()  # Returns True if CUDA is ready
torch.cuda.device_count()  # Number of available GPUs
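
If you want more detail, you can also query the build and the hardware (a small sketch; the property lookup assumes at least one visible GPU):

print(torch.version.cuda)  # CUDA version this build targets (None on CPU-only builds)
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, props.total_memory // 2**20, "MiB of VRAM")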

17.3 Setting Your Device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(5, 5).to(device)
model = model.to(device)

You can also specify a GPU index: 'cuda:0', 'cuda:1', etc.
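
Tensors can also be created directly on the device, skipping the CPU allocation and copy entirely (a sketch using the device variable from above; the 'cuda:1' line assumes a second GPU exists):

x = torch.randn(5, 5, device=device)    # allocated on the target device from the start
y = torch.zeros(5, 5, device='cuda:1')  # an explicit index works here too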


17.4 Moving Data to and from GPU

x = torch.tensor([1.0, 2.0])
x_cuda = x.to('cuda')

# Back to CPU
x_cpu = x_cuda.to('cpu')

⚠️ All tensors in an operation must live on the same device.
❌ Mixing CPU and GPU tensors in one op raises a RuntimeError.
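
A minimal sketch of the failure and the fix (assumes a GPU is available):

a = torch.ones(3)                  # CPU tensor
b = torch.ones(3, device='cuda')   # GPU tensor
# a + b                            # RuntimeError: expected all tensors on the same device
c = a.to('cuda') + b               # fix: move a to the GPU first

arr = b.cpu().numpy()              # .numpy() likewise requires a CPU tensor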


17.5 Multi-GPU Usage

➤ List all GPUs:

for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))

➤ Move model to a specific GPU:

model = model.to('cuda:1')

➤ Use DataParallel (basic multi-GPU training):

from torch.nn import DataParallel

model = DataParallel(model)
model = model.to('cuda')

✅ Automatically splits each input batch across the available GPUs
For large-scale training, DistributedDataParallel is preferred (coming in the advanced chapters)
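
Here is a minimal end-to-end DataParallel sketch with a toy model (the layer sizes and batch size are arbitrary):

import torch
from torch import nn

model = nn.Linear(10, 2)                # any nn.Module works
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)      # replicate across all visible GPUs
model = model.to('cuda')

x = torch.randn(64, 10, device='cuda')
out = model(x)                          # the batch of 64 is split along dim 0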


17.6 Memory Management and Stats

➤ Track VRAM usage:

torch.cuda.memory_allocated()  # bytes currently occupied by live tensors
torch.cuda.memory_reserved()   # bytes held by PyTorch's caching allocator

➤ Free unused memory:

torch.cuda.empty_cache()

This doesn’t free memory held by live tensors, and it doesn’t give PyTorch more headroom; it releases cached, unused blocks so other applications can use that memory.
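
A quick sketch of watching the numbers move (the tensor size is arbitrary; a 1024×1024 float32 tensor is about 4 MiB):

before = torch.cuda.memory_allocated()
x = torch.randn(1024, 1024, device='cuda')
print((torch.cuda.memory_allocated() - before) / 2**20, "MiB allocated")
del x
torch.cuda.empty_cache()   # hand the cached block back to the driver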


17.7 Common CUDA Pitfalls

Pitfall                               Fix
------------------------------------  ------------------------------------------------------------
Mixing CPU and GPU tensors            Call .to(device) on both operands before the operation
Forgetting .to(device) on the model   The model silently stays on the CPU and the loss never improves; move it explicitly
Out of memory (OOM)                   Reduce the batch size, or wrap inference in torch.no_grad()
CUDA slower than CPU (tiny model)     Kernel-launch overhead can outweigh the benefit for small workloads
GPU idle, CPU overloaded              Set num_workers in the DataLoader and enable pin_memory (see the sketch below)
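
A sketch of the DataLoader settings from the last row (the dataset is a toy placeholder, and num_workers=4 is a value to tune per machine; non_blocking=True pairs with pin_memory to overlap copies with compute):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

for xb, yb in loader:
    xb = xb.to('cuda', non_blocking=True)
    yb = yb.to('cuda', non_blocking=True)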

17.8 AMP (Automatic Mixed Precision)

For faster training with less memory usage:

from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()                    # scales the loss to avoid fp16 gradient underflow

for input, target in dataloader:
    optimizer.zero_grad()
    with autocast():                     # forward pass runs in mixed precision
        output = model(input)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()        # backward on the scaled loss
    scaler.step(optimizer)               # unscales gradients, then steps
    scaler.update()                      # adjusts the scale factor for the next step

AMP = ⚡ Speed + 💾 Efficiency without major code rewrites.
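
Note: recent PyTorch releases (2.3+) expose the same tools under the device-agnostic torch.amp namespace, and newer versions prefer that spelling over torch.cuda.amp. The equivalent sketch:

from torch.amp import autocast, GradScaler

scaler = GradScaler('cuda')    # same scaler, device named explicitly
with autocast('cuda'):         # same autocast, device named explicitly
    output = model(input)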


17.9 Benchmark Settings (cuDNN)

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False

  • Set benchmark=True to let cuDNN auto-tune convolution algorithms (most useful when input sizes don’t vary between batches)
  • Set deterministic=True (and benchmark=False) if you need exact reproducibility, at some speed cost (see the sketch below)
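
A common reproducibility setup, as referenced in the second bullet (a sketch; full determinism may also require seeding Python’s random and NumPy, omitted here):

import torch

torch.manual_seed(42)                        # arbitrary seed
torch.backends.cudnn.deterministic = True    # choose deterministic conv algorithms
torch.backends.cudnn.benchmark = False       # disable auto-tuning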

17.10 Summary

Action                    Code Example
------------------------  -----------------------------------------------
Set device                device = torch.device("cuda")
Move tensor/model         .to(device)
Multi-GPU (basic)         torch.nn.DataParallel(model)
Monitor memory usage      memory_allocated(), empty_cache()
Mixed precision training  torch.cuda.amp.autocast()

  • GPUs = speed — use them wisely

  • .to(device) is your best friend — for tensors, models, inputs, and labels

  • Track your memory, use AMP for large models, and never mix CPU + CUDA in a single operation