Chapter 17: Using torch with CUDA¶
“If your tensors aren’t on the GPU, are they even lifting?”
17.1 What is CUDA?¶
CUDA stands for Compute Unified Device Architecture — NVIDIA’s parallel computing platform.
In PyTorch, it means:
- Massive speedups via GPU acceleration
- Easy-to-use APIs to move computation to CUDA
- Seamless switching between CPU and GPU
PyTorch abstracts CUDA beautifully. If you can call .to('cuda'), you can GPU.
17.2 Check CUDA Availability¶
Before using CUDA, always check:
import torch
torch.cuda.is_available() # Returns True if CUDA is ready
torch.cuda.device_count() # Number of available GPUs
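A quick sketch that puts these checks together (the printed counts will of course depend on your machine):
import torch

# A defensive check before doing any GPU work
if torch.cuda.is_available():
    print(f"CUDA is available with {torch.cuda.device_count()} GPU(s)")
    print(f"Current device index: {torch.cuda.current_device()}")
else:
    print("No CUDA device found; running on CPU")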
17.3 Setting Your Device¶
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(5, 5).to(device)
model = model.to(device)
You can also specify a GPU index: 'cuda:0', 'cuda:1', etc.
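For example, a hedged sketch that pins work to the second GPU when one exists and falls back to the CPU otherwise:
# Use the second GPU (index 1) if the machine has at least two, otherwise the CPU
device = torch.device('cuda:1' if torch.cuda.device_count() > 1 else 'cpu')
x = torch.randn(5, 5, device=device)  # create the tensor directly on that device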
17.4 Moving Data to and from GPU¶
x = torch.tensor([1.0, 2.0])
x_cuda = x.to('cuda')
# Back to CPU
x_cpu = x_cuda.to('cpu')
⚠️ Tensors must be on the same device for math to work.
❌ CPU-GPU ops will crash with RuntimeError.
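A small sketch of what that mismatch looks like in practice (the exact RuntimeError wording varies across PyTorch versions):
a = torch.randn(3)             # lives on the CPU
b = torch.randn(3).to('cuda')  # lives on the GPU

# a + b  # RuntimeError: expected all tensors to be on the same device
c = a.to('cuda') + b           # works: both operands are now on the GPU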
17.5 Multi-GPU Usage¶
➤ List all GPUs:¶
for i in range(torch.cuda.device_count()):
    print(torch.cuda.get_device_name(i))
➤ Move model to a specific GPU:¶
model = model.to('cuda:1')
➤ Use DataParallel (basic multi-GPU training):¶
from torch.nn import DataParallel
model = DataParallel(model)
model = model.to('cuda')
✅ Automatically splits input batches across the available GPUs
For large-scale training, DistributedDataParallel is preferred (coming in the advanced chapters).
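Here is a minimal DataParallel sketch with a toy nn.Linear model, assuming at least one visible GPU; with several GPUs the batch of 64 is split along dim 0 and the gathered output lands on cuda:0:
import torch
from torch import nn

# Toy model: each GPU runs its own chunk of the batch, outputs are gathered back
model = nn.DataParallel(nn.Linear(10, 2)).to('cuda')
batch = torch.randn(64, 10).to('cuda')   # inputs go to the primary device

out = model(batch)
print(out.shape)  # torch.Size([64, 2])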
17.6 Memory Management and Stats¶
➤ Track VRAM usage:¶
torch.cuda.memory_allocated()
torch.cuda.memory_reserved()
➤ Free unused memory:¶
torch.cuda.empty_cache()
This doesn’t free memory held by live tensors; it only releases cached, unused blocks so they become available to other applications.
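A small sketch that turns those byte counters into readable megabytes:
# Convert the raw byte counters into megabytes
allocated_mb = torch.cuda.memory_allocated() / 1024**2  # memory occupied by live tensors
reserved_mb = torch.cuda.memory_reserved() / 1024**2    # memory held by the caching allocator
print(f"allocated: {allocated_mb:.1f} MB | reserved: {reserved_mb:.1f} MB")

torch.cuda.empty_cache()  # release cached, unused blocks back to the driver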
17.7 Common CUDA Pitfalls¶
Pitfall | Fix |
---|---|
Mixing CPU and GPU tensors | .to(device) both inputs before operations |
Forgetting .to(device) on the model | Call model.to(device); otherwise it stays on the CPU and the loss never goes down |
Out of memory (OOM) | Reduce batch size or use with torch.no_grad() |
CUDA slower than CPU (tiny model) | CUDA overhead may outweigh benefits |
GPU idle, CPU overloaded | Use num_workers and pin_memory in the DataLoader (see the sketch below) |
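For the last pitfall, a sketch of a DataLoader setup that keeps the GPU fed; the dataset, batch size, and worker count are placeholder values:
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder dataset: 1000 random samples with 10 features and a binary label
dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

for inputs, targets in loader:
    # With pin_memory=True, non_blocking=True lets the copy overlap with GPU compute
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    # ... forward / backward pass ...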
17.8 AMP (Automatic Mixed Precision)¶
For faster training with less memory usage:
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for input, target in dataloader:
    optimizer.zero_grad()
    with autocast():                   # run the forward pass in mixed precision
        output = model(input)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()      # scale the loss to avoid underflowing fp16 gradients
    scaler.step(optimizer)             # unscale gradients, then take the optimizer step
    scaler.update()                    # adjust the scale factor for the next iteration
AMP = ⚡ Speed + 💾 Efficiency without major code rewrites.
17.9 Benchmark Settings (cuDNN)¶
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
- Set benchmark=True to let PyTorch auto-optimize convolution performance
- Set deterministic=True if you need exact reproducibility
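A minimal sketch of the reproducibility-first configuration (the seed value is arbitrary):
# Trade speed for reproducibility: fix the seed and disable auto-tuning
torch.manual_seed(42)
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True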
17.10 Summary¶
Action | Code Example |
---|---|
Set device | device = torch.device("cuda") |
Move tensor/model | .to(device) |
Multi-GPU (basic) | torch.nn.DataParallel(model) |
Monitor memory usage | memory_allocated(), empty_cache() |
Mixed precision training | torch.cuda.amp.autocast() |
- GPUs = speed; use them wisely
- .to(device) is your best friend: tensors, models, inputs, and labels all need it
- Track your memory, use AMP for large models, and never mix CPU + CUDA in a single operation