# Chapter 13: Loss Functions and Optimizers
“A network doesn’t improve by magic—it learns by failing. The loss is the pain, the optimizer is the cure.”
## Why This Chapter Matters
When training a CNN, we want it to:
- Make better predictions over time
- Improve by adjusting its weights through gradients
But how do we quantify wrong predictions? That’s where loss functions come in.
And once we have loss, how do we adjust the network to reduce it? That’s the job of optimizers.
Together, they are the learning engine of your CNN:
- The loss function tells the model how wrong it is.
- The optimizer updates the weights to make it less wrong next time.
Understanding both is vital for:
- Choosing the right learning strategy
- Debugging model collapse or instability
- Fine-tuning pretrained models
## Conceptual Breakdown

### 🔹 What Is a Loss Function?
A loss function computes the difference between:
- The model’s predicted output (logits/probabilities)
- The true label (target)
The output is a single scalar, and the computation is differentiable so gradients can flow back through the network.
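To make this concrete, here is a minimal PyTorch sketch (the tensor values are invented for illustration) showing that the loss comes out as one differentiable scalar:

```python
import torch
import torch.nn as nn

# Invented batch: 2 samples, 3 classes. `requires_grad` stands in for
# the model parameters that would normally receive the gradients.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]], requires_grad=True)
targets = torch.tensor([0, 1])  # true class indices

criterion = nn.CrossEntropyLoss()
loss = criterion(logits, targets)

print(loss.shape)   # torch.Size([]) -- a 0-dim scalar tensor
loss.backward()     # gradients flow back to `logits`
print(logits.grad)  # same shape as `logits`
```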
### 🔹 Types of Loss Functions
| Loss Function | Use Case | PyTorch | TensorFlow |
|---|---|---|---|
| CrossEntropyLoss | Multi-class classification | `nn.CrossEntropyLoss()` | `SparseCategoricalCrossentropy(from_logits=True)` |
| BCEWithLogitsLoss | Binary classification (with logits) | `nn.BCEWithLogitsLoss()` | `BinaryCrossentropy(from_logits=True)` |
| MSELoss | Regression or feature matching | `nn.MSELoss()` | `MeanSquaredError()` |
📌 For classification tasks:
- In PyTorch, pass raw logits straight to `CrossEntropyLoss`; it applies log-softmax internally, so do not add a softmax layer before it.
- In TensorFlow, set `from_logits=True` when the model outputs raw logits; if the final layer already applies softmax, use `from_logits=False` (the default). The `Sparse` variants take integer labels, while the plain `Categorical` variants take one-hot labels.
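As a quick illustration of the logits rule (the tensors here are random placeholders), applying softmax yourself before PyTorch's `CrossEntropyLoss` silently computes the wrong loss, because softmax ends up applied twice:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 10)            # raw scores: 4 samples, 10 classes
labels = torch.randint(0, 10, (4,))

criterion = nn.CrossEntropyLoss()

loss_ok = criterion(logits, labels)                         # correct: raw logits in
loss_bad = criterion(torch.softmax(logits, dim=1), labels)  # bug: softmax applied twice

print(loss_ok.item(), loss_bad.item())  # the two values differ
```

The TensorFlow equivalent of the same rule is the `from_logits` flag shown in the table above.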
### 🔹 What Is an Optimizer?
An optimizer updates weights using the gradients computed via backpropagation.
Optimizers apply:
- A learning rate (`lr`)
- Momentum, adaptive step sizes, or regularization
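To demystify what `optimizer.step()` actually does, here is a sketch of a bare-bones SGD update written by hand (using a throwaway `nn.Linear` as a stand-in model; real optimizers layer momentum and adaptive scaling on top of this):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                          # stand-in model
x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()                                  # fills param.grad for every parameter

lr = 0.01
with torch.no_grad():
    for param in model.parameters():
        param -= lr * param.grad                 # step against the gradient
        param.grad = None                        # reset before the next batch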
### 🔹 Common Optimizers
| Optimizer | Behavior | Best For |
|---|---|---|
| SGD | Basic gradient descent | Simple, interpretable tasks |
| SGD + Momentum | Adds velocity to updates | Faster convergence |
| Adam | Adaptive step size per parameter | Most deep learning tasks |
| RMSprop | Adaptive steps like Adam, minus momentum | Noisy gradients (e.g., RNNs) |
📌 Start with Adam. Move to SGD + Momentum for fine-tuning large models.
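For reference, here is a sketch of how these four are typically constructed in PyTorch (the model and hyperparameter values are placeholders; tune `lr` for your task):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # placeholder model

sgd     = optim.SGD(model.parameters(), lr=0.01)
sgd_mom = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam    = optim.Adam(model.parameters(), lr=1e-3)
rmsprop = optim.RMSprop(model.parameters(), lr=1e-3)
```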
### 🔹 Visualizing Gradient Flow
Think of:
- Loss as elevation
- Gradients as slope
- Optimizer as the hiker moving downhill
A bad loss or a bad optimizer leaves the hiker stuck in a valley; a good setup gives a smooth descent to a better model.
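Here is a tiny, self-contained version of that picture: gradient descent walking down the 1-D "landscape" loss(w) = (w − 3)², whose valley floor sits at w = 3.

```python
import torch

w = torch.tensor(0.0, requires_grad=True)   # hiker starts at w = 0
opt = torch.optim.SGD([w], lr=0.1)

for step in range(20):
    loss = (w - 3) ** 2                      # elevation
    opt.zero_grad()
    loss.backward()                          # slope at the current position
    opt.step()                               # one downhill stride

print(w.item())  # close to 3.0 -- the bottom of the valley
```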
## PyTorch Implementation

### 🔸 CrossEntropy Loss + Adam Optimizer
```python
import torch
import torch.nn as nn
import torch.optim as optim

model = MiniCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Dummy training step
for images, labels in train_loader:
    optimizer.zero_grad()              # Reset gradients from the previous batch
    outputs = model(images)            # Forward pass
    loss = criterion(outputs, labels)  # Compute loss
    loss.backward()                    # Backpropagation
    optimizer.step()                   # Update weights
```

A common bug is to forget `optimizer.zero_grad()`: PyTorch accumulates gradients by default, so skipping it mixes gradients from earlier batches into each update.
### 🔸 Switch to SGD with Momentum

```python
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```
## TensorFlow Implementation

### 🔸 CrossEntropy Loss + Adam Optimizer
```python
import tensorflow as tf

model = MiniCNN()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

# Dummy training step (eager)
with tf.GradientTape() as tape:
    predictions = model(images, training=True)  # Forward pass
    loss = loss_fn(labels, predictions)         # Compute loss

grads = tape.gradient(loss, model.trainable_variables)            # Backpropagation
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # Update weights
```
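For completeness, the same loss/optimizer pairing can be handed to Keras's built-in training loop instead of a manual `GradientTape` step. This sketch assumes `train_dataset` is a `tf.data.Dataset` yielding `(images, labels)` batches:

```python
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_dataset, epochs=5)
```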
### 🔸 Use SGD Instead

```python
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
```
## Framework Comparison Table
| Component | PyTorch | TensorFlow |
|---|---|---|
| Loss function | `nn.CrossEntropyLoss()` | `tf.keras.losses.SparseCategoricalCrossentropy()` |
| Loss with logits | Built in (expects raw logits) | `from_logits=True` |
| Optimizer | `optim.Adam(...)` | `tf.keras.optimizers.Adam(...)` |
| Update weights | `loss.backward()` + `optimizer.step()` | `GradientTape()` + `apply_gradients()` |
| Zero gradients | `optimizer.zero_grad()` | Not needed; each `GradientTape` computes fresh gradients |
## Mini-Exercise
- Build a CNN for 10-class image classification.
- Train it with both CrossEntropy + Adam and CrossEntropy + SGD with momentum.
- Log the loss per batch and plot the curve after 5 epochs (a starter sketch follows below).
- Try tweaking:
  - The learning rate (`lr`)
  - The loss function (`BCEWithLogitsLoss` for binary tasks)
- Observe: does one optimizer converge faster? Does one oscillate?

Bonus: Add weight decay (L2 regularization) to SGD and test performance.
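One possible starting skeleton for the logging and plotting part (PyTorch; `MiniCNN` and `train_loader` are the same placeholders used earlier in this chapter):

```python
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt

model = MiniCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)  # swap in SGD + momentum to compare
# Bonus: optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# adds L2 regularization via the weight_decay argument.

batch_losses = []
for epoch in range(5):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        batch_losses.append(loss.item())  # log loss per batch

plt.plot(batch_losses)
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.show()
```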