Chapter 16: Loss Functions & Optimizers

Without a compass, even the smartest network gets lost. Loss guides learning. Optimization moves us forward.


In this chapter, we explore two of the most crucial ingredients of any machine learning recipe:

  • Loss functions: Measure how far off our model’s predictions are from the actual values.
  • Optimizers: Algorithms that adjust model parameters to minimize this loss.

By the end, you'll understand:

  • The difference between various loss functions and when to use them
  • How gradients are computed and used
  • Popular optimization algorithms and their trade-offs
  • How to implement custom loss functions and plug them into training

What Is a Loss Function?

A loss function tells us how “bad” our predictions are. It returns a single scalar value, and it is this scalar that lets TensorFlow compute gradients for every parameter via backpropagation.
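
Because the loss is one differentiable scalar, tf.GradientTape can produce a gradient for each trainable variable. A minimal sketch with toy values (not from the chapter):

import tensorflow as tf

# One trainable weight and a squared-error loss (toy values).
w = tf.Variable(2.0)
x, y_true = 3.0, 9.0

with tf.GradientTape() as tape:
    y_pred = w * x                     # y_pred = 6.0
    loss = tf.square(y_true - y_pred)  # scalar: (9 - 6)^2 = 9

grad = tape.gradient(loss, w)  # d(loss)/dw = 2 * (y_pred - y_true) * x = -18
print(loss.numpy(), grad.numpy())  # 9.0 -18.0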

🔹 Common Losses in TensorFlow

  • Binary classification: Binary Crossentropy, tf.keras.losses.BinaryCrossentropy()
  • Multi-class classification (integer labels): Sparse Categorical Crossentropy, tf.keras.losses.SparseCategoricalCrossentropy()
  • Regression (real values): Mean Squared Error, tf.keras.losses.MeanSquaredError()
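
To make the list concrete, here is a quick sanity check of two of these losses on hand-picked, illustrative values:

import tensorflow as tf

# Binary crossentropy on probabilities: confident, correct predictions.
bce = tf.keras.losses.BinaryCrossentropy()
print(bce([1.0, 0.0], [0.9, 0.1]).numpy())  # ~0.105 (low loss)

# Mean squared error: the average of the squared differences.
mse = tf.keras.losses.MeanSquaredError()
print(mse([1.0, 2.0], [1.5, 2.5]).numpy())  # 0.25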

Example: Sparse Categorical Crossentropy

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
loss = loss_fn(y_true, y_pred)  # y_true: integer class labels, y_pred: logits

  • from_logits=True means the model outputs raw values (logits) without softmax.
  • If your model already applies softmax to its outputs, set from_logits=False. The quick check below shows the two settings agree when used consistently.
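
Here is that check, feeding the same predictions through both settings (made-up logits, a sketch rather than chapter code):

import tensorflow as tf

# Toy logits for 2 samples over 3 classes (illustrative values).
logits = tf.constant([[2.0, 1.0, 0.1],
                      [0.5, 2.5, 0.3]])
labels = tf.constant([0, 1])

from_logits = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
from_probs = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)

# Both print the same loss (up to float rounding):
print(from_logits(labels, logits).numpy())
print(from_probs(labels, tf.nn.softmax(logits)).numpy())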

What Are Optimizers?

Optimizers update model parameters using gradients computed from the loss. They are essential for gradient descent-based training.
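
At their core, all of them refine the same basic update, new_weight = weight - learning_rate * gradient. A hand-rolled sketch on a toy one-variable loss:

import tensorflow as tf

# Plain gradient descent by hand (toy setup): minimize loss = w^2.
w = tf.Variable(5.0)
learning_rate = 0.1

for _ in range(3):
    with tf.GradientTape() as tape:
        loss = tf.square(w)             # minimized at w = 0
    grad = tape.gradient(loss, w)       # d(loss)/dw = 2w
    w.assign_sub(learning_rate * grad)  # w <- w - lr * grad
    print(w.numpy())                    # 4.0, 3.2, 2.56: sliding toward 0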

🔹 Popular Optimizers

  • SGD: plain Stochastic Gradient Descent. SGD(learning_rate=0.01)
  • Momentum: adds inertia to SGD by accumulating past gradients. SGD(learning_rate=0.01, momentum=0.9)
  • RMSProp: adapts the learning rate based on recent gradient magnitudes. RMSprop(learning_rate=0.001)
  • Adam: combines Momentum and RMSProp. Adam(learning_rate=0.001)
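
To see what “inertia” means for Momentum, here is a hand-rolled sketch of the update rule (illustrative, not the exact Keras internals):

import tensorflow as tf

# Momentum: velocity accumulates past gradients, so steps build up speed.
w = tf.Variable(5.0)
velocity = tf.Variable(0.0)
lr, momentum = 0.1, 0.9

for _ in range(3):
    with tf.GradientTape() as tape:
        loss = tf.square(w)
    grad = tape.gradient(loss, w)
    velocity.assign(momentum * velocity - lr * grad)  # inertia term
    w.assign_add(velocity)
    print(w.numpy())  # 4.0, 2.3, ~0.31: faster than plain SGD above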

Example: Compile with Optimizer

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)
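
Once compiled, the loss and optimizer are applied automatically during training. A typical call, assuming x_train and y_train are already prepared (integer class labels, to match the sparse loss above):

model.fit(x_train, y_train, epochs=5, batch_size=32)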

Custom Loss Function

Sometimes, built-in loss functions aren’t enough. Here’s how you can define your own:

def custom_mse_loss(y_true, y_pred):
    # Mean of squared errors; equivalent to tf.keras.losses.MeanSquaredError.
    return tf.reduce_mean(tf.square(y_true - y_pred))

Plug it into the model like this:

model.compile(
    optimizer='adam',
    loss=custom_mse_loss
)
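
A quick sanity check on toy tensors confirms the custom loss matches the built-in one:

import tensorflow as tf

y_true = tf.constant([0.0, 3.0])
y_pred = tf.constant([1.0, 1.0])

print(custom_mse_loss(y_true, y_pred).numpy())                     # 2.5
print(tf.keras.losses.MeanSquaredError()(y_true, y_pred).numpy())  # 2.5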

Custom Training Loop (Optional Recap)

When you are not using model.fit(), you compute the loss and apply the gradients yourself:

with tf.GradientTape() as tape:
    logits = model(x_batch, training=True)  # forward pass
    loss_value = loss_fn(y_batch, logits)   # scalar loss

grads = tape.gradient(loss_value, model.trainable_variables)      # backprop
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # update weights

This gives you full control over training and is often used in research or advanced custom workflows.
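
Put together, a minimal end-to-end sketch of such a loop (model, train_dataset, and num_epochs are placeholders you would define yourself):

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

for epoch in range(num_epochs):
    for x_batch, y_batch in train_dataset:
        with tf.GradientTape() as tape:
            logits = model(x_batch, training=True)
            loss_value = loss_fn(y_batch, logits)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    print(f"epoch {epoch}: last batch loss = {loss_value.numpy():.4f}")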


Summary

In this chapter, you learned:

  • Loss functions quantify how wrong a model’s predictions are.
  • Optimizers use gradients to update model weights and minimize loss.
  • Adam is a great default optimizer, but others may work better depending on the problem.
  • You can define custom loss functions for flexibility.

Understanding the relationship between loss → gradient → optimizer → new weights is the key to mastering how neural networks learn.