
Chapter 11: Model Evaluation Metrics

"The map is not the territory, but metrics are our compass in the landscape of model performance."


Learning Objectives

By the end of this chapter, you will be able to:

  • Understand the purpose and limitations of common evaluation metrics for classification models
  • Compute and interpret accuracy, precision, recall, F1-score, and confusion matrix
  • Generate and analyze Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves
  • Identify situations where different metrics provide conflicting information and choose appropriate metrics based on the problem context

Intuitive Introduction

Imagine you're building a spam email classifier. Your model correctly classifies 95% of emails. Sounds great, right? But what if 99% of emails are not spam? A model that simply labels every email as "not spam" would score 99% accuracy while catching zero spam, so your 95% figure tells you very little on its own. This highlights why accuracy alone is insufficient.

In real-world applications, the cost of different types of errors varies. In medical diagnosis, missing a disease (false negative) might be more critical than a false alarm (false positive). In fraud detection, incorrectly flagging legitimate transactions (false positive) could annoy customers, while missing fraud (false negative) could lead to financial loss.

This chapter explores evaluation metrics that go beyond simple accuracy, providing a nuanced view of model performance. We'll start with basic metrics, then move to more sophisticated curve-based evaluations that help us understand trade-offs between different types of errors.


Mathematical Development

Classification models make predictions that can be categorized into four outcomes relative to the true labels:

  • True Positive (TP): Correctly predicted positive class
  • True Negative (TN): Correctly predicted negative class
  • False Positive (FP): Incorrectly predicted positive class (Type I error)
  • False Negative (FN): Incorrectly predicted negative class (Type II error)

These form the foundation of most evaluation metrics.

Basic Metrics

Accuracy measures the overall correctness of predictions:

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

While straightforward, accuracy can be misleading in imbalanced datasets where one class dominates.

Precision (also called Positive Predictive Value) measures the accuracy of positive predictions:

\[\text{Precision} = \frac{TP}{TP + FP}\]

Precision answers: "Of all instances predicted as positive, how many were actually positive?"

Recall (also called Sensitivity or True Positive Rate) measures the model's ability to find all positive instances:

\[\text{Recall} = \frac{TP}{TP + FN}\]

Recall answers: "Of all actual positive instances, how many did we correctly identify?"

F1-Score provides a balanced measure combining precision and recall:

\[\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\]

The F1-score is the harmonic mean of precision and recall, giving equal weight to both metrics.
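
As a quick worked example with made-up counts, suppose TP = 40, FP = 10, FN = 20, and TN = 30:

\[\text{Precision} = \frac{40}{40 + 10} = 0.80, \quad \text{Recall} = \frac{40}{40 + 20} \approx 0.67, \quad \text{F1} = 2 \times \frac{0.80 \times 0.67}{0.80 + 0.67} \approx 0.73\]

Accuracy here is (40 + 30)/100 = 0.70, illustrating that the four metrics generally tell different stories about the same set of predictions.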

Confusion Matrix

The confusion matrix organizes these four outcomes into a table:

                    Predicted Negative      Predicted Positive
Actual Negative     True Negative (TN)      False Positive (FP)
Actual Positive     False Negative (FN)     True Positive (TP)

This matrix provides a complete picture of model performance across all prediction outcomes.

Curve-Based Metrics

Receiver Operating Characteristic (ROC) Curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings:

\[\text{TPR} = \frac{TP}{TP + FN} = \text{Recall}\]
\[\text{FPR} = \frac{FP}{FP + TN}\]

The Area Under the ROC Curve (AUC-ROC) summarizes the entire curve as a single number: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. An AUC of 1.0 represents perfect classification, while 0.5 represents random guessing.

Precision-Recall (PR) Curve plots precision against recall at different thresholds. The Area Under the PR Curve (AUC-PR) provides a summary measure that is particularly useful for imbalanced datasets, because neither precision nor recall involves the count of true negatives, which dominates when negatives vastly outnumber positives.
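
To make the threshold sweep concrete, the sketch below (with made-up labels and scores) computes one (FPR, TPR) point per candidate threshold by hand; roc_curve, covered in the Implementation Guide, automates exactly this process over every distinct score:

import numpy as np

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_prob >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    tpr = tp / (tp + fn)  # recall / sensitivity
    fpr = fp / (fp + tn)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")

Lowering the threshold moves you toward the upper-right of the ROC plot (more positives of both kinds); raising it moves you toward the lower-left.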

For web sources on these metrics, see:

  • Scikit-learn documentation: https://scikit-learn.org/stable/modules/model_evaluation.html
  • Wikipedia articles on precision and recall, and on ROC curves


Implementation Guide

Scikit-learn provides comprehensive tools for computing these metrics through the sklearn.metrics module. Let's explore the key functions:

Basic Metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Assuming y_true and y_pred are your true labels and predictions
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# For multi-class problems, specify average method
precision_macro = precision_score(y_true, y_pred, average='macro')
recall_micro = recall_score(y_true, y_pred, average='micro')

Parameter Explanations:

  • average: Controls how per-class scores are combined in multi-class problems (the short example after this list shows how the options can differ):
      • 'macro': Compute the metric for each class and take the unweighted mean (equal weight per class)
      • 'micro': Compute the metric globally by counting total TP, FP, and FN across all classes
      • 'weighted': Mean of per-class scores, weighted by each class's support
      • None: Return the score for each class separately
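
Here is a small illustration of how the averaging choices can diverge, using hypothetical three-class labels in which the rare class is never predicted:

from sklearn.metrics import precision_score

# Hypothetical 3-class labels: class 2 is rare and never predicted
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# 'macro' averages per-class precision equally, so the missing class drags it down to 0.5
# 'micro' pools all decisions, so the dominant classes keep it at 0.75
print(precision_score(y_true, y_pred, average='macro', zero_division=0))
print(precision_score(y_true, y_pred, average='micro', zero_division=0))
print(precision_score(y_true, y_pred, average=None, zero_division=0))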

Confusion Matrix

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Rows correspond to actual classes, columns to predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For visualization
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot()
plt.show()
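
For binary problems it is often handy to unpack the four counts directly; with scikit-learn's layout (rows = actual, columns = predicted), ravel() returns them in the order TN, FP, FN, TP:

# Flatten the 2x2 matrix into its four counts (order: TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")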

Classification Report

The classification_report function provides a comprehensive summary:

from sklearn.metrics import classification_report

report = classification_report(y_true, y_pred, target_names=['Negative', 'Positive'])
print(report)

This outputs precision, recall, F1-score, and support for each class.

ROC and PR Curves

from sklearn.metrics import roc_curve, auc, precision_recall_curve
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay

# Get prediction probabilities (not just classes)
y_prob = model.predict_proba(X_test)[:, 1]  # Probability of positive class

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc)
roc_display.plot()

# Precision-Recall Curve
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
pr_auc = auc(recall, precision)

# Plot PR
pr_display = PrecisionRecallDisplay(precision=precision, recall=recall, average_precision=pr_auc)
pr_display.plot()
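
One caveat: auc(recall, precision) applies the trapezoidal rule, which linearly interpolates between PR points and can be slightly optimistic. Scikit-learn's average_precision_score is the more commonly reported summary and avoids that interpolation:

from sklearn.metrics import average_precision_score

# Step-wise summary of the PR curve, computed directly from the probabilities
ap = average_precision_score(y_true, y_prob)
print(f"Average precision: {ap:.3f}")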

Key Parameters:

  • roc_curve(y_true, y_score): y_score should be prediction probabilities or confidence scores, not hard class labels
  • precision_recall_curve: Takes the same kind of y_score input
  • pos_label: Specifies which class is treated as positive (by default, the label 1)
  • average_precision_score: For multilabel problems, the average parameter controls how per-label scores are combined

Practical Applications

Let's apply these metrics to a real dataset. We'll use the breast cancer dataset from scikit-learn to demonstrate evaluation techniques:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, auc, precision_recall_curve)

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Compute basic metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Classification Report
print("\nClassification Report:")
# In scikit-learn's breast cancer data, class 0 is malignant and class 1 is benign
print(classification_report(y_test, y_pred, target_names=data.target_names))

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")

# Precision-Recall Curve
precision, recall, _ = precision_recall_curve(y_test, y_prob)
pr_auc = auc(recall, precision)

plt.subplot(1, 2, 2)
plt.plot(recall, precision, color='blue', lw=2, label=f'PR curve (AUC = {pr_auc:.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower left")

plt.tight_layout()
plt.show()

Interpreting Results:

In this breast cancer example (where the positive class, label 1, is benign), we see:

  • High accuracy (around 97%), but the confusion matrix is worth a closer look
  • The model correctly identified roughly 105 of the benign cases but missed 3 (false negatives: benign tumors predicted as malignant)
  • It incorrectly classified 2 malignant cases as benign (false positives)

Note that because benign is the positive class here, a "false positive" is actually a missed malignancy, which is the clinically serious error; the meaning of FP and FN always depends on which class you designate as positive.

The ROC curve shows excellent discriminative ability (AUC ≈ 0.99), while the PR curve shows that precision stays high across nearly the full recall range.


Expert Insights

When Metrics Disagree

Different metrics can tell conflicting stories about model performance:

  • High Accuracy, Low Precision/Recall: Common in imbalanced datasets where the model predicts the majority class
  • High Precision, Low Recall: Conservative models that only predict positive when very confident, missing many true positives
  • High Recall, Low Precision: Aggressive models that cast a wide net, catching most positives but with many false alarms

Choosing the Right Metric

  • Balanced datasets: Accuracy or F1-score
  • Imbalanced datasets: Precision, Recall, or F1-score depending on the cost of errors
  • Medical diagnosis: Prioritize Recall (catch all diseases) or use domain-specific thresholds
  • Spam detection: Balance Precision (avoid false positives) and Recall
  • Fraud detection: Often prioritize Recall to catch fraudulent transactions

Threshold Selection

Most metrics depend on the classification threshold (default 0.5). In practice:

  • Use ROC/PR curves to visualize performance across thresholds
  • Choose the threshold based on business requirements (e.g., a cost-benefit analysis); the sketch after this list shows how to apply a custom threshold
  • Consider probability calibration for reliable probability estimates
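
As a minimal sketch, assuming y_test and the positive-class probabilities y_prob from the breast cancer example above (for a binary logistic regression model, predict() is equivalent to thresholding at 0.5):

from sklearn.metrics import precision_score, recall_score

# Hypothetical threshold chosen from a cost-benefit analysis
threshold = 0.3

# Convert positive-class probabilities into hard labels at the custom threshold
y_pred_custom = (y_prob >= threshold).astype(int)

# Lowering the threshold trades precision for recall; raising it does the opposite
print(f"Precision at {threshold}: {precision_score(y_test, y_pred_custom):.3f}")
print(f"Recall at {threshold}: {recall_score(y_test, y_pred_custom):.3f}")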

Common Pitfalls

  • Data leakage: Evaluating on training data leads to overly optimistic metrics
  • Class imbalance: Accuracy can be misleading; prefer stratified splits and imbalance-aware metrics such as precision, recall, or AUC-PR
  • Multi-class confusion: Micro vs macro averaging can give different results
  • Threshold dependence: Metrics change with classification threshold

Computational Considerations

  • Most metrics are O(n) in computation time
  • ROC/PR curve computation involves sorting predictions: O(n log n)
  • For large datasets, consider sampling for curve plotting

Self-Check Questions

  1. Why is accuracy insufficient for evaluating models on imbalanced datasets?
  2. What does a high precision but low recall indicate about a model's behavior?
  3. When would you prefer AUC-PR over AUC-ROC for model evaluation?
  4. How does the choice of classification threshold affect precision and recall?

Try This Exercise

Evaluate a Model on an Imbalanced Dataset

  1. Load the credit card fraud dataset from Kaggle (or use make_classification with class imbalance)
  2. Train a logistic regression model
  3. Compute accuracy, precision, recall, and F1-score
  4. Generate ROC and PR curves
  5. Compare performance when using different classification thresholds (0.1, 0.5, 0.9)
  6. Analyze how the confusion matrix changes with threshold

Expected Outcome: You'll observe how accuracy remains high while precision and recall vary significantly, demonstrating the importance of choosing appropriate metrics for imbalanced problems.
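
If you prefer not to download the Kaggle dataset, here is a minimal starting point using make_classification with a synthetic 95/5 class imbalance (the weights value and model settings are illustrative assumptions, not a prescribed setup):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic imbalanced binary problem: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=10_000, n_features=20, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# Compare hard labels produced at the three thresholds from the exercise
for threshold in [0.1, 0.5, 0.9]:
    y_pred = (y_prob >= threshold).astype(int)
    print(f"--- threshold = {threshold} ---")
    print(classification_report(y_test, y_pred, zero_division=0))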


Builder's Insight

Model evaluation isn't just about picking the "best" number—it's about understanding your model's behavior in the context of your application. A model with 90% accuracy might be perfect for one use case but completely inadequate for another where specific types of errors are costly.

Remember: Your evaluation metrics should reflect the real-world impact of your model's decisions. Choose metrics that align with business objectives, not just mathematical convenience. The most sophisticated model is worthless if it doesn't solve the right problem.

As you progress in your machine learning journey, developing intuition for when and how to apply different metrics will become as important as understanding the algorithms themselves.