Chapter 11: K-Means Clustering

"Groups emerge not from similarity of appearance, but from shared patterns of behavior."


Learning Objectives

By the end of this chapter, you will be able to:

  • Understand the mathematical foundations of K-means clustering and the concept of centroids
  • Implement K-means clustering in scikit-learn with appropriate parameter selection
  • Apply the elbow method and silhouette analysis for determining optimal cluster numbers
  • Visualize and interpret clustering results in feature space
  • Recognize when K-means is appropriate and its limitations

Intuitive Introduction

Imagine you're organizing a library of books. You don't have predefined categories, but you notice books naturally group together based on their content. Mystery novels cluster near each other, science books form another group, and cookbooks gather in a third area. This is clustering in action—finding natural groupings in data without predefined labels.

K-means clustering works similarly. It finds natural groupings by iteratively assigning data points to the nearest "centroid" (the center of a cluster) and then recalculating those centroids based on the assigned points. It's like placing k pins on a map and letting gravity pull them to the centers of population density.

This algorithm is particularly powerful for customer segmentation, image compression, and anomaly detection. However, it assumes spherical clusters of similar sizes and struggles with non-linear boundaries or varying cluster densities.


Mathematical Development

K-means clustering aims to partition n observations into k clusters where each observation belongs to the cluster with the nearest mean (centroid).

The Objective Function

The algorithm minimizes the within-cluster sum of squares (WCSS):

\[J = \sum_{i=1}^{k} \sum_{x \in S_i} ||x - \mu_i||^2\]

Where:

  • \(k\) is the number of clusters
  • \(S_i\) is the set of points in cluster \(i\)
  • \(\mu_i\) is the centroid (mean) of cluster \(i\)
  • \(||x - \mu_i||^2\) is the squared Euclidean distance
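
To make the objective concrete, a minimal NumPy sketch that evaluates \(J\) for a given assignment might look like the following (the names X, labels, and centroids are placeholders); after fitting, scikit-learn reports the same quantity as the inertia_ attribute.

import numpy as np

def wcss(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its assigned centroid
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centroids))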

The Algorithm

  1. Initialization: Choose k initial centroids (randomly or using k-means++)
  2. Assignment: Assign each point to the nearest centroid
  3. Update: Recalculate centroids as the mean of points in each cluster
  4. Repeat: Steps 2-3 until convergence (centroids don't change significantly)
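
To make these steps concrete, here is a minimal NumPy sketch of the classic Lloyd iteration; it uses plain random initialization for brevity, and the function and variable names are illustrative rather than part of any library.

import numpy as np

def kmeans_naive(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by sampling k distinct data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 4: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids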

K-means++ Initialization

To avoid poor starting points, k-means++ picks the first centroid uniformly at random and then selects each subsequent centroid from the data points with probability proportional to its squared distance from the nearest centroid already chosen:

\[P(x) = \frac{D(x)^2}{\sum_{x'} D(x')^2}\]

Where \(D(x)\) is the distance to the nearest existing centroid.
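
A compact sketch of this seeding rule is shown below; scikit-learn's built-in implementation is more refined (it evaluates several candidate points at each step), so treat this as an illustration of the probability formula only.

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # D(x)^2: squared distance of each point to its nearest chosen centroid
        d2 = np.min(((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2), axis=1)
        # Sample the next centroid with probability proportional to D(x)^2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)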

Convergence

The algorithm converges when the centroids stabilize. In practice, we stop when:

  • Centroids change by less than a threshold, or
  • The maximum number of iterations is reached, or
  • No points change cluster assignment

For further reading on the mathematics of K-means:

  • Scikit-learn documentation: https://scikit-learn.org/stable/modules/clustering.html#k-means
  • Original paper: "k-means++: The Advantages of Careful Seeding" by Arthur and Vassilvitskii


Implementation Guide

Scikit-learn's KMeans class provides a robust implementation of the algorithm. Let's explore its usage:

Basic Usage

from sklearn.cluster import KMeans
import numpy as np

# Create sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Fit K-means
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

print("Cluster labels:", labels)
print("Centroids:", centroids)

Key Parameters:

  • n_clusters: Number of clusters to form

  • Default: 8
  • There is no automatic determination of k, so set this explicitly for your problem

  • init: Method for centroid initialization

  • 'k-means++' (default): Smart initialization to avoid poor local minima
  • 'random': Random initialization
  • ndarray: Pre-specified initial centroids

  • n_init: Number of times the algorithm is run with different centroid seeds; the run with the lowest inertia is kept

  • Default: 10 (recent scikit-learn releases default to 'auto')
  • Higher values reduce the chance of a poor local optimum but increase computation

  • max_iter: Maximum iterations per initialization

  • Default: 300
  • The algorithm usually converges in far fewer iterations

  • tol: Tolerance used to declare convergence

  • Default: 1e-4
  • Iteration stops once the shift in cluster centers falls below this tolerance

  • random_state: Random seed for reproducibility

  • Important for consistent results
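
As a quick illustration of how init and n_init interact, the sketch below compares 'k-means++' with purely random initialization on the toy array X from the basic example; the exact numbers will depend on the data.

from sklearn.cluster import KMeans

for init_method in ('k-means++', 'random'):
    km = KMeans(n_clusters=2, init=init_method, n_init=1,
                max_iter=300, tol=1e-4, random_state=0)
    km.fit(X)  # X as defined in the basic example above
    print(f"init={init_method}: inertia={km.inertia_:.3f}, iterations={km.n_iter_}")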

Advanced Usage

# With custom initialization (reusing X from the basic example)
initial_centroids = np.array([[0, 0], [5, 5]])
kmeans = KMeans(n_clusters=2, init=initial_centroids, n_init=1)
kmeans.fit(X)  # fit before calling predict

# Predicting cluster membership for new data
new_data = np.array([[2, 3], [3, 1]])
new_labels = kmeans.predict(new_data)
print("New labels:", new_labels)

# Getting inertia (within-cluster sum of squares)
inertia = kmeans.inertia_
print(f"Final inertia: {inertia}")

Parameter Interactions:

  • n_init and init: Use n_init=1 with custom init array
  • max_iter and tol: Balance between accuracy and speed
  • random_state: Always set for reproducible results

Practical Applications

Let's apply K-means to the classic Iris dataset to demonstrate clustering in action:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

# Load and prepare data
iris = load_iris()
X = iris.data
feature_names = iris.feature_names

# Standardize features (important for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Determine optimal k using elbow method
inertias = []
k_range = range(1, 11)

for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
plt.plot(k_range, inertias, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True)

# Silhouette analysis for k=2,3,4
silhouette_scores = []
k_values = [2, 3, 4]

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(score)

plt.subplot(1, 3, 2)
plt.bar(k_values, silhouette_scores)
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.ylim(0, 1)

# Final clustering with k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
centroids = kmeans.cluster_centers_

# Visualize first two features
plt.subplot(1, 3, 3)
scatter = plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis', alpha=0.7)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200, linewidth=3)
plt.xlabel(f'Standardized {feature_names[0]}')
plt.ylabel(f'Standardized {feature_names[1]}')
plt.title('K-means Clustering (k=3)')
plt.colorbar(scatter)

plt.tight_layout()
plt.show()

# Compare with true labels
from sklearn.metrics import adjusted_rand_score

true_labels = iris.target
ari_score = adjusted_rand_score(true_labels, labels)
print(f"Adjusted Rand Index: {ari_score:.3f}")

# Analyze cluster characteristics
for cluster in range(3):
    cluster_points = X[labels == cluster]
    print(f"\nCluster {cluster}:")
    print(f"  Size: {len(cluster_points)} points")
    print(f"  Mean sepal length: {cluster_points[:, 0].mean():.2f}")
    print(f"  Mean sepal width: {cluster_points[:, 1].mean():.2f}")
    print(f"  Mean petal length: {cluster_points[:, 2].mean():.2f}")
    print(f"  Mean petal width: {cluster_points[:, 3].mean():.2f}")

Interpreting Results:

The elbow method shows a clear "elbow" at k=3, suggesting this is the optimal number of clusters. The silhouette score peaks at k=2 but remains reasonable at k=3. The clustering achieves an Adjusted Rand Index of ~0.62, indicating moderate agreement with the true species labels.

Each cluster shows distinct characteristics:

  • Cluster 0: Small flowers (setosa-like)
  • Cluster 1: Large flowers (virginica-like)
  • Cluster 2: Medium flowers (versicolor-like)


Expert Insights

Choosing K

Elbow Method: Plot inertia vs k, look for the "elbow" where adding clusters gives diminishing returns.

Silhouette Analysis: Measures how similar points are to their cluster vs other clusters. Values > 0.5 indicate good clustering.

Domain Knowledge: Sometimes k is known from business requirements (e.g., customer segments).
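
Beyond the average silhouette score, inspecting per-sample silhouettes by cluster can reveal individually weak clusters. A brief sketch, assuming X_scaled from the Iris example and a candidate k:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

k = 3  # candidate number of clusters
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_scaled)
sample_sil = silhouette_samples(X_scaled, labels)

# Clusters whose members have low or negative silhouettes are poorly separated
for cluster in range(k):
    vals = sample_sil[labels == cluster]
    print(f"Cluster {cluster}: mean={vals.mean():.2f}, min={vals.min():.2f}")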

Limitations and Pitfalls

Assumptions: K-means assumes:

  • Spherical clusters of similar size
  • Equal variance in all directions
  • Euclidean distance is appropriate

Common Issues:

  • Sensitive to initialization (use k-means++)
  • Can converge to local optima (mitigate with a larger n_init)
  • Struggles with non-convex clusters
  • Doesn't handle categorical features well

Alternatives:

  • Gaussian Mixture Models: for elliptical clusters
  • DBSCAN: for arbitrary-shaped clusters and noise handling
  • Hierarchical Clustering: for nested cluster structures
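
For comparison, here is a minimal sketch of two of these alternatives run on the same standardized Iris data (X_scaled); the DBSCAN eps and min_samples values are illustrative and usually need tuning.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import DBSCAN

# Gaussian Mixture Model: probabilistic, allows elliptical clusters
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(X_scaled)

# DBSCAN: density-based, no k required; label -1 marks noise points
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X_scaled)

print("GMM cluster sizes:", np.bincount(gmm_labels))
print("DBSCAN labels found:", sorted(set(db_labels)))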

Scaling and Performance

Computational Complexity: O(n * k * i * d) where n=samples, k=clusters, i=iterations, d=dimensions

Scaling Tips:

  • Use mini-batch K-means for large datasets
  • Reduce dimensions with PCA first
  • Consider approximate methods for very large n
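
A short sketch of the mini-batch variant on synthetic data; the make_blobs settings are arbitrary and only meant to mimic a larger dataset.

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic "large" dataset
X_big, _ = make_blobs(n_samples=100_000, centers=5, n_features=10, random_state=42)

# Mini-batch K-means trades a little accuracy for much lower time and memory cost
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=42, n_init=10)
mbk.fit(X_big)
print("Inertia:", mbk.inertia_)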

Validation Metrics

Internal Metrics (unsupervised):

  • Inertia: Within-cluster sum of squares
  • Silhouette Score: Cluster cohesion vs. separation
  • Calinski-Harabasz Index: Ratio of between-cluster to within-cluster dispersion

External Metrics (when true labels are available):

  • Adjusted Rand Index (ARI)
  • Adjusted Mutual Information (AMI)
  • Homogeneity / Completeness / V-measure
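
A short sketch computing a few of these metrics, reusing X_scaled, labels, and true_labels from the practical example above:

from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             adjusted_rand_score, adjusted_mutual_info_score)

# Internal metrics: no ground truth required
print("Silhouette:", silhouette_score(X_scaled, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X_scaled, labels))

# External metrics: require true labels
print("ARI:", adjusted_rand_score(true_labels, labels))
print("AMI:", adjusted_mutual_info_score(true_labels, labels))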


Self-Check Questions

  1. Why is it important to standardize features before applying K-means?
  2. How does k-means++ initialization improve upon random initialization?
  3. What does a high silhouette score indicate about cluster quality?
  4. When would you choose DBSCAN over K-means?

Try This Exercise

Customer Segmentation with K-means

  1. Load a customer dataset (or create synthetic data with features like age, income, spending score)
  2. Apply K-means with different values of k (2-8)
  3. Use the elbow method and silhouette analysis to determine optimal k
  4. Visualize the clusters in 2D (use PCA if high-dimensional)
  5. Analyze the characteristics of each customer segment
  6. Compare results with different initialization methods

Expected Outcome: You'll identify distinct customer segments and understand how initialization and k selection affect clustering results.


Builder's Insight

K-means clustering represents the intersection of mathematical elegance and practical utility. Its simplicity belies its power—most clustering problems can benefit from starting with K-means as a baseline.

Remember: Clustering is often more art than science. The "right" number of clusters depends on your application. Use multiple validation techniques and domain knowledge to guide your decisions.

As you build ML systems, clustering will become a fundamental tool in your arsenal. It reveals hidden patterns in data that supervised methods can't discover. Master the mathematics, understand the limitations, and you'll find clustering invaluable for exploratory data analysis and feature engineering.

The key insight: Sometimes the most valuable discoveries come not from predicting known outcomes, but from uncovering unknown structures in your data.