Chapter 12: Hierarchical Clustering¶
"Hierarchies reveal not just groups, but the relationships between groups—the family tree of data."
Learning Objectives¶
By the end of this chapter, you will be able to:
- Understand the principles of hierarchical clustering and different linkage methods
- Interpret dendrograms and determine optimal cluster numbers
- Implement agglomerative clustering in scikit-learn with appropriate parameters
- Compare different linkage methods and their effects on clustering results
- Apply hierarchical clustering to real datasets and visualize cluster hierarchies
Intuitive Introduction¶
Imagine you're organizing a family reunion. You start by grouping immediate family members, then combine families into larger clans, and finally unite clans into the entire extended family. This hierarchical approach creates a "family tree" showing relationships at different levels.
Hierarchical clustering works similarly. It builds a hierarchy of clusters in one of two ways:
- Agglomerative (bottom-up): Start with individual points and merge them into larger clusters
- Divisive (top-down): Start with one big cluster and split it into smaller ones
The result is a tree-like structure called a dendrogram that shows how clusters are nested within each other. This approach doesn't require you to specify the number of clusters upfront and provides rich information about cluster relationships.
Mathematical Development¶
Hierarchical clustering creates a hierarchy of clusters by iteratively merging or splitting clusters based on their similarity.
Distance Metrics¶
The foundation of hierarchical clustering is measuring distance between clusters. Common distance metrics include:
Single Linkage (Minimum): Distance between the closest points in different clusters:
\[ d(C_i, C_j) = \min_{x \in C_i,\ y \in C_j} \|x - y\| \]
Complete Linkage (Maximum): Distance between the farthest points in different clusters:
\[ d(C_i, C_j) = \max_{x \in C_i,\ y \in C_j} \|x - y\| \]
Average Linkage: Average distance over all pairs of points from the two clusters:
\[ d(C_i, C_j) = \frac{1}{|C_i| \cdot |C_j|} \sum_{x \in C_i} \sum_{y \in C_j} \|x - y\| \]
Centroid Linkage: Distance between the cluster centroids:
\[ d(C_i, C_j) = \|\mu_i - \mu_j\| \]
Where \(\mu_i\) is the centroid (mean) of cluster \(C_i\).
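To make these formulas concrete, the following minimal sketch computes each linkage distance for two small, made-up clusters using NumPy and SciPy's cdist:
import numpy as np
from scipy.spatial.distance import cdist

# Two small example clusters (made-up points)
C_i = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2]])
C_j = np.array([[5.0, 8.0], [5.5, 7.5]])

# Pairwise Euclidean distances between every point in C_i and every point in C_j
D = cdist(C_i, C_j)

single = D.min()        # single linkage: closest pair
complete = D.max()      # complete linkage: farthest pair
average = D.mean()      # average linkage: mean over all pairs
centroid = np.linalg.norm(C_i.mean(axis=0) - C_j.mean(axis=0))  # centroid linkage

print(f"single: {single:.3f}, complete: {complete:.3f}, "
      f"average: {average:.3f}, centroid: {centroid:.3f}")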
The Agglomerative Algorithm¶
- Start with each data point as its own cluster
- Find the two closest clusters
- Merge them into a single cluster
- Update the distance matrix
- Repeat until only one cluster remains
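The sketch below walks through these five steps naively with single linkage (the six sample points are arbitrary); library implementations such as SciPy's linkage use far more efficient algorithms, but the logic is the same.
import numpy as np

def naive_agglomerative(X, n_clusters=1):
    """Unoptimized single-linkage agglomerative clustering."""
    # Step 1: start with each point in its own cluster
    clusters = [[i] for i in range(len(X))]
    merge_history = []
    while len(clusters) > n_clusters:
        # Step 2: find the two closest clusters (single linkage)
        best_a, best_b, best_d = None, None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_a, best_b, best_d = a, b, d
        # Steps 3-4: merge them, record the merge, and implicitly update distances
        merge_history.append((clusters[best_a], clusters[best_b], best_d))
        clusters[best_a] = clusters[best_a] + clusters[best_b]
        del clusters[best_b]
        # Step 5: the loop repeats until n_clusters remain (1 = full hierarchy)
    return clusters, merge_history

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)
final_clusters, merges = naive_agglomerative(X, n_clusters=2)
print("Final clusters (as point indices):", final_clusters)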
Dendrograms¶
A dendrogram is a tree diagram showing the hierarchical relationships:
- Height: Represents the distance at which clusters are merged
- Leaves: Individual data points
- Branches: Show the cluster merging history
The y-axis represents the linkage distance, helping determine where to "cut" the tree for optimal clustering.
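As a minimal sketch of this workflow, the following builds a hierarchy on a small synthetic dataset with SciPy, draws the dendrogram, and then cuts the tree at a chosen height with fcluster (the cut height of 2.0 is an arbitrary choice for this toy data):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Small synthetic dataset: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

# Build the full hierarchy with average linkage
Z = linkage(X, method='average')

# Draw the dendrogram; the y-axis is the linkage distance
dendrogram(Z)
plt.xlabel('Sample Index')
plt.ylabel('Linkage Distance')
plt.show()

# "Cut" the tree at a height of 2.0 to obtain flat cluster labels
labels = fcluster(Z, t=2.0, criterion='distance')
print("Clusters found at this cut:", np.unique(labels).size)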
For further reading on hierarchical clustering:
- Scikit-learn documentation: https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
- Wikipedia: Hierarchical clustering
Implementation Guide¶
Scikit-learn provides AgglomerativeClustering for agglomerative (bottom-up) hierarchical clustering. Let's explore its usage:
Basic Usage¶
from sklearn.cluster import AgglomerativeClustering
import numpy as np
# Sample data
X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])
# Perform clustering
clustering = AgglomerativeClustering(n_clusters=2)
labels = clustering.fit_predict(X)
print("Cluster labels:", labels)
Key Parameters:
- n_clusters: Number of clusters to find. Default: 2. If set to None, distance_threshold must be provided and the full hierarchy is built.
- linkage: Linkage criterion.
  - 'ward': Minimize the variance of merged clusters (default; requires Euclidean distances)
  - 'complete': Complete linkage (maximum distance)
  - 'average': Average linkage
  - 'single': Single linkage (minimum distance)
- metric (named affinity before scikit-learn 1.2): Distance metric. One of 'euclidean' (default), 'l1', 'l2', 'manhattan', 'cosine', or 'precomputed'. Must be 'euclidean' when linkage='ward'.
- distance_threshold: Linkage distance at or above which clusters are not merged. An alternative to n_clusters; the two are mutually exclusive.
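As a small usage sketch of how these parameters combine (assuming scikit-learn 1.2 or newer, where the distance parameter is named metric; older versions use affinity), the example below pairs average linkage with the Manhattan distance:
from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]])

# Average linkage with Manhattan (L1) distances; Ward would reject a non-Euclidean metric
clustering = AgglomerativeClustering(
    n_clusters=2,
    linkage='average',
    metric='manhattan',  # use affinity='manhattan' on scikit-learn < 1.2
)
labels = clustering.fit_predict(X)
print("Cluster labels:", labels)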
Advanced Usage¶
# Using distance threshold instead of n_clusters
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.5,
    linkage='complete'
)
labels = clustering.fit_predict(X)
# Get number of clusters found
n_clusters_found = clustering.n_clusters_
print(f"Number of clusters found: {n_clusters_found}")
Parameter Interactions:
- linkage='ward' requires Euclidean distances (metric='euclidean', or affinity='euclidean' on older scikit-learn versions)
- distance_threshold and n_clusters are mutually exclusive: set n_clusters=None when using a threshold
- Ward linkage tends to create clusters of roughly equal size
Practical Applications¶
Let's apply hierarchical clustering to the Wine dataset to demonstrate the technique:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import dendrogram, linkage
# Load and prepare data
wine = load_wine()
X = wine.data
y_true = wine.target
feature_names = wine.feature_names
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Perform hierarchical clustering with different linkage methods
linkage_methods = ['single', 'complete', 'average', 'ward']
results = {}
for method in linkage_methods:
    clustering = AgglomerativeClustering(
        n_clusters=3,  # We know there are 3 wine classes
        linkage=method
    )
    labels = clustering.fit_predict(X_scaled)
    ari = adjusted_rand_score(y_true, labels)
    results[method] = {'labels': labels, 'ari': ari}
# Create linkage matrices for dendrograms
linkage_matrices = {}
for method in linkage_methods:
    if method == 'ward':
        # Ward requires Euclidean distance
        linkage_matrices[method] = linkage(X_scaled, method=method, metric='euclidean')
    else:
        linkage_matrices[method] = linkage(X_scaled, method=method)
# Plot dendrograms and compare results
plt.figure(figsize=(20, 10))
for i, method in enumerate(linkage_methods):
    # Dendrogram (top row)
    plt.subplot(2, 4, i + 1)
    dendrogram(linkage_matrices[method], truncate_mode='level', p=3)
    plt.title(f'{method.capitalize()} Linkage Dendrogram')
    plt.xlabel('Sample Index')
    plt.ylabel('Distance')

    # Cluster assignments (bottom row)
    plt.subplot(2, 4, i + 5)
    labels = results[method]['labels']
    ari = results[method]['ari']

    # Scatter plot of the first two standardized features colored by cluster
    scatter = plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis', alpha=0.7)
    plt.xlabel(f'Standardized {feature_names[0]}')
    plt.ylabel(f'Standardized {feature_names[1]}')
    plt.title(f'{method.capitalize()} Clustering\nARI: {ari:.3f}')
    plt.colorbar(scatter)

plt.tight_layout()
plt.show()
# Print detailed results
print("Clustering Performance Comparison:")
print("-" * 40)
for method, result in results.items():
    ari = result['ari']
    print(f"{method.capitalize():>10}: ARI = {ari:.3f}")
# Analyze cluster characteristics for best method (Ward)
best_method = max(results.keys(), key=lambda x: results[x]['ari'])
best_labels = results[best_method]['labels']
print(f"\nBest Method: {best_method.capitalize()}")
print("Cluster Analysis:")
for cluster in range(3):
    cluster_points = X[best_labels == cluster]
    print(f"\nCluster {cluster}: {len(cluster_points)} samples")
    print(f"  Mean alcohol: {cluster_points[:, 0].mean():.2f}")
    print(f"  Mean malic acid: {cluster_points[:, 1].mean():.2f}")
    print(f"  Mean ash: {cluster_points[:, 2].mean():.2f}")
Interpreting Results:
The analysis shows that different linkage methods produce varying results:
- Ward linkage typically performs best on this dataset (highest ARI score)
- Single linkage creates "stringy" clusters and is sensitive to noise
- Complete linkage creates compact clusters of similar diameter
- Average linkage provides balanced performance
The dendrograms reveal the hierarchical structure, with cutting at different heights producing different cluster numbers.
Expert Insights¶
Choosing Linkage Methods¶
Single Linkage:
- Pros: Handles non-spherical clusters; good for detecting outliers
- Cons: Sensitive to noise; produces a "chaining" effect
- Use when: Data has arbitrarily shaped clusters

Complete Linkage:
- Pros: Creates compact clusters; less prone to chaining
- Cons: Can break large clusters; sensitive to outliers within clusters
- Use when: You want tight, roughly spherical clusters

Average Linkage:
- Pros: Balanced behavior; less sensitive to outliers than single or complete linkage
- Cons: Can be computationally expensive
- Use when: General-purpose clustering

Ward Linkage:
- Pros: Tends to create clusters of similar size; works well with Euclidean distance
- Cons: Assumes roughly spherical clusters; requires the Euclidean metric
- Use when: You want balanced cluster sizes
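To see these trade-offs concretely, here is a small sketch comparing single and Ward linkage on scikit-learn's two-moons dataset, whose crescent-shaped clusters are decidedly non-spherical; the exact ARI values will vary with the noise level chosen here.
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two crescent-shaped (non-spherical) clusters with a little noise
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

for method in ['single', 'ward']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X_moons)
    ari = adjusted_rand_score(y_moons, labels)
    print(f"{method:>6} linkage: ARI = {ari:.3f}")

# Single linkage typically recovers the crescents by "chaining" along each arc,
# while Ward tends to split them into compact halves.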
Dendrogram Interpretation¶
Cutting the Tree:
- Horizontal cuts: Determine cluster membership
- Height of a cut: Indicates how dissimilar the merged clusters are
- Branch lengths: Show within-cluster versus between-cluster distances

Optimal Cutting:
- Look for large gaps in linkage distances
- Consider domain knowledge about the expected number of clusters
- Use silhouette analysis to validate cuts (see the sketch below)
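A minimal sketch of the silhouette check mentioned above, using the standardized Wine features and cutting the same Ward hierarchy at several candidate cluster counts:
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_wine
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardized Wine features, as in the earlier example
X_scaled = StandardScaler().fit_transform(load_wine().data)

# Build the hierarchy once, then evaluate several candidate cuts
Z = linkage(X_scaled, method='ward')
for k in range(2, 7):
    labels = fcluster(Z, t=k, criterion='maxclust')  # cut the tree into exactly k clusters
    score = silhouette_score(X_scaled, labels)
    print(f"k = {k}: silhouette = {score:.3f}")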
Computational Considerations¶
- Complexity: O(n²) memory and O(n³) time for the full hierarchy (naive implementation)
- Advantages: No need to specify k upfront; rich hierarchical information
- Limitations: Doesn't scale well to large datasets (roughly >10,000 samples)
- Solutions: Cluster a sample of the data, or add a connectivity constraint such as a k-nearest-neighbors graph to restrict candidate merges (see the sketch below)
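One practical scaling sketch: restrict merges to a sparse k-nearest-neighbors connectivity graph, which AgglomerativeClustering accepts through its connectivity parameter (the dataset, neighbor count, and cluster count below are illustrative choices):
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

# A moderately large synthetic dataset
X_big, _ = make_blobs(n_samples=5000, centers=5, random_state=42)

# Sparse connectivity graph: each point may only merge with its 10 nearest neighbors
connectivity = kneighbors_graph(X_big, n_neighbors=10, include_self=False)

clustering = AgglomerativeClustering(
    n_clusters=5,
    linkage='ward',
    connectivity=connectivity,
)
labels = clustering.fit_predict(X_big)
print("Cluster sizes:", np.bincount(labels))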
When to Use Hierarchical Clustering¶
Advantages:
- No need to specify the number of clusters upfront
- Provides hierarchical relationships between clusters
- Deterministic results (no random initialization)
- Works with any distance metric

Disadvantages:
- Cannot "unmerge" clusters once they are joined
- Computationally expensive for large datasets
- Sensitive to the choice of linkage method

Alternatives (compared in the sketch below):
- K-means: Faster and scales better, but requires k to be specified
- DBSCAN: Handles noise and arbitrary shapes, no k needed
- Gaussian Mixture Models: Probabilistic approach with soft assignments
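For a rough side-by-side comparison of these alternatives on the standardized Wine data, the sketch below reports the ARI of each method against the known varieties; note that DBSCAN is very sensitive to eps, and the value used here is only a starting point.
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

wine = load_wine()
X_scaled = StandardScaler().fit_transform(wine.data)

models = {
    'Agglomerative (ward)': AgglomerativeClustering(n_clusters=3, linkage='ward'),
    'K-means': KMeans(n_clusters=3, n_init=10, random_state=42),
    'DBSCAN': DBSCAN(eps=2.3, min_samples=5),  # eps typically needs tuning per dataset
}

for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    ari = adjusted_rand_score(wine.target, labels)
    print(f"{name:>22}: ARI = {ari:.3f}")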
Self-Check Questions¶
- What is the difference between agglomerative and divisive hierarchical clustering?
- Why does Ward linkage require Euclidean distance?
- How do you determine the optimal number of clusters from a dendrogram?
- When would you choose single linkage over complete linkage?
Try This Exercise¶
Hierarchical Clustering on Customer Data
- Create or load customer data with features like age, income, spending score, and recency
- Apply hierarchical clustering with different linkage methods (single, complete, average, ward)
- Visualize dendrograms for each method
- Cut the dendrograms at different heights to create 3-5 clusters
- Compare cluster characteristics across linkage methods
- Analyze which linkage method creates the most interpretable customer segments
Expected Outcome: You'll understand how linkage methods affect cluster formation and learn to interpret dendrograms for business insights.
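If you don't have customer data handy, one way to get started is to simulate a few segments yourself; every column name and distribution parameter below is invented purely for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

def make_segment(n, age, income, spending, recency):
    """One synthetic customer segment (all parameters are hypothetical)."""
    return pd.DataFrame({
        'age': rng.normal(age, 5, n),
        'income': rng.normal(income, 8_000, n),
        'spending_score': rng.normal(spending, 10, n),
        'recency_days': rng.normal(recency, 7, n),
    })

customers = pd.concat([
    make_segment(100, age=28, income=45_000, spending=75, recency=10),
    make_segment(100, age=45, income=90_000, spending=40, recency=30),
    make_segment(100, age=62, income=60_000, spending=20, recency=60),
], ignore_index=True)

print(customers.describe().round(1))
# From here, standardize the features and follow steps 2-6 above.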
Builder's Insight¶
Hierarchical clustering offers a unique perspective on data relationships. While other clustering methods give you flat partitions, hierarchical methods reveal the nested structure of your data—the "Russian dolls" of groupings within groupings.
This approach is particularly valuable when you need to understand not just what groups exist, but how those groups relate to each other. The dendrogram becomes a map of your data's natural hierarchy, guiding decisions about granularity and relationships.
As you build clustering systems, remember that the "right" linkage method depends on your data's structure and your analytical goals. Ward linkage often works well for business data, while single linkage excels at finding unusual patterns. The key is understanding these trade-offs and choosing the method that best serves your analytical narrative.
Mastering hierarchical clustering adds depth to your analytical toolkit, enabling you to uncover not just clusters, but the stories they tell about your data's underlying structure.