Chapter 8: Understanding CNN Layers¶
“Every filter is a lens. Every layer is a language. A CNN doesn’t just see—it interprets.”
Why This Chapter Matters¶
A Convolutional Neural Network is more than a stack of layers—it’s a hierarchy of abstractions. With each convolution, pooling, and activation, your model goes from low-level pixels to high-level semantics:
- Edges → Textures → Shapes → Objects
But to design effective CNNs (and debug them), you need to understand how each layer transforms the input.
This chapter walks you through:
- What each major CNN layer does
- How it changes shape, depth, and meaning
- How to implement and visualize these layers in PyTorch and TensorFlow
You’ll finally understand why a 224×224×3 image turns into a 7×7×512 feature map.
Conceptual Breakdown¶
🔹 The Core CNN Layer Types¶
Layer | Function |
---|---|
Conv2D | Applies a filter/kernel over spatial regions |
Activation (ReLU) | Adds non-linearity so the network can learn complex patterns |
BatchNorm | Normalizes activations to stabilize training |
Pooling | Reduces spatial size while keeping key features |
Dropout | Prevents overfitting by randomly dropping activations |
Fully Connected | Maps final features to output classes |
🔹 Convolution Layer: Conv2D¶
- Uses a kernel (e.g., 3×3) that slides across the image
- Performs element-wise multiplications and adds up the result
- Outputs a feature map
📌 A convolution layer doesn’t see the entire image—it sees a window. As we stack layers, the receptive field grows.
Key parameters:
- in_channels: number of input feature channels
- out_channels: number of filters (i.e., output channels)
- kernel_size: size of each filter (e.g., 3×3)
- stride: how much the filter moves per step
- padding: how edges are handled (valid vs same)
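To get a concrete feel for these parameters, here is a minimal PyTorch sketch; the channel counts and image size are illustrative, not prescriptive:

```python
import torch
import torch.nn as nn

# A 3×3 convolution over a 3-channel (RGB) input; padding=1 keeps the spatial size.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

x = torch.randn(1, 3, 224, 224)   # [batch, channels, height, width]
y = conv(x)
print(y.shape)                    # torch.Size([1, 16, 224, 224])
```

Each of the 16 filters produces one output channel, which is why out_channels equals the depth of the resulting feature map.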
🔹 Pooling Layer: MaxPool2D, AvgPool2D¶
- Downsamples feature maps (e.g., from 32×32 → 16×16)
- Keeps strongest signals (MaxPooling) or averages regions (AvgPooling)
- Reduces computation and helps detect patterns invariant to position
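A quick sketch of both pooling variants; the feature-map size here is arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)   # a batch of 16-channel feature maps

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x).shape)  # torch.Size([1, 16, 16, 16]) - strongest value in each 2×2 window
print(avg_pool(x).shape)  # torch.Size([1, 16, 16, 16]) - average of each 2×2 window
```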
🔹 Batch Normalization¶
- Normalizes the output of a layer to roughly zero mean and unit variance (per channel)
- Stabilizes training and allows for higher learning rates
- Typically applied after convolution, before activation
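A small sanity-check sketch, assuming PyTorch's nn.BatchNorm2d in training mode: after normalization, each channel's batch statistics should sit near zero mean and unit variance.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(num_features=16)    # one mean/variance (plus learnable scale/shift) per channel
x = torch.randn(8, 16, 32, 32) * 5 + 3  # activations with non-zero mean and inflated variance

y = bn(x)                               # training mode: normalizes using batch statistics
print(y.mean(dim=(0, 2, 3)))            # per-channel means, all close to 0
print(y.var(dim=(0, 2, 3)))             # per-channel variances, all close to 1
```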
🔹 Activation Functions: ReLU and Beyond¶
Activation | Formula | Purpose |
---|---|---|
ReLU | max(0, x) | Introduces non-linearity |
Leaky ReLU | max(αx, x) | Keeps small negative slope |
Sigmoid | 1 / (1 + e^-x) | Squeezes to [0, 1] |
📌 Most modern CNNs use ReLU for its simplicity and efficiency.
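A short comparison of the three activations on a toy tensor (the values are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

print(nn.ReLU()(x))                          # tensor([0.0000, 0.0000, 0.0000, 1.5000])
print(nn.LeakyReLU(negative_slope=0.01)(x))  # tensor([-0.0200, -0.0050, 0.0000, 1.5000])
print(nn.Sigmoid()(x))                       # all values squeezed into (0, 1)
```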
🔹 Fully Connected (Dense) Layers¶
After several convolution + pooling blocks, the feature map is flattened into a vector and passed through one or more Linear (PyTorch) or Dense (TF) layers.
- Used to classify based on the features extracted earlier
- Last layer’s size = number of classes
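As an example, the 7×7×512 feature map mentioned at the start of this chapter would be flattened and classified roughly like this (the 1000-class output is an assumption for illustration):

```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 512, 7, 7)   # final feature map of a deep CNN

classifier = nn.Sequential(
    nn.Flatten(),                  # [1, 512, 7, 7] → [1, 512*7*7] = [1, 25088]
    nn.Linear(512 * 7 * 7, 1000),  # map the 25088 features to 1000 class scores
)
print(classifier(feature_map).shape)  # torch.Size([1, 1000])
```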
PyTorch Implementation¶
Let’s build a simple Conv → ReLU → Pool block:
import torch.nn as nn
cnn_block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1),  # [B, 3, 224, 224] → [B, 16, 224, 224]
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2)  # [B, 16, 224, 224] → [B, 16, 112, 112]
)
A full model:
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 10)  # assuming input was 224x224
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x)
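A quick sanity check for the model above: feed a dummy 224×224 image and confirm the output has one score per class.

```python
import torch

model = SimpleCNN()
dummy = torch.randn(1, 3, 224, 224)   # a batch of one RGB image
logits = model(dummy)
print(logits.shape)                   # torch.Size([1, 10]) - one logit per class
```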
TensorFlow Implementation¶
from tensorflow.keras import layers, models
model = models.Sequential([
    layers.Conv2D(16, (3, 3), padding='same', input_shape=(224, 224, 3)),
    layers.BatchNormalization(),
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Conv2D(32, (3, 3), padding='same'),
    layers.ReLU(),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Flatten(),
    layers.Dense(10)
])
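The same sanity check on the Keras side, reusing the model defined above:

```python
import tensorflow as tf

dummy = tf.random.normal((1, 224, 224, 3))   # channels-last dummy image
print(model(dummy).shape)                    # (1, 10)
model.summary()                              # lists the output shape of every layer
```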
How Shapes Change¶
Operation | PyTorch Shape Change | TensorFlow Shape Change |
---|---|---|
Conv2D ('same' padding) | [B, C_in, H, W] → [B, C_out, H, W] | [B, H, W, C_in] → [B, H, W, C_out] |
MaxPool2D (2×2) | [B, C, H, W] → [B, C, H/2, W/2] | [B, H, W, C] → [B, H/2, W/2, C] |
Flatten | [B, C, H, W] → [B, C×H×W] | [B, H, W, C] → [B, H×W×C] |
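To predict these sizes without running the model, the standard rule is output = (input + 2 × padding - kernel) / stride + 1, rounded down. A tiny helper (the name conv_output_size is just a convenience for this chapter) makes the arithmetic explicit:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (dilation ignored)."""
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(224, kernel_size=3, stride=1, padding=1))  # 224 ('same'-style padding)
print(conv_output_size(224, kernel_size=2, stride=2))             # 112 (2×2 max pooling)
```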
Framework Comparison Table¶
Layer | PyTorch | TensorFlow |
---|---|---|
Convolution | nn.Conv2d(in, out, k) | layers.Conv2D(filters, k) |
Pooling | nn.MaxPool2d(k) | layers.MaxPooling2D(k) |
BatchNorm | nn.BatchNorm2d(channels) | layers.BatchNormalization() |
Activation (ReLU) | nn.ReLU() or F.relu() | layers.ReLU() or inline |
Fully Connected | nn.Linear(in, out) | layers.Dense(units) |
Flatten | nn.Flatten() | layers.Flatten() |
Mini-Exercise¶
Build a mini CNN with:
- 2 Conv2D layers
- ReLU and MaxPooling after each
- Flatten + Dense to output 10 classes
- Feed a dummy input of shape [1, 3, 224, 224] (PyTorch) or [1, 224, 224, 3] (TF)
- Print the shape after each layer
- Try replacing ReLU with LeakyReLU and observe the differences
Bonus: Visualize the first convolutional layer filters (we’ll expand this in Chapter 17!)
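If you get stuck on the shape-printing step, here is one possible PyTorch starting point (the layer widths mirror SimpleCNN above, but any reasonable choice works):

```python
import torch
import torch.nn as nn

layers_seq = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(32 * 56 * 56, 10),
)

x = torch.randn(1, 3, 224, 224)
for layer in layers_seq:
    x = layer(x)                  # run each layer in turn
    print(f"{layer.__class__.__name__:12s} -> {tuple(x.shape)}")
```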