Chapter 9: The CNN Vocabulary (Terms Demystified)
“Before you build deep networks, build deep understanding. Words like kernel, stride, and feature map aren’t just jargon—they’re the gears of a vision engine.”
Why This Chapter Matters
If you’ve ever wondered:
- “What exactly is a kernel?”
- “How do channels differ from filters?”
- “Why does stride affect output shape?”
- “What’s the difference between padding types?”
… then this chapter is for you.
A clear understanding of these terms helps you:
- Design architectures confidently
- Avoid shape mismatch bugs
- Communicate ideas and debug issues quickly
- Understand pretrained model behavior
Conceptual Breakdown
Let’s define and visually ground each essential CNN term.
🔹 Kernel (a.k.a. Filter)
What it is: A small matrix (e.g., 3×3 or 5×5) that slides across the image, performing local dot products.
- Each kernel learns to detect a pattern (e.g., edge, curve, texture)
- A Conv2D layer contains many kernels—one per output channel
Think of a kernel as the "eye" scanning a small area.
Size | Meaning |
---|---|
1×1 | Mixes channels at each pixel (projection across channels, no spatial context) |
3×3 | Local feature extraction |
5×5 | More context, costlier |
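To make "slides across the image, performing local dot products" concrete, here is a minimal pure-Python sketch. The function name `conv2d_valid`, the toy image, and the hand-written vertical-edge kernel are all assumptions for illustration; a real layer learns its kernel values during training. Like deep learning frameworks, the sketch does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and collect local dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the patch whose top-left corner is (i, j)
            total = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical 4x5 image: dark (0) on the left, bright (1) on the right
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]

# Hand-written vertical-edge kernel (illustrative, not learned)
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0]]
```

The response is strongest (3) where the window straddles the dark-to-bright edge and zero where the patch is uniformly bright, which is the sense in which a kernel "detects" a pattern.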
🔹 Stride
What it is: The number of pixels the kernel moves each time.
- Stride = 1 → the kernel moves one pixel at a time, so neighboring windows overlap heavily
- Stride = 2 → the kernel moves two pixels at a time, roughly halving the output height and width
Stride controls spatial resolution of the output.
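The effect of stride on output shape follows from the standard output-size formula, floor((n + 2p - k) / s) + 1, for input size n, kernel size k, padding p, and stride s (dilation assumed to be 1). The helper name `conv_output_size` below is an assumption for illustration, not a framework function.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension (dilation assumed to be 1)."""
    return (n + 2 * padding - kernel_size) // stride + 1


# 32-pixel input, 3x3 kernel
print(conv_output_size(32, kernel_size=3, stride=1, padding=1))  # 32: resolution preserved
print(conv_output_size(32, kernel_size=3, stride=2, padding=1))  # 16: resolution halved
print(conv_output_size(32, kernel_size=3, stride=1, padding=0))  # 30: output shrinks without padding
```

Applying the function once for height and once for width gives the full spatial shape, which is a quick way to check a layer stack on paper before running it.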
🔹 Padding
What it is: How we handle the edges of the image.
Type | Description |
---|---|
Valid | No padding (output shrinks) |
Same | Pads so output shape matches input (if stride=1) |
Custom | Manually pad with specific values |
📌 In PyTorch: padding=1 for a 3×3 kernel (stride 1) maintains shape
📌 In TensorFlow: use padding='same' or 'valid'
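As a rough illustration of what zero padding does to the edges, the sketch below pads a tiny grid by hand. The helper `zero_pad` and the 3×3 input are assumptions for illustration; frameworks do this internally when you pass `padding=1` or `padding='same'`.

```python
def zero_pad(image, pad):
    """Surround a 2D grid (list of rows) with `pad` rows/columns of zeros."""
    width = len(image[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]          # top border
    for row in image:
        padded.append([0] * pad + list(row) + [0] * pad)  # left/right border
    padded.extend([0] * width for _ in range(pad))      # bottom border
    return padded


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

padded = zero_pad(image, pad=1)
for row in padded:
    print(row)

# A 3x3 kernel with no further padding fits (5 - 3 + 1) = 3 times along each axis,
# so the output is 3x3 again: the "same" behavior for stride 1.
out_size = len(padded) - 3 + 1
print(out_size)  # 3
```

Without the padding step, the same 3×3 kernel would fit only once on the 3×3 input, which is the "valid" case where the output shrinks.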
🔹 Input/Output Channels
Input Channels: Number of channels in the incoming tensor
Output Channels: Number of filters (each filter produces one output channel)
Layer | Input Shape | Output Shape |
---|---|---|
Conv2D | [B, 3, H, W] (RGB) | [B, 64, H, W] (64 filters) |
Every output channel corresponds to one filter, which spans all input channels and sums their responses into a single map.
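The relationship between channels and filters shows up directly in the weight tensor. The sketch below computes the weight shape and parameter count for the RGB → 64-filter layer in the table, assuming a 3×3 kernel (the table does not state the kernel size). The helper names are hypothetical; the (out_channels, in_channels, kH, kW) layout is the one PyTorch uses for `nn.Conv2d` weights.

```python
def conv2d_weight_shape(in_channels, out_channels, kernel_size):
    """Weight layout used by PyTorch's nn.Conv2d (groups assumed to be 1)."""
    return (out_channels, in_channels, kernel_size, kernel_size)


def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Learnable parameters: one k x k slice per (filter, input channel) pair, plus biases."""
    weights = out_channels * in_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


# Assumed example: RGB input, 64 filters, 3x3 kernels
print(conv2d_weight_shape(3, 64, 3))  # (64, 3, 3, 3)
print(conv2d_param_count(3, 64, 3))   # 1792 = 64 * 3 * 3 * 3 + 64
```

Note that H and W do not appear anywhere: the parameter count depends only on channel counts and kernel size, not on image resolution.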
🔹 Feature Maps
What it is: The output of a convolution layer—a 2D activation map showing how strongly a feature was detected in different regions.
- Early layers: feature maps detect edges, corners
- Deeper layers: feature maps detect eyes, wheels, textures
📌 Feature maps = filtered views of the image.
🔹 Receptive Field
What it is: The effective area of the original input that a neuron “sees.”
- Grows with depth
- A neuron in a deep layer might “see” the entire image
A large receptive field = global understanding
A small receptive field = local detail
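"Grows with depth" can be made quantitative with the usual recurrence: each layer adds (kernel_size - 1) × jump to the receptive field, where jump is the product of the strides of all earlier layers. The function name `receptive_field` and the layer stacks below are assumptions for illustration; dilation is assumed to be 1.

```python
def receptive_field(layers):
    """Receptive field of one output unit.

    `layers` is a list of (kernel_size, stride) tuples in input-to-output order.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between neighboring units of the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Three stacked 3x3 convolutions with stride 1
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7

# 3x3 conv, 2x2 max pool with stride 2, then another 3x3 conv
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

The second stack reaches a larger receptive field with fewer convolutions because the stride-2 pooling doubles how far each later kernel step reaches in input pixels.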
🔹 Channel Depth vs Spatial Dimensions
Property | Meaning |
---|---|
Spatial size | Height × Width (resolution) |
Depth | Number of feature channels |
Example: [32, 128, 128] = 32 channels (one feature map per filter), each at 128×128 resolution
🔹 Layer Variants
Term | Meaning |
---|---|
ReflectionPad2d | Pads by mirroring the image at the edge (used in style transfer) |
InstanceNorm2d | Like BatchNorm, but per image-instance (used in image generation tasks) |
AdaptiveAvgPool2d | Automatically resizes output to fixed size regardless of input size |
These are powerful tools when building style-transfer models, GANs, or segmentation models.
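To see why adaptive pooling to 1×1 yields a fixed-size output regardless of input size, the sketch below averages each channel over its full spatial extent, which is what `AdaptiveAvgPool2d((1, 1))` amounts to. The function `global_average_pool` and both toy inputs are assumptions for illustration.

```python
def global_average_pool(feature_maps):
    """Average each channel over all spatial positions.

    `feature_maps` is a list of C channels, each a list of rows.
    Returns a list of C floats, whatever the height and width were.
    """
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two hypothetical inputs with 2 channels each but different spatial sizes
small = [[[1, 3], [5, 7]], [[0, 0], [0, 4]]]                             # 2 x 2 x 2
large = [[[2] * 4 for _ in range(3)], [[1, 2, 3, 6] for _ in range(3)]]  # 2 x 3 x 4

print(global_average_pool(small))  # [4.0, 1.0]
print(global_average_pool(large))  # [2.0, 3.0]
```

Both results have length 2, one value per channel, so a following fully connected layer sees the same input size in either case.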
PyTorch Examples
import torch.nn as nn
# 3x3 conv, same output shape
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
# Pooling
pool = nn.MaxPool2d(kernel_size=2, stride=2)
# Adaptive pooling to 1×1 (useful before a Linear layer)
adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))
# Reflection padding (e.g., style transfer)
pad = nn.ReflectionPad2d(2)
# Instance normalization (used in generator networks)
norm = nn.InstanceNorm2d(16)
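To check how these layers change tensor shapes, the following sketch pushes a random tensor through them and prints each shape. The 32×32 input size, the order in which the layers are chained, and the variable names are assumptions for illustration, not a recommended architecture.

```python
import torch
import torch.nn as nn

# Hypothetical batch: one RGB image at 32x32, laid out as [B, C, H, W]
x = torch.randn(1, 3, 32, 32)

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
norm = nn.InstanceNorm2d(16)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))
pad = nn.ReflectionPad2d(2)

features = norm(conv(x))    # channels 3 -> 16, spatial size kept by padding=1
pooled = pool(features)     # spatial size halved
vector = torch.flatten(adaptive_pool(pooled), start_dim=1)  # one value per channel
padded = pad(x)             # 2 mirrored pixels added on every side

print(features.shape)  # torch.Size([1, 16, 32, 32])
print(pooled.shape)    # torch.Size([1, 16, 16, 16])
print(vector.shape)    # torch.Size([1, 16])
print(padded.shape)    # torch.Size([1, 3, 36, 36])
```

Changing the input to, say, 64×64 changes every intermediate shape except `vector`, which stays at [1, 16] because of the adaptive pooling.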
TensorFlow Examples
import tensorflow as tf
from tensorflow.keras import layers
# Conv with SAME padding
conv = layers.Conv2D(16, kernel_size=3, padding='same')
# Max Pooling
pool = layers.MaxPooling2D(pool_size=(2, 2), strides=2)
# Global average pooling (the counterpart of AdaptiveAvgPool2d((1, 1)))
adaptive = layers.GlobalAveragePooling2D()
# Reflection padding: must be done manually (NHWC layout, pads height and width by 2)
input_tensor = tf.zeros([1, 32, 32, 3])  # placeholder input, assumed for illustration
padded = tf.pad(input_tensor, [[0, 0], [2, 2], [2, 2], [0, 0]], mode='REFLECT')
# Instance norm: use the tensorflow_addons package or a custom layer
Framework Comparison Table
Concept | PyTorch | TensorFlow |
---|---|---|
Conv2D | nn.Conv2d(in, out, k) | layers.Conv2D(out, k, padding=...) |
Padding (same) | padding=1 (for 3×3) | padding='same' |
Adaptive pooling | AdaptiveAvgPool2d(output_size) | GlobalAveragePooling2D() (1×1 output only) |
InstanceNorm | nn.InstanceNorm2d(num_features) | Addons/custom implementation |
Reflection padding | nn.ReflectionPad2d(pad) | tf.pad(..., mode='REFLECT') |
Mini-Exercise
Choose an image and:
- Manually implement:
  - A 3×3 Conv2D with stride 1 and padding 1
  - A MaxPool2D with stride 2
  - A GlobalAveragePooling layer
- Print the shape of each output step-by-step
- Visualize:
  - The input
  - The output feature maps of the first convolution

Bonus: Try using AdaptiveAvgPool2d((1, 1)) to make your model input-shape agnostic.