# Chapter 1: How a Neural Network Sees an Image
> “Before the model learns, it sees. Before it classifies, it computes. And what it sees starts with pixels, channels, and shapes.”
## Why This Chapter Matters
Every computer vision journey begins with an image. But here’s the twist: your neural network doesn’t see an image the way you do. It sees numbers. And not just any numbers—tensors of pixel values, reshaped and normalized to fit the model’s expectations.
If you’ve ever run into errors like:

- “Expected 3 channels, got 1”
- “Shape mismatch: [1, 224, 224, 3] vs [3, 224, 224]”
- “Model output is garbage despite clean code”

…then the problem probably started here: the image-to-tensor pipeline wasn’t handled correctly.
In this chapter, we’ll unpack the complete transformation from a JPEG or PNG file on disk to a model-ready tensor in memory. We’ll go step by step—from pixel arrays → float tensors → properly shaped inputs—and explain how frameworks like PyTorch and TensorFlow treat the process differently.
You’ll see what the model sees. And that understanding will anchor everything you build later.
## Conceptual Breakdown
### 🔹 What Is an Image in Memory?
To a neural network, an image is just a 3D array—Height, Width, and Color Channels (usually RGB). For grayscale, it’s just H×W. For RGB, it’s H×W×3.
But raw image files (JPEG, PNG) are compressed formats. To use them in training, we:

1. Load the image into memory
2. Convert it to an array of pixel values (0–255)
3. Normalize/scale those values (e.g., to 0.0–1.0 or with mean/std)
4. Reshape it into the tensor format the model expects

Each step matters: a mismatch in any of them can wreck your model. The sketch below shows what steps 1–2 look like in raw NumPy.
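Here’s a minimal sketch of the raw pixel array you get after decoding, assuming Pillow and NumPy are installed and a file named `dog.png` exists (the filename is just an example):

```python
import numpy as np
from PIL import Image

# Decode the compressed file into raw pixels
image = Image.open("dog.png").convert("RGB")
pixels = np.array(image)

print(pixels.shape)              # (H, W, 3) -- height, width, RGB channels
print(pixels.dtype)              # uint8 -- integer values in 0..255
print(pixels.min(), pixels.max())
```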
### 🔹 Tensor Layouts: [H, W, C] vs [C, H, W]
Different frameworks use different conventions:

- TensorFlow uses `[Height, Width, Channels]` (channels-last)
- PyTorch uses `[Channels, Height, Width]` (channels-first)

The reason? Internal memory-layout optimizations. For you, it means that converting between these shapes is a routine step when preparing images for your models, as the sketch below shows.
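A minimal sketch of converting between the two layouts (the variable names here are illustrative):

```python
import numpy as np
import torch

hwc = np.zeros((224, 224, 3), dtype=np.float32)  # TensorFlow-style layout

# NumPy: move the channel axis to the front
chw = np.transpose(hwc, (2, 0, 1))               # -> (3, 224, 224)

# PyTorch: permute does the same on tensors
t = torch.from_numpy(hwc).permute(2, 0, 1)       # -> torch.Size([3, 224, 224])

print(chw.shape, t.shape)
```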
### 🔹 Model Input Shape: Why It Matters
Neural networks are strict about input shape:

- ResNet, MobileNet, EfficientNet, etc. expect a specific input size and layout
- Channels must match: grayscale (1), RGB (3), etc.
- A batch dimension must exist: `[1, C, H, W]` or `[1, H, W, C]`

Even for a single image, you must simulate a batch; most models don’t accept raw 3D tensors. The sketch after this paragraph shows how to add the batch dimension in each framework.
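A minimal sketch of adding the batch dimension to a single (dummy) image in both frameworks:

```python
import torch
import tensorflow as tf

# PyTorch: [C, H, W] -> [1, C, H, W]
img_pt = torch.zeros(3, 224, 224)
batched_pt = img_pt.unsqueeze(0)
print(batched_pt.shape)   # torch.Size([1, 3, 224, 224])

# TensorFlow: [H, W, C] -> [1, H, W, C]
img_tf = tf.zeros([224, 224, 3])
batched_tf = tf.expand_dims(img_tf, axis=0)
print(batched_tf.shape)   # (1, 224, 224, 3)
```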
### 🔹 Visual Walkthrough: Image → Tensor → Model
Let’s break down what happens:
```text
Image file (e.g., 'dog.png')
        ↓
Load into memory (PIL / tf.io / OpenCV)
        ↓
Convert to NumPy array or tensor (shape: H×W×3)
        ↓
Normalize (e.g., /255.0 or mean/std)
        ↓
Transpose (if using PyTorch: → C×H×W)
        ↓
Add batch dim (→ 1×C×H×W or 1×H×W×C)
        ↓
Feed to CNN
```
## PyTorch Implementation
Here’s how you go from image file to model-ready tensor in PyTorch:
```python
from PIL import Image
import torchvision.transforms as T

# 1. Load image
image = Image.open("dog.png").convert("RGB")

# 2. Define transform pipeline
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),                           # converts to float in [0, 1] and reorders to [C, H, W]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]), # ImageNet mean/std used by pretrained models
])

# 3. Apply transforms
tensor = transform(image)                   # shape: [3, 224, 224]

# 4. Add batch dimension
input_tensor = tensor.unsqueeze(0)          # shape: [1, 3, 224, 224]
```
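As a quick sanity check, here’s a sketch of running `input_tensor` through a pretrained model, assuming a recent torchvision (with the weights-enum API) and the ability to download weights:

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT)
model.eval()  # inference mode: freezes batch norm, disables dropout

with torch.no_grad():
    logits = model(input_tensor)

print(logits.shape)  # torch.Size([1, 1000]) -- one score per ImageNet class
```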
## TensorFlow Implementation
The same pipeline in TensorFlow looks like this:
```python
import tensorflow as tf

# 1. Load and decode image
image = tf.io.read_file("dog.png")
image = tf.image.decode_png(image, channels=3)

# 2. Resize and convert to float32 in [0, 1]
image = tf.image.resize(image, [224, 224])
image = tf.cast(image, tf.float32) / 255.0

# 3. Normalize with ImageNet mean/std
mean = tf.constant([0.485, 0.456, 0.406])
std = tf.constant([0.229, 0.224, 0.225])
image = (image - mean) / std

# 4. Add batch dimension
input_tensor = tf.expand_dims(image, axis=0)  # shape: [1, 224, 224, 3]
```
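One caveat worth noting: Keras application models ship their own `preprocess_input`, which may differ from the mean/std normalization above (MobileNetV2’s, for instance, scales pixels to [-1, 1]), so for correct predictions from those models you’d normally use it instead. A sketch, assuming weights can be downloaded:

```python
import tensorflow as tf

# The idiomatic Keras path: skip the manual mean/std and use the
# model's own preprocess_input instead
raw = tf.io.read_file("dog.png")
raw = tf.image.decode_png(raw, channels=3)
raw = tf.image.resize(raw, [224, 224])

model = tf.keras.applications.MobileNetV2(weights="imagenet")
batch = tf.expand_dims(tf.keras.applications.mobilenet_v2.preprocess_input(raw), axis=0)
preds = model(batch)
print(preds.shape)  # (1, 1000) -- ImageNet class probabilities
```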
## Framework Comparison Table
| Step | PyTorch | TensorFlow |
|---|---|---|
| Load image | `PIL.Image.open()` | `tf.io.read_file()` + `tf.image.decode_png()` |
| Resize | `T.Resize((H, W))` | `tf.image.resize()` |
| Convert to float | `T.ToTensor()` (scales to 0–1) | `tf.cast(..., tf.float32) / 255.0` |
| Normalize | `T.Normalize(mean, std)` | Manual: `(image - mean) / std` |
| Layout | `[C, H, W]` | `[H, W, C]` |
| Add batch dim | `.unsqueeze(0)` | `tf.expand_dims(..., axis=0)` |
## Mini-Exercise
Choose any image file and:

1. Load and visualize the original
2. Convert it to a tensor using both PyTorch and TensorFlow
3. Apply normalization
4. Print the shape at each step
5. Confirm the final shape matches the model’s input requirement
Bonus: Try visualizing the image after normalization. What do the pixel values look like now?
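If you get stuck on the bonus, here’s one way to peek at the normalized values, reusing `tensor` from the PyTorch pipeline above (matplotlib assumed installed):

```python
import matplotlib.pyplot as plt

# After ImageNet normalization, values stray well outside [0, 1]
print(tensor.min().item(), tensor.max().item())  # roughly -2.1 to 2.6

img = tensor.permute(1, 2, 0).numpy()  # back to [H, W, C] for plotting
plt.imshow(img.clip(0, 1))             # clip so imshow doesn't complain
plt.axis("off")
plt.show()
```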