Chapter 19: Common Errors and How to Debug Them
“The most dangerous bugs are silent. CNNs won’t throw exceptions when they’re wrong—they’ll just smile and mispredict.”
Why This Chapter Matters
After training your CNN and setting up an inference pipeline, everything looks good, but:
- The model always predicts the same class
- Accuracy is much lower than expected
- It fails on real-world images even if training went fine
These are rarely bugs in the training loop itself. They are systemic pipeline failures, most often caused by:
- Data leakage
- Input misalignment
- Normalization errors
- Inconsistent shapes
- Forgotten mode switching
This chapter equips you with a checklist and mindset to debug confidently.
Conceptual Breakdown
🔍 Where Most Bugs Hide
Bug Type | Example | Detection Strategy |
---|---|---|
Normalization | Wrong mean/std or none at inference | Compare input histograms before and after preprocessing, for both training and inference inputs |
Shape mismatch | Model expects [3, 224, 224], gets [1, 256, 256] | Print .shape at each step |
Wrong eval mode | Model trained well but fails validation | Ensure model.eval() or training=False |
Data leakage | Same image appears in train and test set | Check file paths, cross-fold splits |
One-class output | Predicts class 0 for everything | Check for class imbalance, final layer, softmax use |
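The data-leakage row in the table above is the easiest one to automate. A minimal sketch, assuming the train and test images live in two separate directories; the helper name `find_leaked_files` is hypothetical. It hashes file contents so that renamed copies are caught as well. It only finds byte-identical duplicates, not resized or re-encoded versions of the same image.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_leaked_files(train_dir: str, test_dir: str) -> dict:
    """Map each test file to the train files with identical content."""
    train_by_hash = defaultdict(list)
    for p in sorted(Path(train_dir).rglob("*")):
        if p.is_file():
            train_by_hash[file_digest(p)].append(p)

    leaks = {}
    for p in sorted(Path(test_dir).rglob("*")):
        if p.is_file():
            matches = train_by_hash.get(file_digest(p))
            if matches:  # same bytes exist somewhere in the training set
                leaks[p] = matches
    return leaks
```

Calling `find_leaked_files("data/train", "data/test")` (both paths are placeholders) returns an empty dict when the split is clean; any entry is a test image the model may already have seen during training.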
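PyTorch Debugging Checklist
# 1. Check model mode
print(model.training)  # Should be False during inference (call model.eval() first)
# 2. Check input shape
print(image_tensor.shape)  # Should be [1, 3, 224, 224] for one 224x224 RGB image
# 3. Check normalization
# The expected range depends on the training transform: about -1 to 1 for
# mean=std=0.5, about -2.1 to 2.6 for the ImageNet mean/std.
print(image_tensor.min().item(), image_tensor.max().item())
# 4. Visual sanity check
import matplotlib.pyplot as plt
img = image_tensor.squeeze(0).permute(1, 2, 0).cpu().numpy()  # CHW -> HWC
plt.imshow((img - img.min()) / (img.max() - img.min()))  # Rescale to [0, 1] for display
plt.show()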
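For the visual check, keep in mind that `plt.imshow` expects float images in [0, 1], so a normalized tensor is clipped and looks wrong even when the pipeline is fine. A sketch in NumPy that inverts the normalization first, assuming the ImageNet mean and std commonly used with torchvision models (substitute the values from your own training transform). `denormalize` is a hypothetical helper; its input is a CHW array such as `image_tensor.squeeze(0).cpu().numpy()`.

```python
import numpy as np

# Assumption: the ImageNet statistics commonly used with torchvision models.
# Replace these with the mean/std from your own training transform.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def denormalize(chw: np.ndarray) -> np.ndarray:
    """Turn a normalized CHW float array into an HWC array in [0, 1]."""
    if chw.ndim != 3 or chw.shape[0] != 3:
        raise ValueError(f"expected shape (3, H, W), got {chw.shape}")
    hwc = np.transpose(chw, (1, 2, 0))   # CHW -> HWC for matplotlib
    restored = hwc * STD + MEAN          # invert (x - mean) / std per channel
    return np.clip(restored, 0.0, 1.0)   # guard against rounding overshoot
```

The result can be passed straight to `plt.imshow`. If the displayed image still looks tinted or washed out, the mean/std used here probably does not match the one used in training, which is itself a useful finding.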
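TensorFlow Debugging Checklist
import tensorflow as tf
import matplotlib.pyplot as plt
# 1. Model mode: training=False disables dropout and uses BatchNorm moving averages
logits = model(img, training=False)
# 2. Shape check
print(img.shape)  # Should be (1, 224, 224, 3)
# 3. Normalization: the expected range depends on the preprocess_input() used in training
print(tf.reduce_min(img).numpy(), tf.reduce_max(img).numpy())
# 4. Show input
plt.imshow(img[0] / 2 + 0.5)  # Undo [-1, 1] normalization for visualization
plt.show()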
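The right min/max in the normalization step depends on which preprocessing the model was trained with, so a bare pair of numbers is easy to misread. A rough heuristic sketch in NumPy that labels an array by its value range; the function name `describe_pixel_range` and its thresholds are illustrative assumptions, not part of any framework.

```python
import numpy as np


def describe_pixel_range(x) -> str:
    """Guess the preprocessing state of an image array from its min/max."""
    arr = np.asarray(x, dtype=np.float64)
    lo, hi = float(arr.min()), float(arr.max())
    if lo >= 0.0 and hi > 1.0:
        label = "raw pixels, likely [0, 255] (not rescaled)"
    elif lo >= 0.0:
        label = "rescaled to [0, 1]"
    elif lo >= -1.0 and hi <= 1.0:
        label = "scaled to [-1, 1]"
    else:
        label = "mean-centered or standardized"
    return f"min={lo:.3f} max={hi:.3f} -> {label}"
```

Log `describe_pixel_range(img.numpy())` once on a training batch and once on an inference input. If the two labels differ, the preprocessing paths have diverged. The guess is unreliable for very dark or very flat images, so treat it as a hint only.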
🛠 Common CNN Errors & Fixes
Symptom | Cause | Fix |
---|---|---|
Always predicting one class | Class imbalance, untrained head, no softmax | Use weighted loss, check class distribution, inspect logits |
Very low accuracy at inference | Wrong normalization | Match training transforms exactly |
Crash on inference | Missing batch dimension or wrong dtype (e.g. uint8 instead of float32) | Add the batch dimension with .unsqueeze(0) and convert to float32 |
Fails on grayscale / RGBA image | Expecting 3-channel RGB | Convert with PIL: image.convert("RGB") |
Prediction unstable during validation | Forgot .eval() or no_grad() | Call model.eval() and use torch.no_grad() |
Shape mismatch in pretrained model | Mismatch in input resolution or output class count | Resize input and adapt output layer |
Test image looks weird | Bad rescale or wrong channel order | Visualize before and after transforms |
Weird outputs in browser/mobile | Tensor converted incorrectly to JS or HTML format | Normalize properly and ensure channels/byte values are valid |
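The "prediction unstable during validation" row is easy to reproduce on a toy model. A minimal PyTorch sketch, assuming a small stand-in network with a dropout layer in place of a real CNN: the same input is passed through the model twice in each mode. Both passes run under `torch.no_grad()`, which shows that `no_grad()` only disables gradient tracking; it is `model.eval()` that makes the output deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a CNN head: dropout behaves differently in train and eval mode.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 3))
x = torch.randn(1, 8)

model.train()
with torch.no_grad():
    train_a, train_b = model(x), model(x)  # a fresh dropout mask is drawn per call

model.eval()
with torch.no_grad():
    eval_a, eval_b = model(x), model(x)    # dropout is disabled: same output every time

print("train mode outputs equal:", torch.equal(train_a, train_b))
print("eval mode outputs equal: ", torch.equal(eval_a, eval_b))
```

In eval mode the two outputs are identical. In train mode they will almost always differ, which is exactly the instability seen when `model.eval()` is forgotten.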
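Visual Debugging Techniques
1. Visualize Inputs Before and After Transform
# PyTorch: the tensor is CHW, so permute to HWC for matplotlib.
# If the tensor is normalized, undo that first; imshow clips floats outside [0, 1].
plt.subplot(1, 2, 1); plt.imshow(PIL_img)
plt.subplot(1, 2, 2); plt.imshow(tensor.permute(1, 2, 0).numpy())
# TensorFlow
plt.imshow(img_tensor[0] / 2 + 0.5)  # If normalized to [-1, 1]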
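2. Log Confidence Scores
# PyTorch (outputs are raw logits of shape [batch, num_classes])
probs = torch.softmax(outputs, dim=1)
print(probs.topk(3))  # Top-3 prediction scores and class indices
# TensorFlow (skip the softmax if the model's last layer already applies it)
probs = tf.nn.softmax(preds).numpy()
print(np.argsort(probs[0])[-3:][::-1])  # Top-3 class indices, most likely first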
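Class indices alone are hard to read in a log. A small framework-agnostic sketch in NumPy that turns one row of raw logits into labeled top-k confidences; `top_k_report` and the class names are hypothetical. It assumes the input is logits: if your model already ends in a softmax layer, skip the softmax step, since applying it twice flattens the scores.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k_report(logits, class_names: list, k: int = 3) -> list:
    """Return [(class_name, probability), ...] for the k most likely classes."""
    probs = softmax(np.asarray(logits, dtype=np.float64))
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(class_names[i], float(probs[i])) for i in order]


# Hypothetical logits for a 4-class model
names = ["cat", "dog", "bird", "fish"]
for name, p in top_k_report(np.array([2.0, 0.5, 3.0, -1.0]), names):
    print(f"{name}: {p:.3f}")
```

A top-1 score close to 1/num_classes on every image is a quick sign that the head is untrained or that the input is not what the model expects.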
3. Debug Batch Processing
- Ensure every sample in a batch has the same shape
- Ensure every batch has the same dtype
- Check the shuffling order: images and labels must stay paired after shuffling
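The shape and dtype checks above can be sketched as one helper that runs before samples are stacked into a batch. `check_batch_consistency` is a hypothetical name; the sketch uses NumPy arrays, but it only reads `.shape` and `.dtype`, so it should work on PyTorch tensors as well.

```python
import numpy as np


def check_batch_consistency(samples: list) -> list:
    """Return a list of human-readable problems found in a list of arrays."""
    if not samples:
        return ["batch is empty"]
    problems = []
    ref_shape, ref_dtype = samples[0].shape, samples[0].dtype
    for i, s in enumerate(samples[1:], start=1):
        if s.shape != ref_shape:
            problems.append(f"sample {i}: shape {s.shape} != {ref_shape}")
        if s.dtype != ref_dtype:
            problems.append(f"sample {i}: dtype {s.dtype} != {ref_dtype}")
    return problems
```

Returning a list of messages instead of raising on the first mismatch makes it possible to log every offending sample in one pass.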
Defensive Programming Tips
Strategy | Benefit |
---|---|
assert image.shape == (1, 3, 224, 224) | Prevents hidden shape bugs |
assert img.dtype == torch.float32 | Ensures model receives valid input |
Logging before/after each step | Makes silent bugs visible |
Try inference on one image manually | Removes complexity, isolates the problem |
Unit test your transforms | Catch errors early |
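For the last row of the table, a transform test can be very small. A sketch using the standard `unittest` module and NumPy. `preprocess` is a hypothetical stand-in for your own transform (it scales uint8 HWC pixels to [-1, 1] and returns a float32 NCHW batch), so the expected values in the tests are assumptions tied to that stand-in.

```python
import unittest

import numpy as np


def preprocess(image_hwc_uint8: np.ndarray) -> np.ndarray:
    """Hypothetical transform: uint8 HWC in [0, 255] -> float32 NCHW in [-1, 1]."""
    x = image_hwc_uint8.astype(np.float32) / 127.5 - 1.0
    return np.transpose(x, (2, 0, 1))[np.newaxis, ...]


class PreprocessTest(unittest.TestCase):
    def setUp(self):
        rng = np.random.default_rng(0)
        self.image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

    def test_shape_and_dtype(self):
        out = preprocess(self.image)
        self.assertEqual(out.shape, (1, 3, 224, 224))
        self.assertEqual(out.dtype, np.float32)

    def test_value_range(self):
        out = preprocess(self.image)
        self.assertGreaterEqual(out.min(), -1.0)
        self.assertLessEqual(out.max(), 1.0)

    def test_channel_order_preserved(self):
        # A pure-red image must map to +1 in channel 0 and -1 in the others.
        red = np.zeros((2, 2, 3), dtype=np.uint8)
        red[..., 0] = 255
        out = preprocess(red)
        self.assertTrue(np.all(out[0, 0] == 1.0))
        self.assertTrue(np.all(out[0, 1:] == -1.0))
```

Run it with `python -m unittest` on the file that contains it. The channel-order test is the one most likely to catch an RGB/BGR swap, which shape and range checks cannot see.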
Framework Comparison Table
Debug Task | PyTorch | TensorFlow / Keras |
---|---|---|
Check training/eval mode | model.training | Explicit training=False flag |
Visualize input | permute(1,2,0) on tensor | / 2 + 0.5 if normalized to [-1,1] |
Print prediction scores | softmax(outputs, dim=1) | tf.nn.softmax() |
One-image inference | unsqueeze(0) and no_grad() | np.expand_dims() and model.predict() |
Inspect model layers | print(model) or summary() from the third-party torchinfo package | model.summary() |
Mini-Exercise
Try this on your own CNN project:
- Pick an image from your test set
- Run it through your full pipeline
- Log:
  - Input shape, dtype
  - Image pixel range before/after preprocessing
  - Top-3 predicted classes with confidence
- Visualize:
  - Input image
  - Preprocessed tensor
  - Activation maps (from Chapter 17)
- Bonus:
  - Temporarily insert an invalid image (grayscale, wrong shape) and handle it gracefully
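One possible starting point for the bonus step, sketched in NumPy on already-decoded HWC arrays. The helper names `to_rgb` and `safe_prepare` are hypothetical, and "handle gracefully" is taken here to mean: fix what can be fixed, and skip the rest with a logged reason instead of crashing.

```python
import numpy as np


def to_rgb(image: np.ndarray) -> np.ndarray:
    """Coerce a grayscale, RGB, or RGBA array to HWC RGB, or raise ValueError."""
    if image.ndim == 2:                      # (H, W) grayscale
        return np.stack([image] * 3, axis=-1)
    if image.ndim == 3:
        channels = image.shape[-1]
        if channels == 1:                    # (H, W, 1) grayscale
            return np.repeat(image, 3, axis=-1)
        if channels == 3:                    # already RGB
            return image
        if channels == 4:                    # RGBA: drop the alpha channel
            return image[..., :3]
    raise ValueError(f"unsupported image shape {image.shape}")


def safe_prepare(image: np.ndarray):
    """Return an RGB array, or None (with a logged reason) for invalid input."""
    try:
        return to_rgb(image)
    except ValueError as err:
        print(f"skipping image: {err}")
        return None
```

Dropping the alpha channel ignores transparency; if your images rely on it, compositing onto a fixed background color is the safer choice.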
🔚 Final Tips: The Debugging Mindset
- Assume nothing—even if training went perfectly
- Print and plot often
- Start small: one image, one batch
- Compare to known-good outputs (reference image → known prediction)
- Trace the full pipeline: from raw input to final prediction
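The known-good comparison in the list above can be turned into a regression check. A sketch in NumPy, assuming you have saved the probability vector that a trusted version of the pipeline produced for a reference image; `matches_reference` is a hypothetical helper and the tolerance is an assumption, since different hardware and library versions produce slightly different floats.

```python
import numpy as np


def matches_reference(probs, reference, atol: float = 1e-4) -> bool:
    """True if current probabilities agree with the stored known-good ones."""
    probs, reference = np.asarray(probs), np.asarray(reference)
    if probs.shape != reference.shape:
        return False  # class count changed: the output layer is not the same
    same_top1 = int(probs.argmax()) == int(reference.argmax())
    return same_top1 and bool(np.allclose(probs, reference, atol=atol))
```

Running this on one reference image after every change to preprocessing, weights, or export format catches most silent regressions before they reach real data.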
What You Can Now Do
- Build sanity-checked pipelines that won’t fail silently
- Fix models that seem broken but are just misconfigured
- Gain confidence and trust in your model’s predictions
- Catch and fix systemic bugs before they go live