SVHN Digit Recognition – Plain English

Chapter 1

The Problem & Dataset

What are we trying to do, and what data do we have to do it with?

Context

What is SVHN and why does it matter?

SVHN stands for Street View House Numbers: a dataset of over 600,000 digit images cropped from Google Street View photos, built to support automatic transcription of building numbers from street-level imagery.

If you know a building's address number and the street it's on, you can pinpoint its exact location. Automating that transcription at scale — reading the digits off millions of photos — is a real computer vision problem with real-world impact on mapping quality.

In this project we use a subset of the data: 120,000 images across train, validation, and test splits, each a 32×32 pixel greyscale crop centered on a single digit.

Dataset Structure

What does the data look like?

42,000

Training images

60,000

Validation images

18,000

Test images

32×32

Pixels per image (greyscale)

10

Classes (digits 0–9)

H5 file contents:
X_train: (42000, 32, 32) float32 ← pixel arrays
X_val: (60000, 32, 32) float32
X_test: (18000, 32, 32) float32
y_train: (42000,) uint8 ← labels 0–9
y_val: (60000,) uint8
y_test: (18000,) uint8

The data arrives as an HDF5 file — a structured binary format that stores large numerical arrays efficiently. Think of it as a very efficient spreadsheet for numbers. Each image is a 32×32 grid of pixel brightness values (0 = black, 255 = white). Each label is simply a number from 0 to 9 saying which digit is in the center of that crop.

One important design choice: the dataset came with a dedicated validation set of 60,000 images, larger than the training set. Rather than carving validation out of training data (a 20% split would take 8,400 of the 42,000 training samples), we used this pre-built split directly.

Key Insight Before Modeling

A surprise hiding in the images

Sample images from the dataset — notice the neighboring digits

Each 32×32 crop was built by centering a window on the target digit in the original street photo. But the neighboring digits on either side naturally fall into frame too. An image labeled "2" might visually show "128" or "25" or "72".

The label only refers to the center digit, but the model has to figure that out on its own, from pixel patterns alone. This raised an interesting question explored later in Chapter 8: could we help the model focus on the center by mathematically dimming the edges before training?

Chapter 2

Exploratory Data Analysis (EDA)

Before building any model, we examine the data to understand its structure, balance, and visual characteristics.

EDA · Class Distribution

Are all digits equally represented?

Number of images per digit class (0–9)

Imbalance ratio (max/min): 1.03
All 10 digit classes (0–9) present
Dataset is near-perfectly balanced

An imbalance ratio of 1.03 means the most common digit class has only 3% more images than the least common. That's essentially perfect balance. This matters because a heavily imbalanced dataset would bias a model toward predicting the majority class — making accuracy a misleading metric.

Here, accuracy is a trustworthy measure of real performance across all digits.
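The imbalance ratio is simply the largest class count divided by the smallest. A minimal sketch of that check follows; the `toy_labels` list is a made-up stand-in for the real `y_train`, chosen so the ratio comes out at 1.03.

```python
from collections import Counter

def imbalance_ratio(labels):
    """Return (count per class, largest count / smallest count)."""
    counts = Counter(labels)
    return counts, max(counts.values()) / min(counts.values())

# Hypothetical toy labels, NOT the real y_train
toy_labels = [0] * 103 + [1] * 100 + [2] * 101
counts, ratio = imbalance_ratio(toy_labels)
print(dict(counts))
print(round(ratio, 2))  # 1.03
```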

EDA · Sample Images per Class

What does each digit actually look like?

5 random examples of each digit (0–9) — notice the diversity within each class

Each row shows five different examples of the same digit. The variation is striking: same digit, completely different fonts, sizes, angles, lighting, and backgrounds. A "7" might be bold or thin, tilted or upright, lit from the left or washed out by sunlight.

This is exactly why deep learning is needed here. A rule-based system ("a 7 has a horizontal bar at the top") would fail almost immediately. A neural network learns the statistical patterns that define a digit across all its real-world variations.

Pairs expected to cause the most confusion: 1 vs 7, 3 vs 8, 5 vs 6 — confirmed later in the confusion matrices.

EDA · Mean Image per Class

Where does each digit's signal actually live in the image?

Average pixel value across all images of each digit — reveals spatial structure

By averaging all images of each digit class, you get a "ghost image" that shows where strokes consistently appear. Bright areas = pixels that are usually bright for that digit. Dark areas = usually dark.

Two critical observations emerge:

1. Digit signal concentrates in the center. The mean images are brighter in the middle and fade toward the edges, which is consistent with the neighboring-digit problem: the center holds the labeled digit, while the edges hold whatever happens to sit next to it.

2. Background inversion is common. Some images have dark digits on a light background; others are the opposite. The mean image averages these out, producing a blurry center blob rather than a sharp digit shape. This inspired the spatial mask experiment in Chapter 8.
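A small sketch of how a per-class mean image can be computed, and why inverted backgrounds blur it. The two 2×2 "images" are invented for illustration: one is bright-on-dark, the other is the same pattern inverted, and both carry the same label.

```python
def mean_image_per_class(images, labels):
    """Average images (lists of rows) pixel by pixel, separately for each label."""
    sums, counts = {}, {}
    for img, y in zip(images, labels):
        if y not in sums:
            sums[y] = [[0.0] * len(row) for row in img]
            counts[y] = 0
        for r, row in enumerate(img):
            for c, value in enumerate(row):
                sums[y][r][c] += value
        counts[y] += 1
    return {y: [[v / counts[y] for v in row] for row in sums[y]] for y in sums}

# Hypothetical toy data: the same stroke, once bright-on-dark, once dark-on-bright
toy_images = [
    [[0, 255], [0, 255]],
    [[255, 0], [255, 0]],
]
toy_labels = [7, 7]

means = mean_image_per_class(toy_images, toy_labels)
print(means[7])  # [[127.5, 127.5], [127.5, 127.5]]
```

The two opposite-polarity images cancel into uniform grey, which is the blurring effect described in observation 2.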

Chapter 3

Data Preparation

Raw pixel data can't be fed directly to a neural network without some preparation steps.

Preparation Steps

Normalisation, reshaping, and one-hot encoding

Step 1: Normalise pixel values
Raw: 0–255 (arbitrary brightness units)
After: 0.0–1.0 (divide by 255)
Why: neural networks train faster and more stably when inputs are on the same scale

Step 2: Reshape for CNNs
ANN input: (42000, 1024) ← flatten 32×32 into one row
CNN input: (42000, 32, 32, 1) ← keep spatial grid, add channel dim

Step 3: One-hot encode labels
Before: y = [2, 6, 7, 4, ...] (integer 0–9)
After: y = [[0,0,1,0,0,0,0,0,0,0], [0,0,0,0,0,0,1,0,0,0], ...]
Why: neural networks output 10 probabilities, not a single number

Normalisation rescales pixel brightness from 0–255 down to 0.0–1.0. This keeps all inputs in the same range, which prevents the network's weights from being dominated by large pixel values and makes gradient descent more numerically stable.

Reshaping differs between ANN and CNN. An ANN receives a flat list of 1,024 numbers (32×32 unrolled). A CNN receives a 2D grid with an extra "channels" dimension — for greyscale images, that's always 1. Keeping the grid intact is what allows CNNs to detect spatial patterns like edges and curves.

One-hot encoding converts "label = 7" into a vector of ten zeros with a single 1 at position 7. The network then learns to output high probability at the correct position.
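The three preparation steps can be sketched in plain Python on a tiny 2×2 image. The helper names are illustrative, not taken from the notebook, and an array library would normally do this in one vectorised call per step.

```python
def normalise(image):
    """Scale 0-255 pixel values to the 0.0-1.0 range."""
    return [[pixel / 255.0 for pixel in row] for row in image]

def flatten(image):
    """ANN input: unroll the grid into one flat list."""
    return [pixel for row in image for pixel in row]

def add_channel(image):
    """CNN input: keep the grid and wrap each pixel in a 1-element channel axis."""
    return [[[pixel] for pixel in row] for row in image]

def one_hot(label, num_classes=10):
    """Turn an integer label into a vector with a single 1."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector

toy = [[0, 255], [51, 102]]          # hypothetical 2x2 greyscale image
scaled = normalise(toy)
ann_input = flatten(scaled)          # shape (4,)
cnn_input = add_channel(scaled)      # shape (2, 2, 1)

print(ann_input)     # [0.0, 1.0, 0.2, 0.4]
print(cnn_input[0])  # [[0.0], [1.0]]
print(one_hot(7))    # [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
```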

Chapter 4

ANN Model 1 — The Baseline

We start simple: a fully-connected network that treats each image as a flat list of pixel values.

ANN Model 1 · Architecture

A small, shallow network — intentionally simple

Architecture: Input(1024) → Dense(64) → Dense(32) → Dense(10)
Total parameters: 68,010 (265 KB)
Input layer: 1,024 pixels (flattened 32×32)
Hidden layer 1: 64 neurons, ReLU activation
Hidden layer 2: 32 neurons, ReLU activation
Output layer: 10 neurons, Softmax (one per digit)

An ANN (Artificial Neural Network) — also called a fully-connected or dense network — works by multiplying every input value by a learned weight, adding them up, and passing the result through an activation function. It does this in layers, each one building a more abstract representation.

The problem: when you flatten a 32×32 image into 1,024 numbers, you lose all spatial information. Pixel 100 no longer "knows" it's next to pixel 101. The network sees a long list, not a picture. This is the fundamental limitation we'll fix with CNNs later.
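The parameter count quoted for this model can be reproduced by hand: a dense layer has one weight per input-output pair plus one bias per output. The sketch below assumes float32 storage (4 bytes per parameter) for the size estimate.

```python
def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output neuron."""
    return n_in * n_out + n_out

layer_sizes = [1024, 64, 32, 10]  # Input -> Dense(64) -> Dense(32) -> Dense(10)
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [65600, 2080, 330]
print(total)      # 68010
print(f"{total * 4 / 1024:.2f} KB")  # 265.66 KB, assuming 4 bytes per parameter
```

Almost all of the parameters sit in the first layer, because every one of the 1,024 pixels connects to every one of the 64 neurons.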

ANN Model 1 · Training

Training curves — learning progress over 20 epochs

Training vs validation accuracy over 20 epochs

Epoch 1: train 14.2%, val 18.8%
Epoch 5: train ~52%, val ~52%
Epoch 10: train ~58%, val ~59%
Epoch 20: train ~64%, val ~64%

Final results:
Train accuracy: 64.02%
Val accuracy: 63.86%
Test accuracy: 63.57%

The training and validation curves track closely together — a good sign that the model isn't memorising the training data. However, 64% accuracy means the model gets roughly 1 in 3 predictions wrong. For 10 classes, random guessing would give 10% — so 64% is real learning, just not great learning.

The curves also plateau early, suggesting the model has hit its capacity ceiling. Adding more epochs won't help much — the architecture itself is the bottleneck.

Chapter 5

ANN Model 2 — Deeper with Regularisation

More layers, more capacity, and two techniques to prevent memorisation: Dropout and BatchNorm.

ANN Model 2 · Architecture

A deeper network with regularisation layers

Architecture:
Input(1024)
→ Dense(256) + BatchNorm + ReLU
→ Dense(128) + BatchNorm + ReLU + Dropout(0.4)
→ Dense(64) + BatchNorm + ReLU
→ Dense(32) + ReLU
→ Dense(10) + Softmax

Total parameters: 310,250 (1.18 MB)
Trainable: 310,186
Non-trainable: 64 ← BatchNorm statistics

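Dropout: during each training step, randomly "switches off" 40% of neurons. This forces the network to learn redundant representations, so no single neuron can become essential. At test time, all neurons are active. To keep the expected activation the same in both modes, frameworks such as Keras scale the surviving outputs up by 1/(1 − 0.4) during training rather than scaling anything down at test time. The result: better generalisation to new data.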
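A minimal sketch of "inverted" dropout, the variant used by Keras and most current frameworks, in which surviving activations are scaled up during training so that inference needs no rescaling. The function and variable names are illustrative, and the fixed seed is only there to make the run repeatable.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate`; scale survivors by 1/(1 - rate)."""
    if not training:
        return list(activations)  # inference: every neuron active, no scaling
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000  # hypothetical layer output

dropped = dropout(acts, rate=0.4, training=True, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
mean_train = sum(dropped) / len(dropped)
inference = dropout(acts, rate=0.4, training=False, rng=rng)

print(round(zero_fraction, 2))  # close to 0.4
print(round(mean_train, 2))     # close to 1.0: the expected activation is preserved
```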

BatchNorm (Batch Normalisation): rescales a layer's activations to zero mean and unit variance across each mini-batch, then applies a learned scale and shift. This stabilises training, allows higher learning rates, and slightly regularises the network. The 64 non-trainable parameters store the running mean and variance that replace the batch statistics at inference time.
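The core of the normalisation step can be sketched for a single feature across one mini-batch. This is a simplified illustration: `gamma` and `beta` stand for the learned scale and shift, `eps` is the usual small constant that guards against division by zero, and the running statistics used at inference time are left out.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]  # hypothetical activations of one neuron
normed = batch_norm(batch)

new_mean = sum(normed) / len(normed)
new_var = sum((v - new_mean) ** 2 for v in normed) / len(normed)
print(round(new_mean, 6), round(new_var, 3))
```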

ANN Model 2 · Training

Training curves — 30 epochs

Training vs validation accuracy over 30 epochs

Final results:
Train accuracy: 77.59%
Val accuracy: 79.73%
Test accuracy: 77.33%
Improvement over Model 1: +13.76 percentage points (test: 77.33% vs 63.57%)

Interesting detail: validation accuracy slightly exceeds training accuracy (79.73% vs 77.59%). This can happen with Dropout: during training, neurons are randomly disabled, making training harder. At validation time, all neurons are active, giving the model full power. It's a sign Dropout is doing its job.

A significant improvement over Model 1 (+13.8 pts), but the ceiling is becoming visible. More depth helps, but without spatial awareness, even a very deep ANN has fundamental limitations.

ANN Model 2 · Results

Classification report & confusion matrix

Confusion matrix — each row is the true digit, each column is what the model predicted

Digit	Precision	Recall	F1
0	0.78	0.81	0.80
1	0.74	0.83	0.79
2	0.76	0.82	0.79
3	0.72	0.73	0.72
4	0.73	0.87	0.79
5	0.76	0.73	0.75
6	0.79	0.75	0.77
7	0.86	0.79	0.82
8	0.78	0.67	0.72
9	0.84	0.71	0.77

Overall accuracy: 77%

The confusion matrix is a grid in which bright squares on the diagonal are correct predictions and bright off-diagonal squares show where the model confuses one digit for another. The most confused pairs are 3↔8, 5↔6, and 1↔7, exactly as predicted from the sample images.

Precision: of all the times the model predicted "digit X", how often was it right? Recall: of all the actual "digit X" images, how many did the model catch? F1: the harmonic mean of the two. Digits 3 and 8 share the lowest F1 (0.72): digit 3 has the lowest precision (0.72) and digit 8 the lowest recall (0.67), which makes them the hardest pair to separate without spatial feature detection.
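All three metrics can be read straight off a confusion matrix whose rows are true classes and whose columns are predictions. The sketch below uses an invented 3-class matrix, then checks the F1 formula against one row of the report (digit 8: precision 0.78, recall 0.67). It assumes every class is both present and predicted at least once, so no division by zero occurs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def per_class_metrics(confusion, cls):
    """Rows = true class, columns = predicted class."""
    true_positives = confusion[cls][cls]
    predicted_as_cls = sum(row[cls] for row in confusion)  # column sum
    actually_cls = sum(confusion[cls])                     # row sum
    precision = true_positives / predicted_as_cls
    recall = true_positives / actually_cls
    return precision, recall, f1_score(precision, recall)

# Hypothetical 3-class confusion matrix, not taken from the project
toy_confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 2, 8],
]
precision, recall, f1 = per_class_metrics(toy_confusion, 0)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.8 0.8 0.8

# Digit 8 in the report above: precision 0.78, recall 0.67
print(round(f1_score(0.78, 0.67), 2))  # 0.72
```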

Chapter 6

CNN Model 1 — Spatial Features Enter the Picture

Convolutional Neural Networks keep the 2D structure of the image intact and learn to detect visual features like edges, curves, and strokes.

CNN Model 1 · Architecture

Two convolutional blocks + a dense head

Architecture:
Input: (32, 32, 1)
Conv Block 1: Conv2D(16, 3×3) → LeakyReLU → Conv2D(16, 3×3) → LeakyReLU → MaxPooling(2×2)
Conv Block 2: Conv2D(32, 3×3) → LeakyReLU → Conv2D(32, 3×3) → LeakyReLU → MaxPooling(2×2)
Flatten
Dense(128) → ReLU → Dense(10) → Softmax

Total parameters: 267,306 (1.02 MB)

A convolutional layer slides a small filter (3×3 pixels) across the image and learns to detect a specific visual feature — an edge, a curve, a corner. Multiple filters run in parallel, each detecting something different. The network stacks these to build up from simple features (edges) to complex ones (digit shapes).
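To make the sliding-filter idea concrete, here is a sketch of a single 3×3 filter applied to a tiny image with no padding. The filter values are hand-picked to respond to a dark-to-bright vertical edge; in a real CNN they would be learned during training. Like deep learning frameworks, the sketch computes a cross-correlation, meaning the kernel is not flipped.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding) and sum the elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical 5x5 image: dark on the left, bright on the right
toy = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked filter that fires where brightness rises from left to right
vertical_edge = [[-1, 0, 1] for _ in range(3)]

response = conv2d_valid(toy, vertical_edge)
print(response)  # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
```

The response is strong wherever the window straddles the edge and zero in the uniformly bright region, which is the sense in which a filter "detects" a feature.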

LeakyReLU: a variant of the standard ReLU activation function. Standard ReLU outputs zero for every negative input, so a neuron whose input stays negative receives no gradient and can stop learning (the "dying ReLU" problem). LeakyReLU passes 10% of the negative signal through, so every neuron keeps a non-zero gradient and stays trainable.
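The difference between the two activations fits in a few lines. The slope of 0.1 for negative inputs matches the 10% figure given here; the gradient helpers are included only to show where ReLU's gradient vanishes.

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.1):
    """Pass positive inputs unchanged; scale negative inputs by alpha."""
    return x if x > 0 else alpha * x

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.1):
    return 1.0 if x > 0 else alpha

for x in (-2.0, 3.0):
    print(x, relu(x), leaky_relu(x), relu_grad(x), leaky_relu_grad(x))
# -2.0 -> ReLU output 0.0 with gradient 0.0; LeakyReLU output -0.2 with gradient 0.1
#  3.0 -> both pass the value through with gradient 1.0
```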

MaxPooling: halves the image's width and height (each 2×2 block → 1 pixel), keeping only the strongest signal in each block. This makes the network more tolerant of small shifts: a "7" that sits a pixel or two off-center still produces much the same features.
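A sketch of 2×2 max pooling with stride 2 on an invented 4×4 feature map. The second input moves one strong activation by a single pixel inside its block to show that the pooled output does not change.

```python
def max_pool_2x2(image):
    """Keep the maximum of each non-overlapping 2x2 block (stride 2)."""
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]

# Hypothetical 4x4 feature map
toy = [
    [1, 3, 0, 0],
    [2, 4, 0, 9],
    [5, 0, 1, 1],
    [0, 0, 1, 2],
]
# Same map with the strong "9" moved one pixel up and left, inside the same block
shifted = [
    [1, 3, 9, 0],
    [2, 4, 0, 0],
    [5, 0, 1, 1],
    [0, 0, 1, 2],
]

pooled = max_pool_2x2(toy)
pooled_shifted = max_pool_2x2(shifted)
print(pooled)          # [[4, 9], [5, 2]]
print(pooled_shifted)  # [[4, 9], [5, 2]]
```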

CNN Model 1 · Training — Overfitting Detected

Training curves reveal a classic overfitting problem

Training vs validation accuracy — the gap tells the whole story

Final results:
Train accuracy: 99.08% ← near-perfect
Val accuracy: 95.17%
Test accuracy: 87.30%
Gap: train − test = 11.78 pts
Verdict: OVERFITTING

The model achieves 99% on training data but only 87% on unseen test data — an 11.78 percentage point gap. This is textbook overfitting: the model has memorised the training examples rather than learning generalizable rules.

The convolutional layers are fine — they share weights spatially, which naturally limits overfitting. The problem is the dense head: two fully-connected layers with no Dropout. Without regularisation, dense layers freely memorise. The fix: add Dropout to the dense head in Model 2.

Chapter 7

CNN Model 2 — The Best Model

Deeper convolutions, BatchNorm throughout, Dropout in the dense head, and a model checkpoint to save the best weights. This is the final, production-quality model.

CNN Model 2 · Architecture

Four convolutional blocks with full regularisation

Architecture:
Input: (32, 32, 1)
Conv Block 1: Conv2D(16) → LeakyReLU → Conv2D(16) → LeakyReLU → MaxPool
Conv Block 2: Conv2D(32) → LeakyReLU → Conv2D(32) → LeakyReLU → MaxPool
Conv Block 3: Conv2D(64) → LeakyReLU → BatchNorm
Conv Block 4: Conv2D(64) → LeakyReLU → BatchNorm → MaxPool
Flatten → Dropout(0.5)
Dense(128) → BatchNorm → ReLU → Dropout(0.5)
Dense(10) → Softmax

Total parameters: 164,362 (642 KB)
Trainable: 164,170
Non-trainable: 192 ← BatchNorm stats

Note: fewer parameters than CNN Model 1 (267K), yet significantly better performance, which suggests that regularisation and architecture matter more than raw size.

This model is deeper (4 conv blocks vs 2) but also smaller in total parameters (164K vs 267K). The third pooling stage shrinks the feature map that feeds the dense head, which is where most of a small CNN's parameters sit, while BatchNorm and Dropout help the leaner network generalise. It's not about brute-force size; it's about learning efficiently.

Dropout(0.5) after Flatten: randomly drops 50% of the flattened feature map before the dense layers — this is the key fix from Model 1's overfitting. The dense head can no longer memorise.

A ModelCheckpoint callback saves the weights at the epoch where validation accuracy peaks — so even if the model starts to overfit in later epochs, we keep the best version.
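The selection logic behind such a checkpoint is easy to sketch without any framework: track validation accuracy per epoch and remember the epoch where it peaked. The accuracy values below are invented for illustration. In Keras this behaviour is typically configured with `monitor="val_accuracy"` and `save_best_only=True`, though whether the notebook used exactly those settings is an assumption.

```python
def best_checkpoint(val_accuracy_by_epoch):
    """Return (epoch, accuracy) for the highest validation accuracy; epochs start at 1."""
    best_epoch, best_acc = None, float("-inf")
    for epoch, acc in enumerate(val_accuracy_by_epoch, start=1):
        if acc > best_acc:  # strict '>' keeps the earliest epoch on ties
            best_epoch, best_acc = epoch, acc
    return best_epoch, best_acc

# Hypothetical validation accuracies: improvement, then a slight decline
hypothetical_history = [0.91, 0.94, 0.96, 0.955, 0.95]
epoch, acc = best_checkpoint(hypothetical_history)
print(epoch, acc)  # 3 0.96
```

Weights from epoch 3 would be the ones kept, even though training continued for two more epochs.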

CNN Model 2 · Training

Training curves — clean convergence, no overfitting

Training vs validation accuracy — near-perfect alignment

Final results:
Train accuracy: 96.26%
Val accuracy: 96.88%
Test accuracy: 92.22%
Gap: train − test = 4.04 pts (vs 11.78 pts for CNN Model 1)
Improvement over CNN Model 1: +4.92 pts
Improvement over best ANN: +14.89 pts

The training and validation curves run almost perfectly parallel, a hallmark of a well-regularised model. The gap between training and test accuracy has been cut from 11.78 to 4.04 points. The remaining gap, like the 4.66-point difference between validation and test accuracy, suggests that the test split contains harder images than the other two.

Validation accuracy (96.88%) slightly exceeding training accuracy (96.26%) is the Dropout effect again — the model is actually a bit held back during training but runs free at inference time.

CNN Model 2 · Results

Classification report & confusion matrix

Confusion matrix — almost entirely diagonal (correct predictions)

Digit	Precision	Recall	F1
0	0.94	0.95	0.95
1	0.91	0.93	0.92
2	0.94	0.93	0.94
3	0.90	0.90	0.90
4	0.94	0.93	0.94
5	0.89	0.92	0.91
6	0.91	0.91	0.91
7	0.94	0.94	0.94
8	0.93	0.90	0.91
9	0.91	0.91	0.91

Overall accuracy: 92%

Compare these F1 scores to ANN Model 2: every single digit improved dramatically — from the 0.72–0.82 range to 0.90–0.95. The confusion matrix is almost entirely diagonal, meaning the model nearly always predicts the correct digit.

The hardest pairs (3 vs 8, 5 vs 6) still score slightly lower than the rest, but the gaps are now small. The spatial feature detection of CNNs, such as picking up the closed left-hand loops of an 8 that a 3 leaves open, directly addresses the root cause of those confusions.

Chapter 8

Preprocessing Experiment: Data-Driven Spatial Mask

A bonus investigation: can we help the model by mathematically dimming the edge pixels (where neighboring digits appear) before training?

The Experiment

Building a mask from the data itself

The data-driven spatial mask — brighter = higher weight, darker = suppressed

How the mask is built:
1. Background-normalize all images (69.4% of images were inverted)
2. Average all 42,000 training images
3. Normalize average to 0.0–1.0 → This IS the mask

Mask statistics:
Center weight (16,16): 0.8621
Corner weight (0, 0): 0.3345
Center/corner ratio: 2.6×

The mask is built entirely from the training data — no manual design, no geometric assumptions. By averaging all images, pixels that consistently carry digit information (the center) stay bright. Pixels dominated by neighboring digits or background (the edges) become dimmer.

The center pixel gets a weight of 0.86 while corner pixels get only 0.33 — the mask is 2.6× stronger in the center. When multiplied against input images, edge information is dampened before the network ever sees it.
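A sketch of steps 2 and 3 of the mask construction, plus the multiplication, on two invented 3×3 images. It assumes "normalize to 0.0–1.0" means min-max scaling of the mean image, which is one plausible reading. The background-normalisation step is left out because the text does not say how inverted images were detected.

```python
def build_mask(images):
    """Average all images pixel by pixel, then min-max scale the result to 0.0-1.0."""
    n = len(images)
    height, width = len(images[0]), len(images[0][0])
    mean = [[sum(img[r][c] for img in images) / n for c in range(width)]
            for r in range(height)]
    lo = min(min(row) for row in mean)
    hi = max(max(row) for row in mean)
    return [[(v - lo) / (hi - lo) for v in row] for row in mean]

def apply_mask(image, mask):
    """Multiply each pixel by the mask weight at the same position."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

# Hypothetical, already-normalised 3x3 images with a bright center
toy_images = [
    [[0.1, 0.3, 0.1], [0.3, 0.9, 0.3], [0.1, 0.3, 0.1]],
    [[0.3, 0.5, 0.3], [0.5, 0.7, 0.5], [0.3, 0.5, 0.3]],
]
mask = build_mask(toy_images)
masked = apply_mask(toy_images[0], mask)

print([round(v, 2) for v in mask[0]])  # [0.0, 0.33, 0.0]
print([round(v, 2) for v in mask[1]])  # [0.33, 1.0, 0.33]
```

With real data the weights are less extreme than in this toy case: the mask described above keeps 0.33 at the corner and 0.86 at the center.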

The Result — Masking Made Things Worse

Raw inputs: 92.22% — Masked inputs: 87.11%

Side-by-side accuracy comparison: raw vs masked inputs

CNN Model 2, raw inputs: 92.22%
CNN Model 2, masked inputs: 87.11%
Delta: −5.11 percentage points
Conclusion: the mask hurts performance.

The mask made things worse. Three reasons why:

1. It dims real signal. The mean image shows where digits appear on average, but individual digits still vary in position and size, and the average itself was taken after 69.4% of the images had been background-inverted. The mask therefore dims pixels that sometimes carry genuine information.

2. CNN Model 2 already learned this. Getting 92.22% on raw inputs means the conv filters independently developed center-focus during training. The mask makes explicit what the CNN already learned — just less precisely.

3. Preprocessing is fixed. Every image is multiplied by the same mask, so edge pixels are attenuated whether or not they matter for that image. A CNN trained on raw inputs can choose to ignore edge pixels case-by-case; the masked CNN never sees them at full strength.

Takeaway: for this architecture and dataset, the CNN is a better spatial filter than the fixed mask we derived from the data. Raw inputs win.

Chapter 9

Full Model Comparison

All four models, side by side.

Model Comparison · Test Accuracy

All models ranked by test accuracy

Test accuracy across all four models

Model	Architecture	Train Acc	Val Acc	Test Acc	Key Characteristic
ANN Model 1	64→32→10	64.02%	63.86%	63.57%	Shallow baseline — no spatial awareness
ANN Model 2	256→128→64→32 + Dropout + BN	77.59%	79.73%	77.33%	Deeper — still no spatial structure
CNN Model 1	2 conv blocks	99.08%	95.17%	87.30%	Spatial features work — overfits without Dropout
CNN Model 2	4 conv blocks + BN + Dropout	96.26%	96.88%	92.22%	Best — depth + regularisation + clean gap

Chapter 10

Error Analysis — Understanding the Mistakes

Metrics tell us how many errors. Inspecting the actual wrong predictions tells us why — which is far more actionable.

Error Analysis · Misclassified Images

Misclassified images — what went wrong?

Sample predictions — correct ones and the one error in this batch (True: 2, Pred: 7, Confidence: 54%)

A selection of misclassified images — showing true label vs predicted label vs model confidence

Total misclassified: 1,401 / 18,000 (7.8% error rate)
Correctly classified: 16,599 / 18,000 (92.2%)
Sample error: True=2, Pred=7, Confidence=54% (low confidence = model was unsure, which is honest)

Of the 1,401 errors, most fall into two categories:

Genuinely ambiguous images — digits that look visually similar in real photos (1 vs 7, 3 vs 8, 5 vs 6). Even a human might struggle with some of these. The model's confusion is appropriate.

High-confidence wrong predictions — these are the most concerning. A model that predicts "1" with 92% confidence but the true label is "4" is being overconfident. These cases are worth investigating because they represent a failure of calibration, not just capacity.

The error analysis is not just a performance metric — it's a debugging tool that tells us what kinds of images to add or augment in future training runs to close specific gaps.
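One way to surface the high-confidence mistakes is to filter the softmax outputs by predicted class and confidence. The sketch below uses invented 3-class probabilities to stay short; the function name and the 0.9 threshold are illustrative choices, not values from the notebook.

```python
def confident_errors(probabilities, true_labels, threshold=0.9):
    """Return (index, true, predicted, confidence) for wrong predictions at or above threshold."""
    flagged = []
    for index, (probs, truth) in enumerate(zip(probabilities, true_labels)):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        confidence = probs[predicted]
        if predicted != truth and confidence >= threshold:
            flagged.append((index, truth, predicted, confidence))
    # Most confident mistakes first
    return sorted(flagged, key=lambda item: item[3], reverse=True)

# Hypothetical softmax outputs for three images
toy_probs = [
    [0.05, 0.92, 0.03],  # predicts class 1 with 92% confidence
    [0.54, 0.40, 0.06],  # predicts class 0 with 54% confidence
    [0.10, 0.10, 0.80],  # predicts class 2 with 80% confidence
]
toy_truth = [2, 1, 2]

flagged = confident_errors(toy_probs, toy_truth)
print(flagged)  # [(0, 2, 1, 0.92)]
```

The second image is also wrong, but at 54% confidence it belongs to the "unsure" category rather than the overconfident one.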

Chapter 11

Final Conclusions

Eight things this project proves, empirically

1. CNNs massively outperform ANNs (+14.89 pts on test accuracy). ANNs flatten the image and throw away all spatial structure before processing. CNNs keep it, using filters that detect edges, strokes, and curves. That difference alone accounts for most of the accuracy gap.

2. Deeper models with regularisation consistently win. In both families, the deeper regularised model beat the shallower one. Depth gives capacity; Dropout and BatchNorm make sure that capacity generalises rather than memorises.

3. CNN Model 1 showed exactly why Dropout in the dense head matters. Without it: 99.08% train, 87.30% test — an 11.78 point gap. Textbook overfitting. Adding Dropout(0.5) in Model 2's dense head cut that gap to 4.04 points.

4. LeakyReLU matters in conv layers. With standard ReLU, a filter whose pre-activations stay negative gets zero gradient and can stop learning. LeakyReLU(0.1) keeps a slope of 0.1 on the negative side, so every filter still receives a gradient during training.

5. Use the provided validation set — don't cut into training data. The H5 file came with 60,000 dedicated validation images. Using validation_split=0.2 instead would have wasted 8,400 training samples for no reason.

6. A clever preprocessing mask is still worse than a well-trained CNN. The spatial mask scored 87.11% vs 92.22% for raw inputs, a drop of 5.11 percentage points. CNN Model 2 already learns center-focus implicitly. A fixed mask treats every image the same; a neural network adapts.

7. Fewer parameters + better architecture > more parameters + poor design. CNN Model 2 (164K params) outperforms CNN Model 1 (267K params) by 4.92 points. Regularisation is more valuable than raw capacity.

8. The remaining errors make sense. 1,401 out of 18,000 wrong (7.8% error rate). Error analysis shows most wrong predictions are on genuinely hard images — the model fails on the cases where a human would also hesitate.

Best model: CNN Model 2, 92.22% test accuracy

One important caveat: this classifier reads one digit at a time from a pre-centered 32×32 crop. Reading a full house number like "128" in a real street photo would also require:

1. A digit detector to find and crop each digit
2. A sequence reconstructor to put them in order

This notebook builds the classifier (the core piece) but not the full end-to-end pipeline.
