Field notes - Machine learning - April 2026

A Sketch Recognition Learning Journey

Training a convolutional neural network to recognise hand-drawn sketches. Four turning points that changed everything.

By The Anh Nguyen - CNN from scratch - 25 classes - Quick Draw + TU Berlin

Test accuracy: 90.5% -> 95.1%
Training time: 8 hours -> 84 min
Model size: 34 MB -> 938 KB

The project started with Quick Draw's pre-rendered 28×28 bitmaps, the same format as MNIST, the standard handwritten-digit benchmark. (Quick Draw is Google's crowd-sourced dataset of 50 million sketches.) Over several runs, better learning-rate scheduling, more training data, and a class-weighted loss moved accuracy from 90.5% to 92.6%. Then progress stopped: every new attempt landed within 0.1 percentage points of the same ceiling.

Four experiments later, the model hits 95.1% accuracy on Quick Draw, 78.6% on independent real-world drawings from the TU Berlin Sketch Benchmark (a separate dataset of tablet-drawn sketches), runs in just under one millisecond per sample on CPU, and weighs 938 kilobytes on disk. The path between those numbers is the interesting part.

Each bar is a key milestone in the four-act journey, all measured on the Quick Draw test set. The blue bar on the right is the current shipping model after the domain-matching augmentation from Act IV: 0.28 percentage points lower on Quick Draw than the previous milestone, but +23 percentage points on independent TU Berlin sketches (78.6% vs 55.3%). The dashed red line marks the 94.5% ship gate.

I. The 28px Ceiling

Baseline -> best at 28 pixels

The obvious fix to the 92.6% plateau was to upscale the bitmaps to 64 pixels and give the model more spatial room. That attempt took eleven hours on CPU and came back worse at 90.4%. The reason was straightforward in hindsight: training data was blurry 28-pixel bitmaps bicubically stretched to 64, while a live canvas produces sharp 480-pixel drawings downscaled to 64. The model had learned textures that did not exist at inference time.

The limit was not the architecture or the training procedure. Moon and circle share the same round outline, and at 28 pixels telling them apart comes down to differences measured in single pixels. That detail does not survive at that resolution, and no amount of augmentation can create information that was never there.

II. Resolution and Speed

128px native rendering, then MPS training

Quick Draw distributes the original stroke vectors in Newline Delimited JSON (NDJSON) format, not just the pre-rendered bitmaps. Rendering those strokes directly at 128×128 produces images with the same sharp line quality as a live canvas draw. The texture gap that killed the upscaling attempt disappears entirely, because training and inference now use the same rendering pipeline.
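A minimal sketch of native rendering, assuming the simplified Quick Draw NDJSON layout, where each record's `drawing` field is a list of strokes and each stroke holds a list of x values and a list of y values in a 0-255 box. The function name `render_strokes`, the one-pixel line width, and the plain nested-list image are illustrative assumptions, not the project's actual pipeline.

```python
import json


def render_strokes(ndjson_line, size=128, source_extent=255):
    """Rasterise one Quick Draw record onto a size x size grid of 0/1 pixels."""
    record = json.loads(ndjson_line)
    # Map the assumed 0..source_extent coordinate box onto 0..size-1.
    scale = (size - 1) / source_extent
    grid = [[0] * size for _ in range(size)]

    for stroke in record["drawing"]:
        xs, ys = stroke[0], stroke[1]
        points = [(round(x * scale), round(y * scale)) for x, y in zip(xs, ys)]
        if len(points) == 1:
            x, y = points[0]
            grid[y][x] = 1
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            # Sample the segment densely enough that no pixel is skipped.
            steps = max(abs(x1 - x0), abs(y1 - y0), 1)
            for i in range(steps + 1):
                x = round(x0 + (x1 - x0) * i / steps)
                y = round(y0 + (y1 - y0) * i / steps)
                grid[y][x] = 1
    return record["word"], grid


# Hypothetical record: one diagonal stroke and one horizontal stroke.
sample = '{"word": "apple", "drawing": [[[0, 255], [0, 255]], [[0, 255], [128, 128]]]}'
word, image = render_strokes(sample)
```

Because the strokes are scaled before they are rasterised, the lines come out crisp at the target size. Nothing is interpolated from a smaller bitmap.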

The same Quick Draw apple, two training pipelines. Pre-rasterized 28×28 bitmaps carry their original aliasing into training and emerge blurry after upscaling. Native rendering of the underlying stroke coordinates produces images with the same line quality as a live canvas draw, eliminating the texture mismatch that killed the upscaling attempt.

The jump was immediate: accuracy went from 92.6% to 94.6%, clearing the 28-pixel ceiling that had been the limit for months. The model now had enough resolution to tell moon from circle, sun from spider, and cat from dog.

The cost was training time: eight hours on CPU per run. Moving to the Apple Silicon GPU through PyTorch's Metal Performance Shaders (MPS) backend was the obvious next step, but the first attempt produced a confusing result: epochs were fast at first, then each one took progressively longer. A run projected to finish in 30 minutes stretched past five hours.

Unexpected finding

The original diagnosis blamed the first fully connected layer's 8.4 million parameter matrix multiplication as the bottleneck. Profiling with explicit torch.mps.synchronize() calls between stages revealed the real cause: per-batch .item() calls were forcing 781 host/device queue drains per epoch, and the drain time grew nonlinearly as session state accumulated. The fix was to accumulate the loss and the correct-prediction count as device-side tensors throughout the epoch, then call .item() exactly once at the end. The wrong hypothesis would have led to the wrong fix entirely.

Worth noting: applying this fix while staying on CPU would have changed nothing. On CPU, .item() is synchronous by nature: there is no separate device queue to drain, and the data is already in host memory. The nonlinear slowdown was a GPU-specific problem, so the sync fix and the GPU switch only work together.
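The device-side accumulation described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's training loop: `run_epoch` and its arguments are hypothetical names, and the final read-back uses a single `.tolist()` on a stacked tensor so that both metrics cross to the host in one transfer.

```python
import torch
from torch import nn


def run_epoch(model, batches, loss_fn, optimizer, device):
    """Train for one epoch, reading metrics back to the host only once."""
    model.train()
    # Running totals live on the device, so updating them never forces a sync.
    loss_sum = torch.zeros((), device=device)
    correct = torch.zeros((), device=device)
    seen = 0

    for inputs, targets in batches:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        logits = model(inputs)
        loss = loss_fn(logits, targets)
        loss.backward()
        optimizer.step()

        # No .item() here: these stay as queued device operations.
        loss_sum += loss.detach() * targets.size(0)
        correct += (logits.argmax(dim=1) == targets).sum()
        seen += targets.size(0)

    # Single host/device transfer for the whole epoch.
    mean_loss, accuracy = (torch.stack([loss_sum, correct]) / seen).tolist()
    return mean_loss, accuracy
```

A caller would typically pick the device with something like `torch.device("mps" if torch.backends.mps.is_available() else "cpu")`. On CPU the function still works, but as noted above it brings no speed benefit there.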

After the sync fix: a consistent 150 seconds per epoch, 11 epochs with early stopping, 31 minutes total. Accuracy reached 95.3%. That is a 15× speedup over the 8-hour CPU runs, and it came from eliminating queue pollution, not from raw compute alone.

Training time

Speed across the journey, top to bottom in chronological order. The sync fix delivers the 15× breakthrough (8 h -> 31 min). Adding the 1×1 bottleneck and then the wobble/morphology augmentation traded some of that win back, ending at 84 minutes for the current shipping run.

Test accuracy: 95.3%
Training time: 31 min
Model size: 34 MB (still a problem)

III. 37× Smaller

Architecture, then quantization

A 34 MB model is a real deployment problem on a memory-constrained server. Profiling showed why it was so large: a single fully connected layer projected the flattened 8×8×256 feature map (16,384 values) to 512 dimensions. That one layer held 95% of all the model parameters. Everything else was rounding error.

The first idea to fix it was Global Average Pooling (GAP), which collapses the spatial grid into a single vector by averaging across positions. Two separate attempts at this both failed at around 91%, a 4 percentage point drop. The reason was visible in the per-class breakdown: sun fell to 69%, moon and bird also suffered badly. These classes rely on where features appear, not just whether they exist. A sun has radial lines emanating from a center point. A bird has wings on the sides. GAP averages all positions and destroys that information.

Same input feature map, two ways to feed it to the dense layer. GAP averages all 64 positions per channel into one number, throwing away the spatial pattern that classes like sun (radial), moon (crescent), and bird (lateral wings) depend on. The 1×1 bottleneck only shrinks the channel dimension - the 8×8 grid passes through untouched, so "where each feature fires" is preserved. Both methods deliver a fixed-size vector to the FC layer; only the bottleneck's vector still carries spatial information.
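A toy illustration of the point in the caption above, using a single channel and plain Python lists instead of real feature maps. The two grids contain the same activation at different positions. All names here are made up for the example.

```python
def global_average_pool(feature_map):
    """Average every position of one channel into a single number."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def flatten(feature_map):
    """Keep every position, in row-major order."""
    return [v for row in feature_map for v in row]


def single_activation(row, col, size=8):
    """Build a size x size grid that is zero except at one position."""
    grid = [[0.0] * size for _ in range(size)]
    grid[row][col] = 1.0
    return grid


top_left = single_activation(0, 0)  # feature fires in a corner
centre = single_activation(4, 4)    # same feature fires mid-grid

print(global_average_pool(top_left) == global_average_pool(centre))  # True
print(flatten(top_left) == flatten(centre))                          # False
```

After pooling, the two grids are indistinguishable. After flattening, they are not. A 1×1 convolution only mixes channels at each position, so the flattened vector it feeds to the dense layer still distinguishes the two cases.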

A 1×1 convolution bottleneck worked where GAP failed. It reduces the 256 feature channels to 64 while keeping the 8×8 spatial grid completely intact, so the fully connected layer sees 4,096 values instead of 16,384. With that layer also narrowed from 512 to 128 units, parameter count fell from 8.8 million to 933,000 and file size went from 34 MB to 3.75 MB. Accuracy held at roughly 95.4%.

The arithmetic

The 128×128 input image carries 16,384 pixels. Four conv-pool blocks rearrange those into an 8 × 8 × 256 = 16,384-value deep feature map - same total, just redistributed from many spatial positions and one channel into few positions and many channels.

Without a bottleneck, FC1 has to flatten that 8×8×256 grid and project it down to 512 units: 16,384 × 512 ≈ 8.4 million parameters in this single layer alone, more than 95% of the entire model. Everything else - four conv blocks, batch norms, FC2 - was rounding error.

The 1×1 bottleneck inserts one cheap layer (256 × 64 ≈ 16K parameters) that compresses channel depth before the flatten. FC1, narrowed from 512 to 128 units at the same time, now projects 4,096 -> 128: 4,096 × 128 ≈ 524K parameters - a 16× cut on the layer that dominated the budget (4× from fewer inputs, 4× from fewer outputs). The conv stack is unchanged. Total model: 8.8M -> 933K parameters, a 9.4× shrink from one architectural change to the head.
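The arithmetic can be checked with a few lines of Python. The article does not state kernel sizes or input channels, so this sketch assumes 3×3 convolutions with biases, a single-channel input, affine batch norm after each conv block, and no batch norm on the bottleneck. Those assumptions happen to reproduce the quoted totals.

```python
FILTERS = [32, 64, 128, 256]


def conv_params(c_in, c_out, kernel):
    return c_in * c_out * kernel * kernel + c_out  # weights + biases


def linear_params(n_in, n_out):
    return n_in * n_out + n_out  # weights + biases


def conv_stack_params(in_channels=1, kernel=3):
    """Four conv blocks, each followed by batch norm (scale + shift)."""
    total, c_in = 0, in_channels
    for c_out in FILTERS:
        total += conv_params(c_in, c_out, kernel) + 2 * c_out
        c_in = c_out
    return total


def baseline_total(classes=25):
    # Flatten 8x8x256 straight into a 512-unit FC1.
    return (conv_stack_params()
            + linear_params(8 * 8 * 256, 512)
            + linear_params(512, classes))


def bottleneck_total(classes=25):
    # 1x1 conv 256 -> 64, then flatten 8x8x64 into a 128-unit FC1.
    return (conv_stack_params()
            + conv_params(256, 64, 1)
            + linear_params(8 * 8 * 64, 128)
            + linear_params(128, classes))


print(baseline_total())                                  # 8790745
print(bottleneck_total())                                # 932889
print(round(baseline_total() / bottleneck_total(), 1))   # 9.4
```

At four bytes per 32-bit parameter, these counts also line up with the reported file sizes of roughly 34 MB and 3.75 MB.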

Post-training static quantization brought it the rest of the way. The process fuses each convolution, batch normalization, and ReLU (rectified linear unit) triplet into a single operation, then runs a 1,010-sample calibration pass, with worst-case classes over-represented, to learn the scale of each layer's activations. Weights convert from 32-bit floats to 8-bit integers. Result: 3.75 MB to 938 KB. Accuracy: 95.35% before quantization, 95.36% after. Quantization cost nothing in overall accuracy.
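To make the role of the calibration pass concrete, here is a hand-rolled sketch of affine 8-bit quantization. It is not PyTorch's quantization API and not the project's code: it only shows how a scale and zero point might be derived from observed values and how much precision the round trip loses. The function names and sample values are invented for illustration.

```python
def calibrate(samples, num_levels=256):
    """Derive an affine scale and zero point from observed values."""
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(samples), 0.0)
    hi = max(max(samples), 0.0)
    scale = (hi - lo) / (num_levels - 1) or 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


def quantize(value, scale, zero_point):
    """Map a float to an unsigned 8-bit integer."""
    q = round(value / scale) + zero_point
    return max(0, min(255, q))  # clamp to the 8-bit range


def dequantize(q, scale, zero_point):
    """Map the 8-bit integer back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical activations seen during a calibration pass.
scale, zero_point = calibrate([-1.0, 0.0, 0.5, 2.0, 3.0])
restored = dequantize(quantize(1.7, scale, zero_point), scale, zero_point)
```

Values inside the calibrated range come back within one quantization step of the original. That is why a representative calibration set matters, and it fits with over-representing the hardest classes. Storing one byte per weight instead of four is also consistent with the drop from 3.75 MB to 938 KB.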

Model file size

Size before and after. The 34 MB FP32 baseline shrinks to 3.75 MB after the 1×1 bottleneck head, then to 938 KB after INT8 quantization - 97% smaller than the original, with no accuracy cost.

INT8 accuracy: 95.36%
Model size: 938 KB
CPU latency per sample: 1.03 ms

IV. The Test Set Was Lying

Real world validation

With 95.36% on the Quick Draw test set, the model looked finished. A spot check against tablet-drawn sketches from the TU Berlin dataset told a different story: accuracy on those independent drawings was around 55%. The same model, evaluated on a harder distribution, got nearly half of the drawings wrong.

The gap made sense. Quick Draw strokes are clean, thin, and drawn at a consistent speed with a mouse or touchpad. Real canvas sketches have visible wobble from hand tremor, varying line weight from pen pressure, and thicker strokes that sometimes bleed at corners. The model had been trained and tested on the same clean distribution. Of course it scored well on the test set.

Closing the gap meant making training data look more like real drawings. The stroke rendering pipeline was extended with three changes: stroke widths randomised between 3 and 8 pixels to bracket the 7-pixel canvas brush, a cumulative Gaussian wobble applied perpendicular to each stroke tangent to simulate hand tremor, and a small random morphological dilation or erosion as a proxy for pen pressure variation.
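A rough sketch of the wobble step, assuming a stroke is a list of (x, y) points. Only the idea comes from the text: a cumulative (random-walk) Gaussian offset applied along the normal to the local tangent. The function names, the `sigma` value, and the choice of integer stroke widths are illustrative assumptions.

```python
import math
import random


def wobble_stroke(points, sigma=0.4, rng=None):
    """Displace each point along the stroke normal by a Gaussian random walk."""
    rng = rng or random.Random()
    wobbled, offset = [], 0.0
    last = len(points) - 1
    for i, (x, y) in enumerate(points):
        # Tangent from the neighbouring points (one-sided at the ends).
        prev_x, prev_y = points[max(i - 1, 0)]
        next_x, next_y = points[min(i + 1, last)]
        tx, ty = next_x - prev_x, next_y - prev_y
        length = math.hypot(tx, ty)
        # Cumulative noise: each step adds to the previous offset.
        offset += rng.gauss(0.0, sigma)
        if length == 0.0:
            wobbled.append((x, y))  # no defined tangent, leave the point alone
            continue
        # Unit normal is the tangent rotated by 90 degrees.
        normal_x, normal_y = -ty / length, tx / length
        wobbled.append((x + offset * normal_x, y + offset * normal_y))
    return wobbled


def random_stroke_width(rng, low=3, high=8):
    """Pick a stroke width in pixels, inclusive of both ends (assumed integer)."""
    return rng.randint(low, high)


rng = random.Random(0)
straight = [(float(x), 0.0) for x in range(0, 50, 5)]
shaky = wobble_stroke(straight, sigma=0.5, rng=rng)
```

Because the offset accumulates instead of being drawn fresh at each point, the line drifts smoothly. It does not jitter point to point, which is closer to how a hand tremor looks.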

The same Quick Draw apple under each step of the render-time augmentation used in Exp H (the augmented training run). Clean strokes (top-left) are too tidy to match what users actually draw. Wobble (top-right) adds the natural micro-shake of a hand. Morphology (bottom-left, dilation shown; the real pipeline picks dilate or erode 50/50) approximates pen-pressure variation. Combined (bottom-right) is what the model now trains on - and it's why Exp H gained 23 percentage points on TU Berlin sketches even though it lost 0.28pp on the cleaner Quick Draw test set.

On the Quick Draw test set, accuracy dropped from 95.36% to 95.08%. That 0.28 percentage point drop looked like a regression. On 1,760 independent tablet-drawn sketches from TU Berlin, accuracy went from 55.3% to 78.6%, a 23 percentage point gain. Dog improved by 50 points. Butterfly by 42 points. Guitar by 42 points. The model that appeared to regress was the model that actually worked in the real world.

The trade-off looks asymmetric at the confusion-pair level when the pre-augmentation run (Exp G) is compared with the augmented run (Exp H). Wobble helped pairs where the distinguishing feature was structurally robust to noise, and slightly hurt pairs where the feature was already fragile at 128 px:

Confusion pair (Quick Draw test set)	Exp G	Exp H	Δ
sun -> spider	138	98	−40 (−29%)
moon -> circle	227	239	+12 (+5%)

Sun's radial rays still read as radial when jittered, so wobble exposed the model to more variants of the same structure. Moon's crescent at 128 px was already a borderline case against a circle; wobble pushed a few more samples across that line. The Quick Draw dip is the net of these opposing shifts - paid on some pairs to win on the broader distribution.

Quick Draw test: 95.1%
TU Berlin: 78.6% (+23pp)
Shipped model size: 938 KB

Lessons learned

Render native, not upscaled. Upscaling a low-resolution training image creates a texture gap between blurry training data and sharp inference inputs. The right fix is to render at the target resolution from the original source, in this case raw stroke coordinates, not to interpolate bitmaps that were never high quality to begin with.
Profile before optimising. The GPU bottleneck was not the 8.4 million parameter matrix multiplication that was blamed going in. It was 781 per-batch .item() calls forcing host/device queue drains, with latency compounding nonlinearly. Explicit torch.mps.synchronize() calls between stages exposed it. The wrong diagnosis would have led to the wrong fix and cost weeks.
Global Average Pooling loses location. Global Average Pooling collapses the entire spatial grid into a single vector, discarding positional information. For sketch recognition, where features like the location of a wing or the center of a radial burst are the signal, that information loss is fatal. A 1×1 convolution bottleneck shrinks channel depth without touching the spatial grid, and it works.
Your test set might be the wrong metric. A clean benchmark hides generalization gaps when training and evaluation share the same distribution. The augmentation that lost 0.28 percentage points on Quick Draw bought 23 percentage points on independent real-world drawings. External validation on a different dataset is the only honest measure of whether a model is actually ready.

The model in action

Two interaction modes are live in the demo app. The first lets you draw with mouse or finger on a canvas and watch the top-3 predictions update in real time. The second uses the camera: point at a sketch on paper and the same model classifies it from a still frame.

Canvas drawing mode. Sketch in the browser and watch the model predict in real time.

Camera mode. Point at a hand-drawn sketch and get a live prediction.


Final scorecard, shipping model

All gates pass

Check	Gate	Actual	Pass
Overall accuracy (8-bit integer weights)	≥ 94.5%	95.08%	✓
Worst single class accuracy	≥ 82.5%	84.45% (moon)	✓
Train and inference accuracy gap	≤ 1%	0.0%	✓
Accuracy loss from quantization	≤ 1pp per class	−0.75pp (dog)	✓
CPU inference latency	< 50 ms	0.98 ms per sample	✓
Model file size	< 1 MB	938 KB	✓
External validation vs previous model	improve on TU Berlin	78.58% (+23.3pp)	✓

Architecture: SketchNet4PoolBottleneck. Four convolutional blocks (32, 64, 128, 256 filters) on 128×128 input, a 1×1 convolution bottleneck reducing 256 channels to 64 while preserving the 8×8 spatial grid, flatten to 4,096, Linear(128) with Dropout(0.3), Linear(25 classes). 933K parameters in 32-bit floating point. Quantized to 8-bit integers using post-training static quantization. All training from scratch. No pretrained weights, no fine-tuning.
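For readers who want to see the shape of the network, here is one possible PyTorch reading of the description above. The layer widths, bottleneck, dropout rate, and class count come from the text. Everything else is an assumption made for the sketch: 3×3 kernels with padding 1, one conv per block followed by batch norm, ReLU, and 2×2 max pooling, a single-channel input, and the placement of the ReLU activations.

```python
import torch
from torch import nn


class SketchNet4PoolBottleneck(nn.Module):
    """Illustrative reconstruction; details beyond the text are assumptions."""

    def __init__(self, num_classes=25):
        super().__init__()
        layers, in_channels = [], 1  # assumes single-channel sketches
        for out_channels in (32, 64, 128, 256):
            layers += [
                nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # halves the grid: 128 -> 64 -> 32 -> 16 -> 8
            ]
            in_channels = out_channels
        self.features = nn.Sequential(*layers)
        # 1x1 conv shrinks channels 256 -> 64 and leaves the 8x8 grid intact.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(256, 64, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                 # 64 x 8 x 8 = 4,096 values
            nn.Linear(64 * 8 * 8, 128),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.bottleneck(self.features(x)))


model = SketchNet4PoolBottleneck().eval()
logits = model(torch.zeros(2, 1, 128, 128))  # batch of two blank sketches
```

Under these assumptions the model has 932,889 trainable parameters, which is consistent with the 933K figure quoted above.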