The Anh Nguyen

Field notes - Machine learning - April 2026

A Sketch Recognition Learning Journey

Training a convolutional neural network to recognise hand-drawn sketches. Four turning points that changed everything.

Test accuracy: 90.5% -> 95.1%
Training time: 8 hours -> 84 min
Model size: 34 MB -> 938 KB

The project started with the pre-rendered 28×28 bitmaps from Quick Draw, Google's crowd-sourced dataset of 50 million sketches. These bitmaps use the same format as MNIST, the standard handwritten-digit benchmark. Over several runs, better learning rate scheduling, more training data, and class-weighted loss moved accuracy from 90.5% to 92.6%. Then it stopped. Every new attempt landed within 0.1 percentage points of the same ceiling.

Four experiments later, the model hits 95.1% accuracy on Quick Draw, 78.6% on independent real-world drawings from the TU Berlin Sketch Benchmark (a separate dataset of tablet-drawn sketches), runs in just under one millisecond per sample on CPU, and weighs 938 kilobytes on disk. The path between those numbers is the interesting part.

[Figure: Quick Draw test accuracy by milestone, axis from 88% to 96%, with the ship gate marked. Baseline (28px, 1.6 MB): 90.5%. Ceiling (28px, best): 92.6%. 128px (8 h, 34 MB): 94.6%. Sync fix (31 min, 34 MB): 95.3%. Quantized (938 KB, INT8): 95.4%. Shipped (+ domain aug): 95.1%.]
Each bar is a key milestone in the four-act journey, all measured on the Quick Draw test set. The blue bar on the right is the current shipping model after the domain-matching augmentation from Act IV: 0.28 percentage points lower on Quick Draw than the previous milestone, but +23 percentage points on independent TU Berlin sketches (78.6% vs 55.3%). The dashed red line marks the 94.5% ship gate.

I. The 28px Ceiling

Baseline -> best at 28 pixels

The obvious fix to the 92.6% plateau was to upscale the bitmaps to 64 pixels and give the model more spatial room. That attempt took eleven hours on CPU and came back worse at 90.4%. The reason was straightforward in hindsight: training data was blurry 28-pixel bitmaps bicubically stretched to 64, while a live canvas produces sharp 480-pixel drawings downscaled to 64. The model had learned textures that did not exist at inference time.

The limit was not the architecture or the training procedure. Moon and circle share the same round shape. At 28 pixels, distinguishing them requires differences measured in single pixels. That feature does not exist at that resolution, and no amount of augmentation can create information that was never there.

II. Resolution and Speed

128px native rendering, then MPS training

Quick Draw distributes the original stroke vectors in Newline Delimited JSON (NDJSON) format, not just the pre-rendered bitmaps. Rendering those strokes directly at 128×128 produces images with the same sharp line quality as a live canvas draw. The texture gap that killed the upscaling attempt disappears entirely, because training and inference now use the same rendering pipeline.
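As a rough sketch of what native rendering involves, the snippet below parses one NDJSON record and rasterizes its strokes straight onto a 128×128 canvas. It assumes the simplified Quick Draw layout, where each stroke is a pair of x and y lists in a 0-255 coordinate range. The function name `render_strokes`, the `pad` margin, the one-pixel line width, and the sample record are illustrative, and the pure-Python line drawing is a dependency-free stand-in for a library call such as cv2.polylines.

```python
import json


def render_strokes(drawing, size=128, src_extent=255.0, pad=4):
    """Rasterize strokes ([[xs], [ys]] per stroke) onto a size x size grid."""
    canvas = [[0] * size for _ in range(size)]
    # Map the assumed 0..src_extent source range into the padded canvas.
    scale = (size - 1 - 2 * pad) / src_extent
    for stroke in drawing:
        points = [(x * scale + pad, y * scale + pad)
                  for x, y in zip(stroke[0], stroke[1])]
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            # Sample each segment at sub-pixel spacing so no pixel is skipped.
            steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
            for s in range(steps + 1):
                t = s / steps
                col = round(x0 + t * (x1 - x0))
                row = round(y0 + t * (y1 - y0))
                canvas[row][col] = 255
    return canvas


# Hypothetical record: one diagonal stroke and one horizontal stroke.
line = '{"word": "apple", "drawing": [[[0, 255], [0, 255]], [[0, 255], [128, 128]]]}'
record = json.loads(line)
image = render_strokes(record["drawing"])
```

Because the strokes are scaled before they are drawn, the lines come out at full sharpness at the target size, which is the property the upscaled bitmaps lacked.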

[Figure: two renderings of the same sketch. Left: pre-rendered 28×28 bitmap upscaled to 128×128 (bilinear), a low-resolution source interpolated up, with blurry edges and fake textures. Right: NDJSON stroke coordinates rendered at 128×128 (cv2.polylines), a vector source rasterized at target size, with sharp lines that match the canvas.]
The same Quick Draw apple, two training pipelines. Pre-rasterized 28×28 bitmaps carry their original aliasing into training and emerge blurry after bilinear upscaling. Native rendering of the underlying stroke coordinates produces images with the same line quality as a live canvas draw, eliminating the texture mismatch that killed the upscaling attempt.

The jump was immediate: accuracy went from 92.6% to 94.6%, clearing the 28-pixel ceiling that had been the limit for months. The model now had enough resolution to tell moon from circle, sun from spider, and cat from dog.

The cost was training time. Eight hours on CPU per run. Moving to Apple Silicon's Metal Performance Shaders (MPS) GPU was the obvious next step, but the first attempt produced a confusing result: epochs were fast at first, then each one took progressively longer. A run projected to finish in 30 minutes stretched past five hours.

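The culprit, exposed by a torch.mps.synchronize() call between stages, was not compute: 781 per-batch .item() calls were each forcing a host/device queue drain, and the latency compounded as the run went on.

After the sync fix: consistent 150 seconds per epoch, 11 epochs with early stopping, 31 minutes total. Accuracy reached 95.3%. A 15× speedup from eliminating queue pollution, not from raw compute.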
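Calling .item() on a tensor copies the value to the host, which makes the CPU wait for queued GPU work, so doing it every batch can stall the pipeline. One common way around that is to keep running metrics on the device and read them back once per epoch. The loop below is a minimal sketch of that pattern, not the project's training code: the tiny linear model, the random batches, and all variable names are placeholders, and it falls back to CPU when MPS is unavailable.

```python
import torch

torch.manual_seed(0)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Placeholder model and data, only here to make the loop runnable.
model = torch.nn.Linear(16, 4).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(5)]

# Accumulators live on the device, so updating them needs no host round trip.
running_loss = torch.zeros((), device=device)
correct = torch.zeros((), device=device, dtype=torch.long)
seen = 0

for x, y in batches:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    logits = model(x)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()
    # No .item() here: these stay as device tensors.
    running_loss += loss.detach() * y.size(0)
    correct += (logits.argmax(dim=1) == y).sum()
    seen += y.size(0)

# One host read per metric per epoch instead of one per batch.
epoch_loss = (running_loss / seen).item()
epoch_acc = (correct.float() / seen).item()
```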

Training time

[Figure: training time per run. 128px on CPU: 8 h. Sync fix on MPS GPU: 31 min (speed breakthrough, 15×). Bottleneck on MPS GPU: 53 min. Domain aug on MPS GPU: 84 min (current shipping).]
Speed across the journey, top to bottom in chronological order. The sync fix delivers the 15× breakthrough (8 h -> 31 min). Adding the 1×1 bottleneck and then the wobble/morphology augmentation traded some of that win back, ending at 84 minutes for the current shipping run.
Test accuracy: 95.3%
Training time: 31 min
Model size: 34 MB (still a problem)

III. 37× Smaller

Architecture, then quantization

A 34 MB model is a real deployment problem on a memory-constrained server. Profiling showed why it was so large: a single fully connected layer projected the flattened 8×8×256 feature map (16,384 values) to 512 dimensions. That one layer held 95% of the model's parameters. Everything else was rounding error.

The first idea to fix it was Global Average Pooling (GAP), which collapses the spatial grid into a single vector by averaging across positions. Two separate attempts at this both failed at around 91%, a 4 percentage point drop. The reason was visible in the per-class breakdown: sun fell to 69%, moon and bird also suffered badly. These classes rely on where features appear, not just whether they exist. A sun has radial lines emanating from a center point. A bird has wings on the sides. GAP averages all positions and destroys that information.
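A toy example makes the information loss concrete. The sketch below builds two tiny feature maps that contain the same activations in different positions, then compares what Global Average Pooling and a 1×1 convolution hand on to the next layer. The 2-channel, 2×2 maps, the helper names, and the single hand-picked kernel are all illustrative; a real network would use learned weights on an 8×8×256 map.

```python
def global_average_pool(fmap):
    """Collapse each channel (a 2-D grid) to its mean: C x H x W -> C values."""
    return [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
            for channel in fmap]


def conv1x1(fmap, kernels):
    """Mix channels independently at every position: C_in x H x W -> C_out x H x W."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[c][i][j] for c, w in enumerate(kernel))
          for j in range(width)]
         for i in range(height)]
        for kernel in kernels
    ]


def flatten(fmap):
    return [v for channel in fmap for row in channel for v in row]


# Same activations, different locations.
map_a = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
map_b = [[[0, 0], [0, 1]], [[1, 0], [0, 0]]]

gap_a, gap_b = global_average_pool(map_a), global_average_pool(map_b)

kernels = [[1.0, -1.0]]  # one output channel, hand-picked for illustration
flat_a = flatten(conv1x1(map_a, kernels))
flat_b = flatten(conv1x1(map_b, kernels))

print(gap_a == gap_b)    # True: pooling cannot tell the two maps apart
print(flat_a == flat_b)  # False: the 1x1 output still encodes position
```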

[Figure: a deep feature map, 8×8 spatial × 256 channels, with an example 'sun' channel (bright center, radial spokes, dim corners), and two ways to shrink it before the dense layer. Global Average Pooling: mean of all 64 positions per channel, 256 × 1 × 1 = 256 values. "Is there a sun?" Yes, kind of. Where? Forgotten. Sun 69%, head stuck at 91%. 1×1 convolution bottleneck: 256 -> 64 channels, spatial grid untouched, 64 × 8 × 8 = 4,096 values. Spatial pattern kept, head hits 95.4%.]
Same input feature map, two ways to feed it to the dense layer. GAP averages all 64 positions per channel into one number, throwing away the spatial pattern that classes like sun (radial), moon (crescent), and bird (lateral wings) depend on. The 1×1 bottleneck only shrinks the channel dimension - the 8×8 grid passes through untouched, so "where each feature fires" is preserved. Both methods deliver a fixed-size vector to the FC layer; only the bottleneck's vector still carries spatial information.

A 1×1 convolution bottleneck worked where GAP failed. It reduces the 256 feature channels down to 64 while keeping the 8×8 spatial grid completely intact. The fully connected layer then sees 4,096 values instead of 16,384. Parameter count fell from 8.8 million to 933,000. File size went from 34 MB to 3.75 MB. Accuracy stayed at 95.4%.
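The parameter counts can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check under several assumptions that the text does not state: 3×3 kernels with biases, one batch-norm layer (scale and shift) per convolutional block, a single-channel input, and a hidden width of 512 in the original head versus 128 in the bottleneck head (the width listed in the final architecture note). Under those assumptions the totals land on the reported figures.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases for a k x k convolution."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases for a fully connected layer."""
    return n_in * n_out + n_out


def batchnorm_params(channels):
    """Learnable scale and shift per channel."""
    return 2 * channels


NUM_CLASSES = 25
widths = [1, 32, 64, 128, 256]  # assumed grayscale input, then the four blocks

backbone = sum(conv_params(a, b, 3) + batchnorm_params(b)
               for a, b in zip(widths, widths[1:]))

# Original head: flatten 8x8x256 -> 512 -> classes.
big_fc = linear_params(8 * 8 * 256, 512)
before = backbone + big_fc + linear_params(512, NUM_CLASSES)

# Bottleneck head: 1x1 conv 256 -> 64, flatten 8x8x64 -> 128 -> classes.
after = (backbone
         + conv_params(256, 64, 1)
         + linear_params(8 * 8 * 64, 128)
         + linear_params(128, NUM_CLASSES))

fc_share = big_fc / before

print(before)              # 8790745, about 8.8 million
print(after)               # 932889, about 933K
print(round(fc_share, 3))  # 0.954
```

At 4 bytes per 32-bit weight, those totals come to roughly 35 MB and 3.7 MB, in line with the reported file sizes once container overhead is allowed for.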

Post-training static quantization brought it the rest of the way. The process fuses each convolution, batch normalization, and Rectified Linear Unit (ReLU) triplet into a single operation, then runs a 1,010-sample calibration pass, with worst-case classes over-represented, to learn the scale of each layer's activations. Weights convert from 32-bit floats to 8-bit integers. Result: 3.75 MB to 938 KB. Accuracy: 95.35% before quantization, 95.36% after. Quantization cost nothing.
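To show the core idea of storing weights as 8-bit integers, here is a minimal sketch of symmetric per-tensor quantization. It is a simplification, not what a framework's static quantization does internally: real pipelines also quantize activations using the calibrated ranges and typically use zero points or per-channel scales. The function names and the sample weights are illustrative.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale factor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    # Fall back to a scale of 1.0 if every value is zero.
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q)  # [50, -127, 0, 90]
```

Each weight drops from 4 bytes to 1, which is consistent with the file going from 3.75 MB to roughly a quarter of that. The rounding error per weight is bounded by half the scale, which is one reason accuracy can survive the conversion.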

Model file size

[Figure: model file size. FP32 model: 34 MB. Bottleneck FP32: 3.75 MB. INT8 quantized: 938 KB (97% smaller than original).]
Size before and after. The 34 MB FP32 baseline shrinks to 3.75 MB after the 1×1 bottleneck head, then to 938 KB after INT8 quantization - 97% smaller than the original, with no accuracy cost.
INT8 accuracy: 95.36%
Model size: 938 KB
CPU latency per sample: 1.03 ms

IV. The Test Set Was Lying

Real world validation

With 95.36% on the Quick Draw test set, the model looked finished. A spot check against tablet-drawn sketches from the TU Berlin dataset told a different story: accuracy on those independent drawings was around 55%. The same model, evaluated on a harder distribution, got nearly half of its predictions wrong.

The gap made sense. Quick Draw strokes are clean, thin, and drawn at a consistent speed with a mouse or touchpad. Real canvas sketches have visible wobble from hand tremor, varying line weight from pen pressure, and thicker strokes that sometimes bleed at corners. The model had been trained and tested on the same clean distribution. Of course it scored well on the test set.

Closing the gap meant making training data look more like real drawings. The stroke rendering pipeline was extended with three changes: stroke widths randomised between 3 and 8 pixels to cover the 7-pixel canvas brush; a cumulative Gaussian wobble applied perpendicular to each stroke tangent to simulate hand tremor; and a small random morphological dilation or erosion as a proxy for pen pressure variation.
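One possible reading of the wobble step is a random walk of Gaussian increments, applied along the normal of each stroke point. The sketch below implements that reading together with the width and morphology sampling. It is an interpretation, not the project's code: the function names are hypothetical, the tangent is estimated from neighbouring points, and the defaults (σ of 1.2 px, morphology with probability 0.3, an even dilate/erode split) are taken from the figure annotations.

```python
import math
import random


def wobble_stroke(xs, ys, sigma=1.2, rng=None):
    """Displace each point along the stroke normal by a cumulative Gaussian drift."""
    rng = rng or random.Random()
    out_x, out_y = [], []
    drift = 0.0
    last = len(xs) - 1
    for i in range(len(xs)):
        # Estimate the tangent from the neighbouring points.
        prev_i, next_i = max(i - 1, 0), min(i + 1, last)
        tx, ty = xs[next_i] - xs[prev_i], ys[next_i] - ys[prev_i]
        norm = math.hypot(tx, ty)
        drift += rng.gauss(0.0, sigma)  # cumulative: each step adds to the last
        if norm == 0.0:
            out_x.append(float(xs[i]))
            out_y.append(float(ys[i]))
            continue
        # Unit normal is the tangent rotated by 90 degrees.
        nx, ny = -ty / norm, tx / norm
        out_x.append(xs[i] + drift * nx)
        out_y.append(ys[i] + drift * ny)
    return out_x, out_y


def sample_render_params(rng, morph_prob=0.3):
    """Pick a stroke width in [3, 8] and an optional morphology operation."""
    width = rng.randint(3, 8)
    morph = rng.choice(["dilate", "erode"]) if rng.random() < morph_prob else None
    return width, morph


rng = random.Random(0)
xs, ys = list(range(10)), [10] * 10  # a horizontal stroke
wx, wy = wobble_stroke(xs, ys, rng=rng)
width, morph = sample_render_params(rng)
```

For a horizontal stroke the normal points straight up or down, so only the y coordinates move, which matches the intent of a tremor that does not stretch the stroke along its own direction.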

[Figure: four renderings of one sketch. Quick Draw (clean): as-is from NDJSON, single stroke width; too tidy, doesn't match real canvas. + Wobble: cumulative Gaussian, σ=1.2 px; simulates hand tremor. + Morphology: ±1 px dilate / erode (p=0.3); simulates pen-pressure variation. + Wobble + morphology: current Exp H training data; closes the canvas distribution gap.]
The same Quick Draw apple under each step of the Exp H render-time augmentation. Clean strokes (top-left) are too tidy to match what users actually draw. Wobble (top-right) adds the natural micro-shake of a hand. Morphology (bottom-left, dilation shown; the real pipeline picks dilate or erode 50/50) approximates pen-pressure variation. Combined (bottom-right) is what the model now trains on - and it's why Exp H gained 23 percentage points on TU Berlin sketches even though it lost 0.28pp on the cleaner Quick Draw test set.

On the Quick Draw test set, accuracy dropped from 95.36% to 95.08%. That 0.28 percentage point drop looked like a regression. On 1,760 independent tablet-drawn sketches from TU Berlin, accuracy went from 55.3% to 78.6%, a 23 percentage point gain. Dog improved by 50 points. Butterfly by 42 points. Guitar by 42 points. The model that appeared to regress was the model that actually worked in the real world.

The trade-off looks asymmetric at the confusion-pair level. Wobble helped pairs where the distinguishing feature was structurally robust to noise, and slightly hurt pairs where the feature was already fragile at 128 px:

Confusion pair (Quick Draw test set) | Exp G | Exp H | Δ
sun -> spider                        | 138   | 98    | −40 (−29%)
moon -> circle                       | 227   | 239   | +12 (+5%)

Sun's radial rays still read as radial when jittered, so wobble exposed the model to more variants of the same structure. Moon's crescent at 128 px was already a borderline case against a circle; wobble pushed a few more samples across that line. The Quick Draw dip is the net of these opposing shifts - paid on some pairs to win on the broader distribution.
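Confusion-pair counts like these are straightforward to compute from paired label lists. The sketch below shows one way to do it and then reproduces the deltas from the two rows above. The helper names and the toy label lists are illustrative; only the four counts at the end come from the table.

```python
from collections import Counter


def confusion_pairs(y_true, y_pred):
    """Count (true, predicted) pairs for misclassified samples only."""
    return Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)


def pair_delta(before, after, pair):
    """Absolute and relative change in one confusion pair between two runs."""
    diff = after[pair] - before[pair]
    return diff, 100.0 * diff / before[pair]


# Toy labels showing how the counts are built.
y_true = ["sun", "sun", "moon", "moon", "moon"]
pred_g = ["spider", "spider", "circle", "moon", "moon"]
toy_g = confusion_pairs(y_true, pred_g)

# Counts from the table above.
exp_g = Counter({("sun", "spider"): 138, ("moon", "circle"): 227})
exp_h = Counter({("sun", "spider"): 98, ("moon", "circle"): 239})

sun_diff, sun_pct = pair_delta(exp_g, exp_h, ("sun", "spider"))
moon_diff, moon_pct = pair_delta(exp_g, exp_h, ("moon", "circle"))

print(sun_diff, round(sun_pct))    # -40 -29
print(moon_diff, round(moon_pct))  # 12 5
```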

Quick Draw test: 95.1%
TU Berlin: 78.6% (+23pp)
Shipped model size: 938 KB

Lessons learned

  1. Render native, not upscaled. Upscaling a low-resolution training image creates a texture gap between blurry training data and sharp inference inputs. The right fix is to render at the target resolution from the original source, in this case raw stroke coordinates, not to interpolate bitmaps that were never high quality to begin with.
  2. Profile before optimising. The GPU bottleneck was not the 8.4 million parameter matrix multiplication that was blamed going in. It was 781 per-batch .item() calls forcing host/device queue drains, with latency compounding nonlinearly. A single torch.mps.synchronize() call between stages exposed it. The wrong diagnosis would have led to the wrong fix and cost weeks.
  3. Global Average Pooling loses location. Global Average Pooling collapses the entire spatial grid into a single vector, discarding positional information. For sketch recognition, where features like the location of a wing or the center of a radial burst are the signal, that information loss is fatal. A 1×1 convolution bottleneck shrinks channel depth without touching the spatial grid, and it works.
  4. Your test set might be the wrong metric. A clean benchmark hides generalization gaps when training and evaluation share the same distribution. The augmentation that lost 0.28 percentage points on Quick Draw bought 23 percentage points on independent real-world drawings. External validation on a different dataset is the only honest measure of whether a model is actually ready.

The model in action

Two interaction modes are live in the demo app. The first lets you draw with mouse or finger on a canvas and watch the top-3 predictions update in real time. The second uses the camera: point at a sketch on paper and the same model classifies it from a still frame.

Canvas mode: sketch in the browser and watch the model predict in real time.
Camera mode: point at a hand-drawn sketch and get a live prediction.

Open the live app ->

Final scorecard, shipping model

All gates pass

Check                                    | Gate                 | Actual
Overall accuracy (8-bit integer weights) | ≥ 94.5%              | 95.08%
Worst single class accuracy              | ≥ 82.5%              | 84.45% (moon)
Train and inference accuracy gap         | ≤ 1%                 | 0.0%
Accuracy loss from quantization          | ≤ 1pp per class      | −0.75pp (dog)
CPU inference latency                    | < 50 ms              | 0.98 ms per sample
Model file size                          | < 1 MB               | 938 KB
External validation vs previous model    | improve on TU Berlin | 78.58% (+23.3pp)
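Gates like these are easy to encode so that a release script can refuse to ship when any of them fails. The sketch below is one hypothetical way to express the scorecard, not the project's actual tooling: the tuple layout and the `failed_gates` helper are illustrative, quantization loss is written as a positive number of points lost, 1 MB is taken as 1,000 KB, and the previous TU Berlin score of 55.3% serves as the threshold for the external check.

```python
import operator

# (check, comparison, threshold, actual), with values from the scorecard.
GATES = [
    ("overall accuracy, INT8 (%)", operator.ge, 94.5, 95.08),
    ("worst single class accuracy (%)", operator.ge, 82.5, 84.45),
    ("train/inference accuracy gap (%)", operator.le, 1.0, 0.0),
    ("worst per-class quantization loss (pp)", operator.le, 1.0, 0.75),
    ("CPU latency (ms per sample)", operator.lt, 50.0, 0.98),
    ("model file size (KB)", operator.lt, 1000.0, 938.0),
    ("TU Berlin accuracy vs previous model (%)", operator.gt, 55.3, 78.58),
]


def failed_gates(gates):
    """Return the names of all checks whose actual value misses its gate."""
    return [name for name, compare, threshold, actual in gates
            if not compare(actual, threshold)]


failures = failed_gates(GATES)
print("all gates pass" if not failures else failures)
```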

Architecture: SketchNet4PoolBottleneck. Four convolutional blocks (32, 64, 128, 256 filters) on 128×128 input, a 1×1 convolution bottleneck reducing 256 channels to 64 while preserving the 8×8 spatial grid, flatten to 4,096, Linear(128) with Dropout(0.3), Linear(25 classes). 933K parameters in 32-bit floating point. Quantized to 8-bit integers using post-training static quantization. All training from scratch. No pretrained weights, no fine-tuning.