The Anh Nguyen, Builder & Technical Lead

Case studies

SketchNet

CNN for hand-drawn sketches - 25 Quick Draw classes - trained from scratch

2026

Convolutional neural network for finger-drawn sketches. Four experiments to close the training-inference gap, most notably switching from pre-rendered 28×28 bitmaps to native stroke-vector rendering at 128×128. Built an X-Ray visualization to see what each layer learns. Live input via finger, mouse, or webcam through a Streamlit app.
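A minimal sketch of the stroke-to-raster idea, not the project's actual renderer. It assumes strokes arrive in the Quick Draw simplified layout (a list of [xs, ys] pairs with coordinates in 0..255); the function name render_strokes and the one-pixel line width are illustrative choices.

```python
import numpy as np

def render_strokes(strokes, size=128, src_extent=255.0):
    """Rasterize stroke vectors onto a size x size float canvas (hypothetical helper)."""
    canvas = np.zeros((size, size), dtype=np.float32)
    scale = (size - 1) / src_extent  # map 0..src_extent onto 0..size-1
    for xs, ys in strokes:
        xs = np.asarray(xs, dtype=np.float32) * scale
        ys = np.asarray(ys, dtype=np.float32) * scale
        if len(xs) == 1:
            # A single-point stroke is drawn as one pixel.
            canvas[int(round(float(ys[0]))), int(round(float(xs[0])))] = 1.0
            continue
        # Draw each segment by sampling enough points to leave no gaps.
        for x0, y0, x1, y1 in zip(xs[:-1], ys[:-1], xs[1:], ys[1:]):
            n = int(max(abs(x1 - x0), abs(y1 - y0))) + 2
            px = np.rint(np.linspace(x0, x1, n)).astype(int)
            py = np.rint(np.linspace(y0, y1, n)).astype(int)
            canvas[py, px] = 1.0
    return canvas

# Example: a square drawn as one closed stroke.
square = [[[0, 255, 255, 0, 0], [0, 0, 255, 255, 0]]]
image = render_strokes(square)
```

Rendering from vectors at the target resolution means the same function can draw both training data and live input, which is one plausible way to narrow a training-inference gap.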

95.1%

test accuracy

938 KB

model size

84 min

training time

1 ms

inference / sample

PyTorch NumPy MPS Streamlit MediaPipe


DIA Risk Screener

Five algorithms, one molecule, the spread is the trust signal

2026

Five algorithms scoring the same molecule for drug-induced autoimmunity risk. The spread between their probabilities is the headline; the chemistry interpretation runs once and stays the same regardless of which model the user picks.
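As a rough illustration of the spread idea, the sketch below takes one molecule's predicted probabilities from several models and reports the gap between the highest and lowest. The model names and probability values are made up for the example and are not results from the project; probability_spread is a hypothetical helper.

```python
def probability_spread(probs):
    """Return (spread, lowest_model, highest_model) for one molecule.

    probs maps a model name to its predicted probability of risk.
    """
    low = min(probs, key=probs.get)
    high = max(probs, key=probs.get)
    return probs[high] - probs[low], low, high

# Illustrative values only, not real predictions.
example = {
    "logistic_regression": 0.22,
    "random_forest": 0.41,
    "xgboost": 0.58,
    "svm": 0.35,
    "knn": 0.70,
}
spread, lowest, highest = probability_spread(example)
print(f"spread={spread:.3f} (lowest: {lowest}, highest: {highest})")
```

A small spread suggests the models agree on the molecule; a large one is a cue to treat any single score with caution.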

5

models

0.527

max spread

0.896

best test AUC

477

compounds

Python scikit-learn XGBoost RDKit Streamlit


PCA Audio Toolkit

PCA-based audio denoising and lossy compression

2026

Two PCA pipelines on audio: lossy compression on raw waveform blocks, and denoising on STFT spectrograms. Without spectral subtraction enabled, stationary noise lands among the highest-variance components instead of the lowest, so discarding low-variance components leaves the noise in place and the denoiser does almost nothing.
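A minimal sketch of the block-based compression pipeline, under stated assumptions: the waveform is a 1-D NumPy array, blocks do not overlap, and PCA is computed with a plain SVD rather than scikit-learn. The names pca_compress and pca_decompress, the block size, and the component count are illustrative, not the toolkit's actual API or settings.

```python
import numpy as np

def pca_compress(signal, block_size=64, n_components=8):
    """Project fixed-size waveform blocks onto their top principal components."""
    n_blocks = len(signal) // block_size
    # Trailing samples that do not fill a block are dropped in this sketch.
    blocks = signal[: n_blocks * block_size].reshape(n_blocks, block_size)
    mean = blocks.mean(axis=0)
    centered = blocks - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    codes = centered @ basis.T  # shape: (n_blocks, n_components)
    return codes, basis, mean

def pca_decompress(codes, basis, mean):
    """Rebuild an approximate waveform from the stored codes."""
    return (codes @ basis + mean).ravel()

# Example: a pure tone occupies a two-dimensional subspace of block space,
# so two components are enough to reconstruct it.
t = np.arange(4096)
tone = np.sin(2 * np.pi * 440 * t / 16000)
codes, basis, mean = pca_compress(tone, block_size=64, n_components=2)
restored = pca_decompress(codes, basis, mean)
```

The stored payload is the codes plus the basis and mean, so the saving grows with the number of blocks; real audio needs more components than a pure tone and the reconstruction is lossy.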

Python NumPy SciPy scikit-learn Streamlit


Cutting Spark shuffle cost

Wide vs narrow transformations on billion-record pipelines

iPrice Group

Spark jobs that processed billions of records were dominated by shuffle cost on EMR. Pushed filters and projections before joins so data shrank before any wide transformation. Used broadcast joins for small dimension tables to skip shuffle entirely. Pre-partitioned hot datasets so repeated joins reused the layout instead of reshuffling.

Net result: faster jobs, smaller clusters, lower bills.

PySpark AWS EMR Apache Airflow Parquet S3

Expertise

Languages

Python TypeScript JavaScript PHP

AI / ML

PyTorch scikit-learn XGBoost CNNs LLM applications

Data Engineering

ETL Pipelines (Extract, Transform, Load) Apache Airflow PySpark

Cloud

Athena EMR S3 SQS ElastiCache Lambda RDS Azure Data Warehouse

Databases

Elasticsearch MySQL PostgreSQL SQL Server Cassandra

Web & API

Laravel Symfony RESTful API GraphQL

DevOps & Infra

Docker Terraform ELK Stack

AWS

AWS Certified Solutions Architect - Associate

July 2022

About

15+ years engineering at scale. Pipelines, EMR clusters, Airflow DAGs, and the unglamorous work of cutting cloud costs from the inside. I profile actual bottlenecks before changing anything, and prefer durable fixes over clever ones.

Built teams too: grew my team at iPrice from 3 to 7 and mentored two engineers who were later promoted.

Building AI applications now: LLM-powered automation in production, and learning ML by building (see SketchNet, DIA Risk Screener). Outside work, I travel and read.

"Great code is minimal to no code."

Focus: AI engineering, agentic systems, data engineering, cloud infrastructure, backend systems, engineering leadership

Primary stack: Python, TypeScript, AWS, Airflow, Elasticsearch

Certification: AWS Certified Solutions Architect - Associate (2022)

GitHub: github.com/theanh

LinkedIn: linkedin.com/in/the-anh