Builder & Technical Lead - AI / Data Engineering - 15+ years

Engineering at scale.

I optimize data systems that handle billions of records, cut infrastructure costs, and actually last. Hands-on from product discovery through architecture and implementation to deployment.

$6K+
monthly infra savings
6B
records at peak
20M+
monthly visitors served
15+
years of experience

SketchNet

CNN for hand-drawn sketches - 25 Quick Draw classes - trained from scratch
2026

Convolutional neural network for finger-drawn sketches. Four experiments to close the training-inference gap, most notably switching from pre-rendered 28×28 bitmaps to native stroke-vector rendering at 128×128. Built an X-Ray visualization to see what each layer learns. Live input via finger, mouse, or webcam through a Streamlit app.

95.1%
test accuracy
938 KB
model size
84 min
training time
1 ms
inference / sample
PyTorch NumPy MPS Streamlit MediaPipe
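
A minimal sketch of the stroke-vector rendering idea, not the project's actual code. It assumes Quick Draw's simplified format (each stroke is a pair of x and y lists in the 0-255 range); the function name `render_strokes` and the 1-pixel line width are illustrative choices. One plausible benefit of rendering from vectors is that the same rasterizer can be applied to training data and to live input.

```python
import numpy as np

def render_strokes(strokes, size=128, src_range=255.0):
    """Rasterize strokes, each an (xs, ys) pair of point lists, into a size x size image."""
    canvas = np.zeros((size, size), dtype=np.float32)
    scale = (size - 1) / src_range  # map source coordinates onto pixel indices
    for xs, ys in strokes:
        xs = np.clip(np.asarray(xs, dtype=np.float64) * scale, 0, size - 1)
        ys = np.clip(np.asarray(ys, dtype=np.float64) * scale, 0, size - 1)
        if len(xs) == 1:
            # A single point: light one pixel.
            canvas[int(np.rint(ys[0])), int(np.rint(xs[0]))] = 1.0
            continue
        # Draw each segment by sampling it densely enough to leave no gaps.
        for x0, y0, x1, y1 in zip(xs[:-1], ys[:-1], xs[1:], ys[1:]):
            n = int(max(abs(x1 - x0), abs(y1 - y0))) + 2
            px = np.rint(np.linspace(x0, x1, n)).astype(int)
            py = np.rint(np.linspace(y0, y1, n)).astype(int)
            canvas[py, px] = 1.0
    return canvas

# Hypothetical input: one diagonal stroke from the top-left to the bottom-right corner.
img = render_strokes([([0, 255], [0, 255])])
```

The resulting array can be fed to a CNN after adding batch and channel dimensions.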

DIA Risk Screener

Five algorithms, one molecule: the spread is the trust signal
2026

Five algorithms scoring the same molecule for drug-induced autoimmunity risk. The spread between their probabilities is the headline; the chemistry interpretation runs once and stays the same regardless of which model the user picks.

5
models
0.527
max spread
0.896
best test AUC
477
compounds
Python scikit-learn XGBoost RDKit Streamlit
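
A minimal sketch of the spread calculation, assuming "spread" means the gap between the highest and lowest predicted probability for one molecule. The model names and probabilities below are made up for illustration and are not results from the project.

```python
def probability_spread(probabilities):
    """Return max minus min of the per-model risk probabilities for one molecule."""
    values = list(probabilities.values())
    return max(values) - min(values)

# Hypothetical scores from five models for a single molecule.
scores = {
    "logistic_regression": 0.31,
    "random_forest": 0.44,
    "xgboost": 0.52,
    "svm": 0.27,
    "knn": 0.60,
}

spread = probability_spread(scores)
```

A small spread means the models agree; a large one suggests the prediction for that molecule deserves less trust.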

PCA Audio Toolkit

PCA-based audio denoising and lossy compression
2026

Two PCA pipelines on audio: lossy compression on raw waveform blocks, and denoising on STFT spectrograms. Without spectral subtraction enabled, stationary noise hides at the top of the variance ranking instead of the bottom, and the denoiser does almost nothing.

Python NumPy SciPy scikit-learn Streamlit
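
A minimal sketch of PCA compression on raw waveform blocks, using NumPy's SVD rather than the project's own pipeline. The block size, component count, and test signal are assumptions chosen for illustration.

```python
import numpy as np

def pca_compress(signal, block_size=64, n_components=8):
    """Split the waveform into blocks and keep only the top principal components."""
    n_blocks = len(signal) // block_size
    blocks = signal[: n_blocks * block_size].reshape(n_blocks, block_size)
    mean = blocks.mean(axis=0)
    centered = blocks - mean
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    codes = centered @ basis.T  # per-block coefficients: this is the compressed form
    return codes, basis, mean

def pca_decompress(codes, basis, mean):
    """Rebuild an approximate waveform from coefficients, basis, and block mean."""
    return (codes @ basis + mean).reshape(-1)

# Hypothetical test signal: one second of two sine tones at 8 kHz.
rate = 8000
t = np.arange(rate) / rate
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

codes, basis, mean = pca_compress(signal)
restored = pca_decompress(codes, basis, mean)
rms_error = float(np.sqrt(np.mean((signal[: len(restored)] - restored) ** 2)))
```

Here the compressed form stores 125 × 8 coefficients plus an 8 × 64 basis and a 64-value mean, 1,576 numbers instead of 8,000. Real audio is far less regular than two sine tones, so the error grows as components are dropped.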

Cutting Spark shuffle cost

Wide vs narrow transformations on billion-record pipelines
iPrice Group

Spark jobs that processed billions of records were dominated by shuffle cost on EMR. Pushed filters and projections before joins so data shrank before any wide transformation. Used broadcast joins for small dimension tables to skip shuffle entirely. Pre-partitioned hot datasets so repeated joins reused the layout instead of reshuffling.

Net result: faster jobs, smaller clusters, lower bills.

PySpark AWS EMR Apache Airflow Parquet S3

Languages

Python TypeScript JavaScript PHP

AI / ML

PyTorch scikit-learn XGBoost CNNs LLM applications

Data Engineering

ETL Pipelines (Extract, Transform, Load) Apache Airflow PySpark

AWS Cloud

Athena EMR S3 SQS ElastiCache Lambda RDS Azure Data Warehouse

Databases

Elasticsearch MySQL PostgreSQL SQL Server Cassandra

Web & API

Laravel Symfony RESTful API GraphQL

DevOps & Infra

Docker Terraform ELK Stack
AWS
AWS Certified Solutions Architect - Associate
July 2022

15+ years engineering at scale. Pipelines, EMR clusters, Airflow DAGs, and the unglamorous work of cutting cloud costs from the inside. I profile actual bottlenecks before changing anything, and prefer durable fixes over clever ones.

Built teams too: grew my team at iPrice from 3 to 7 and mentored two engineers who were later promoted.

Building AI applications now: LLM-powered automation in production, and learning ML by building (see SketchNet, DIA Risk Screener). Also travel and read.

"Great code is minimal to no code."
Focus: AI engineering, agentic systems, data engineering, cloud infrastructure, backend systems, engineering leadership
Primary stack: Python - TypeScript - AWS - Airflow - Elasticsearch
Certification: AWS Certified Solutions Architect - Associate (2022)