Recommendation System: Candidate Retrieval
How do you narrow down 10 million items to 1000 candidates in under 50ms? The art of fast retrieval at scale.
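As a concrete illustration of that hook, here is a minimal sketch of embedding-based candidate retrieval with an inverted-file (IVF) index, assuming a FAISS-style ANN setup. The corpus size, dimensions, and index parameters below are illustrative, not taken from the post:

```python
# Minimal sketch: narrow a large corpus to ~1000 candidates via ANN search.
# 1M vectors stand in for the 10M items; all parameters are illustrative.
import numpy as np
import faiss

d = 64                                           # embedding dimension
corpus = np.random.rand(1_000_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                 # coarse clustering index
index = faiss.IndexIVFFlat(quantizer, d, 1024)   # 1024 inverted lists
index.train(corpus)                              # learn the coarse centroids
index.add(corpus)

index.nprobe = 16                                # probe 16/1024 lists per query
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 1000)       # corpus -> 1000 candidates
```

The `nprobe` knob is the classic recall/latency trade-off: probing more lists finds more true neighbors but costs proportionally more time.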
Production-grade machine learning system designs covering end-to-end architecture, scalability, and real-world engineering trade-offs. Learn how to build ML systems that serve millions of users.
These posts cover the decisions that matter most in production. Each design covers end-to-end architecture, scalability, and real-world engineering trade-offs.

Posts are grouped by topic: Recommendation Systems, Classification Systems, Data Infrastructure, Experimentation & Metrics, Model Serving & Deployment, Model Evaluation, Feature Engineering & Stores, Model Deployment, Model Training, Computer Vision, Deep Learning, MLOps & Experiment Tracking, Search & Retrieval, Search & Ranking, Natural Language Processing, Data Augmentation, AutoML & Model Design, Infrastructure, Model Optimization, Unsupervised Learning, and Real-Time Systems.
Below you’ll find all ML system design problems in chronological order:
Speculative decoding in 2026: Saguaro, Nightjar, and the universal draft model
1-bit LLMs on consumer hardware: what ternary weights actually cost you
Flash-MoE on MacBook: running 397B parameters on consumer hardware
KV cache for MoE: the memory wall blocking mixture-of-experts at scale
Mistral Small 4: one model, three jobs, and what ‘reasoning effort’ means for serving
The quantization decision tree: which method for which hardware in 2026
The residual stream revelation: why KV cache may be theoretically unnecessary
Speculative decoding meets 4-bit quantization: why the combination outperforms either alone
vLLM’s semantic router: smarter inference for multi-model deployments
Gemma 4: the three architectural decisions that changed what a small model can do
The Workload-Router-Pool: how vLLM thinks about fleet inference
NanoQuant: sub-1-bit quantization shrinks a 70B model from 138GB to 5.35GB
QuantSpec: when speculative decoding meets hierarchical KV cache quantization
Blink: CPU-free LLM inference via SmartNIC for 8.47x P99 latency reduction
The build vs buy decision for AI in 2026: my framework after advising startups
SDSL: speculative decoding finally has scaling laws, and they predict your draft model is too big
Content created with the assistance of large language models and reviewed for technical accuracy.
From raw data to production predictions: building a classification pipeline that handles millions of requests with 99.9% uptime.
How to build production-grade pipelines that clean, transform, and validate billions of data points before training.
How to design experimentation platforms that enable rapid iteration while maintaining statistical rigor at scale.
How to choose between batch and real-time inference, the architectural decision that shapes your entire ML serving infrastructure.
How to measure whether your ML model is actually good: choosing the right metrics is as important as building the model itself.
Feature engineering makes or breaks ML models: learn how to build scalable, production-ready feature pipelines that power real-world systems.
Design production-grade model serving systems that deliver predictions at scale with low latency and high reliability.
Design systems that learn continuously from streaming data, adapting to changing patterns without full retraining.
Design efficient caching layers for ML systems to reduce latency, save compute costs, and improve user experience at scale.
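A minimal sketch of the idea, not the post's implementation: an in-process prediction cache with LRU eviction and a TTL. The size, TTL, and class name are illustrative assumptions; a production system would typically put something like this in front of Redis or a similar shared store:

```python
# Sketch: LRU + TTL cache for model predictions, keyed on hashed features.
import time
from collections import OrderedDict

class PredictionCache:
    def __init__(self, max_size=10_000, ttl_s=60.0):
        self.max_size, self.ttl_s = max_size, ttl_s
        self._store = OrderedDict()              # key -> (timestamp, prediction)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or time.time() - entry[0] > self.ttl_s:
            return None                          # miss, or stale entry
        self._store.move_to_end(key)             # mark as recently used
        return entry[1]

    def put(self, key, prediction):
        self._store[key] = (time.time(), prediction)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)      # evict least recently used
```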
Design a global CDN for ML systems: Edge caching reduces latency from 500ms to 50ms. Critical for real-time predictions worldwide.
Design distributed ML systems that scale to billions of predictions: Master replication, sharding, consensus, and fault tolerance for production ML.
Build production ML infrastructure that dynamically allocates resources using greedy optimization to maximize throughput and minimize costs.
Build production ensemble systems that combine multiple models using backtracking strategies to explore optimal combinations.
Design production clustering systems that group similar items using hash-based and distance-based approaches for recommendations, search, and analytics.
Build production event stream processing systems that handle millions of events per second using windowing and temporal aggregation, applying the same interval-merging logic from classic algorithm problems to live event streams.
Design distributed training architectures that can efficiently process massive sequential datasets and train billion-parameter models across thousands of GPUs.
Design a robust data augmentation pipeline that applies rich transformations to large-scale datasets without becoming the training bottleneck.
Design robust experiment tracking systems that enable systematic exploration, reproducibility, and collaboration across large ML teams.
Design online learning systems that adapt models in real-time using greedy updates, the same adaptive decision-making pattern from Jump Game applied to streaming data.
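A minimal sketch of that streaming pattern, assuming logistic regression updated with one SGD step per event (the dimension and learning rate are illustrative):

```python
# Sketch: online logistic regression, one greedy update per event.
import numpy as np

class OnlineLogReg:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, x):
        return 1.0 / (1.0 + np.exp(-self.w @ x))

    def update(self, x, y):                      # y in {0, 1}, called per event
        grad = (self.predict(x) - y) * x         # gradient of the log-loss
        self.w -= self.lr * grad

model = OnlineLogReg(dim=4)
model.update(np.array([1.0, 0.5, -0.2, 0.3]), y=1)   # one streaming event
```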
Design neural architecture search systems that automatically discover optimal model architectures using dynamic programming and path optimization, the same pathfinding pattern applied to the space of candidate architectures.
A comprehensive guide to FinOps for Machine Learning: reducing TCO without compromising accuracy or latency.
The industry-standard algorithm for converting probabilistic model outputs into coherent text sequences.
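Assuming the algorithm in question is beam search (a common reading of that description), here is a compact sketch; `next_log_probs` is a hypothetical stand-in for a real language model:

```python
# Sketch: beam search over a toy vocabulary with a random stand-in model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS = 20, 0

def next_log_probs(seq):                         # stand-in for a real LM
    logits = rng.standard_normal(VOCAB)
    return logits - np.log(np.exp(logits).sum())

def beam_search(beam_width=3, max_len=8):
    beams = [((), 0.0)]                          # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == EOS:           # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            lp = next_log_probs(seq)
            for tok in np.argsort(lp)[-beam_width:]:         # top-k expansions
                candidates.append((seq + (int(tok),), score + float(lp[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search())
```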
The critical preprocessing step that defines the vocabulary and capabilities of Large Language Models.
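Assuming the step referred to is byte-pair encoding (the most common vocabulary-construction algorithm for LLMs), a toy version of the merge loop looks like this; the corpus and merge count are illustrative:

```python
# Sketch: BPE vocabulary construction by repeatedly fusing the most
# frequent adjacent symbol pair.
from collections import Counter

def merge_word(word, pair, new_sym):
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(new_sym); i += 2          # fuse the chosen pair
        else:
            out.append(word[i]); i += 1
    return tuple(out)

def bpe_merges(words, num_merges=5):
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))          # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent pair wins
        merges.append(best)
        words = [merge_word(w, best, best[0] + best[1]) for w in words]
    return merges, words

corpus = [tuple("lower"), tuple("lowest"), tuple("newer"), tuple("wider")]
print(bpe_merges(corpus))
```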
The silent killer of ML models is not a bug in the code, but a change in the world.
Not everything needs to be real-time. Sometimes, “tomorrow morning” is fast enough.
Architecture is destiny. The difference between 50% accuracy and 90% accuracy is often just a skip connection.
How does Google search 50 billion pages in 0.1 seconds? The answer is the “Ranking Funnel”.
“Organizing the world’s information into a structured hierarchy.”
“Leveraging the connection structure to predict what users will love.”
“Managing complex ML workflows with thousands of interdependent tasks.”
“Moving beyond keywords to understand the meaning of a query.”
“Ensuring your ML models are available everywhere, all the time.”
“Structuring the world’s information into connected entities and relationships.”
“Defining where one object ends and another begins.”
“How to share a supercomputer without stepping on each other’s toes.”
“Predicting the next word, the next stock price, the next frame.”
“Finding the perfect knobs to turn.”
“Trust, but verify. Why did the model say No?”
“Scaling from one GPU to thousands.”
“Fitting billion-parameter models into megabytes.”
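To make the squeeze concrete, here is a minimal symmetric int8 quantization round-trip; per-tensor scaling is the simplest possible scheme, not necessarily the one the post describes:

```python
# Sketch: symmetric per-tensor int8 quantization and its reconstruction error.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0              # map max magnitude to int8 range
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)
q, s = quantize_int8(x)
print(np.abs(dequantize(q, s) - x).max())        # worst-case rounding error
```

Int8 alone buys a 4x reduction over fp32; fitting billion-parameter models into megabytes takes far more aggressive schemes, like the sub-1-bit methods covered elsewhere in this list.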
“The centralized truth for machine learning features.”
“The infrastructure for semantic search and AI-native applications.”
“Serving models that think at human scale.”
“Grounding LLMs in facts, not hallucinations.”
“Standing on the shoulders of giants isn’t just a metaphor, it’s an engineering requirement.”
“Training is Art. Serialization is Logistics. Wars are won on logistics.”
“The user knows what they want. Your job is to tell them before they finish typing.”
“Cron is not an orchestrator. A script is not a pipeline.”
“Before machines could write essays, they had to learn to spell.”
“If data can’t move, move the model, and design the system so the server never sees what matters.”
“Anomaly detection is trapping rain water for metrics: find the boundaries of ‘normal’ and measure what overflows.”
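To make the analogy concrete, a minimal sketch: learn rolling boundaries of "normal" and flag whatever overflows them. The window size and threshold are illustrative assumptions:

```python
# Sketch: rolling-boundary anomaly detection for a metric series.
import numpy as np

def overflow_anomalies(series, window=50, k=3.0):
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        hist = series[i - window:i]
        lo = hist.mean() - k * hist.std()        # lower wall of "normal"
        hi = hist.mean() + k * hist.std()        # upper wall of "normal"
        flags[i] = not (lo <= series[i] <= hi)   # overflow -> anomaly
    return flags
```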
“Most ML failures aren’t model bugs, they’re invalid data quietly passing through.”
“Most ML pipelines are quietly powered by pattern matching, rules, validators, and weak labels before the model ever trains.”
“The best algorithm is the one you didn’t have to tune by hand. AutoML is about moving the engineer from ‘writing code’ to ‘writing the objective function’.”
“Generalization is the goal of ML, but Personalization is the goal of Products. Real-time personalization is about capturing the intent of the ‘now’.”
“Capacity Planning is the art of predicting the future while paying for the present. In ML, it is the difference between a high-growth product and a bankrupt one.”
“An NLP pipeline is a factory for meaning. It takes raw, messy human dialogue and transforms it into a structured, machine-compatible stream of intent and entities.”
“The ultimate bottleneck in machine learning is not data or compute, it is the human engineer. AutoML Systems aim to automate the ‘grad student descent’, tur...
“In the world of high-scale machine learning, the fastest inference is the one you never had to compute. Caching is not just about saving time; it’s about ma...
“A single click can compromise a nation. In the battle for the web’s safety, your ML classifier is the only thing standing between a user and a digital catastrophe.”
“In the world of high-scale AI, the difference between a model that works in a sandbox and one that survives the real world is a mastery of the first principles.”
"A model in a Jupyter Notebook is a laboratory curiosity. A model in production is a liability until it is governed by a rigorous operations framework."
“Building a chatbot that responds is easy. Building a conversational system that remembers, reasons, and scales to millions of concurrent users without melti...
“2024 was the year we learned to talk to machines. 2025 was the year the machines learned to reason with us. This isn’t just a new set of weights; it is a fu...
"In the world of high-scale inference, 100 milliseconds isn’t just a delay; it’s a cost center. When serving millions of users, every nanosecond shaved off a...
“The draft model sits idle for half the wall-clock time. That’s the bottleneck nobody talks about.”
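To see where that idle time comes from, here is a minimal sketch of one speculative decoding step. The draft and target are random stand-ins for real models, and the correction sample normally drawn after a rejection is omitted for brevity:

```python
# Sketch: draft proposes k tokens (target idle), target verifies (draft idle).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def draft_logits(prefix):                        # stand-in: tiny draft model
    return rng.standard_normal(VOCAB)

def target_logits(prefix):                       # stand-in: large target model
    return rng.standard_normal(VOCAB)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def speculative_step(prefix, k=4):
    proposed, q_probs = [], []
    for _ in range(k):                           # phase 1: draft runs alone
        q = softmax(draft_logits(prefix + proposed))
        tok = int(rng.choice(VOCAB, p=q))
        proposed.append(tok); q_probs.append(q)
    accepted = []
    for i, tok in enumerate(proposed):           # phase 2: target runs alone
        p = softmax(target_logits(prefix + proposed[:i]))
        if rng.random() < min(1.0, p[tok] / q_probs[i][tok]):
            accepted.append(tok)                 # survives rejection sampling
        else:
            break                                # first rejection ends the step
    return accepted

print(speculative_step([1, 2, 3]))
```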
“Matrix multiplication without multiplication. That’s not a riddle — it’s how ternary weights work.”
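The trick in one sketch: when weights live in {-1, 0, +1}, every dot product reduces to additions and subtractions. This is illustrative NumPy, not a real ternary kernel:

```python
# Sketch: ternary "matmul" computed with adds and subtracts only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))             # ternary weights in {-1, 0, +1}
x = rng.standard_normal(8)

def ternary_matvec(W, x):
    out = np.zeros(W.shape[0])
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # no multiplies
    return out

assert np.allclose(ternary_matvec(W, x), W @ x)  # agrees with a real matmul
```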
The first thing you notice when Flash-MoE loads Qwen3.5-397B is that it works. No caveats about reduced functionality. No warning to expect terrible throughput.
“MoE sparsifies the computation. The memory bill arrives in full.”
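A back-of-envelope sketch of why: KV cache size depends on layer count, KV heads, head dimension, context length, batch size, and dtype, but not on how many experts fire per token. All model numbers below are illustrative assumptions:

```python
# Sketch: KV cache size is independent of expert sparsity.
layers, kv_heads, head_dim = 61, 8, 128          # illustrative MoE-sized config
seq_len, batch, bytes_per_elem = 32_768, 16, 2   # fp16

kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")   # ~122 GiB with these numbers
```

The leading factor of 2 counts the separate K and V tensors; activating fewer experts changes none of these terms.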
“Four models, four deployments, four scaling policies, four monitoring dashboards. Or: one model with a dial.”
Every week, r/LocalLLaMA gets the same post: “I have X GB of VRAM and want to run Y model. What quantization should I use?” The replies converge on the same ...
Every team running LLM inference at scale has the same conversation. Someone opens a memory profiler, sees the KV cache consuming most of the GPU memory budget.
“Speculative decoding used to be a research paper. Now it is a checkbox in vLLM.”
Weight quantization gets all the attention. Quantize to INT8, maybe INT4, watch the benchmark score. But model weights are a one-time cost. The KV cache grows with every token of context and every concurrent request.
Load balancing assumes requests are interchangeable. They’re not.
TL;DR: Gemma 4 (Google DeepMind, April 2026, Apache 2.0) ships in four sizes — 2B, 4B multimodal, 26B MoE, 31B dense — with three architectural decisions that changed what a small model can do.
TL;DR: The March 2026 vision paper “The Workload-Router-Pool Architecture for LLM Inference Optimization” (arXiv:2603.21354, vLLM Semantic Router project, MB...
TL;DR — NanoQuant (arXiv 2602.06694) compresses a 70B model from 138GB to 5.35GB — 26x reduction — while staying competitive on language modeling benchmarks...
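A quick sanity check on the headline numbers, using only the figures quoted above:

```python
# Sanity check: is 5.35GB for 70B parameters really sub-1-bit?
params = 70e9
print(params * 2 / 1e9)                          # ~140 GB at fp16 (138 GB quoted)
print(138 / 5.35)                                # ~25.8x, matching the quoted 26x
print(5.35e9 * 8 / params)                       # ~0.61 bits/param: sub-1-bit
```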
TL;DR — QuantSpec (arXiv 2502.10424) fuses speculative decoding with hierarchical KV cache quantization. The model’s own quantized layers serve as the draft model.
“We’ve spent five years optimizing the GPU. The CPU was the bottleneck the entire time.”
TL;DR — Speculative decoding has been a tuning problem: pick a draft model, measure acceptance rate, iterate. SDSL (arXiv 2603.11053) turns it into a systematic one, with scaling laws that predict the right draft model size.