Recommendation System: Candidate Retrieval
How do you narrow down 10 million items to 1000 candidates in under 50ms? The art of fast retrieval at scale.
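As a concrete illustration of that hook, here is a minimal sketch of embedding-based candidate retrieval with an inverted-file (IVF) index, assuming a FAISS-style ANN setup. The corpus size, dimensions, and index parameters below are illustrative, not taken from the post:

```python
# Minimal sketch: narrow a large corpus to ~1000 candidates via ANN search.
# 1M vectors stand in for the 10M items; all parameters are illustrative.
import numpy as np
import faiss

d = 64                                           # embedding dimension
corpus = np.random.rand(1_000_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                 # coarse clustering index
index = faiss.IndexIVFFlat(quantizer, d, 1024)   # 1024 inverted lists
index.train(corpus)                              # learn the coarse centroids
index.add(corpus)

index.nprobe = 16                                # probe 16/1024 lists per query
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 1000)       # corpus -> 1000 candidates
```

The `nprobe` knob is the classic recall/latency trade-off: probing more lists finds more true neighbors but costs proportionally more time.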
Production-grade machine learning system designs covering end-to-end architecture, scalability, and real-world engineering trade-offs. Learn how to build ML systems that serve millions of users.
These posts cover the decisions that matter most in production. Each design covers end-to-end architecture, scalability, and real-world engineering trade-offs.

Posts are grouped by topic: Recommendation Systems, Classification Systems, Data Infrastructure, Experimentation & Metrics, Model Serving & Deployment, Model Evaluation, Feature Engineering & Stores, Model Deployment, Model Training, Computer Vision, Deep Learning, MLOps & Experiment Tracking, Search & Retrieval, Search & Ranking, Natural Language Processing, Data Augmentation, AutoML & Model Design, Infrastructure, Model Optimization, Unsupervised Learning, and Real-Time Systems.
Below you’ll find all ML system design problems in chronological order:
Speculative decoding in 2026: Saguaro, Nightjar, and the universal draft model
1-bit LLMs on consumer hardware: what ternary weights actually cost you
Flash-MoE on MacBook: running 397B parameters on consumer hardware
KV cache for MoE: the memory wall blocking mixture-of-experts at scale
Mistral Small 4: one model, three jobs, and what ‘reasoning effort’ means for serving
The quantization decision tree: which method for which hardware in 2026
The residual stream revelation: why KV cache may be theoretically unnecessary
Speculative decoding meets 4-bit quantization: why the combination outperforms either alone
vLLM’s semantic router: smarter inference for multi-model deployments
Gemma 4: the three architectural decisions that changed what a small model can do
The Workload-Router-Pool: how vLLM thinks about fleet inference
NanoQuant: sub-1-bit quantization shrinks a 70B model from 138GB to 5.35GB
QuantSpec: when speculative decoding meets hierarchical KV cache quantization
Blink: CPU-free LLM inference via SmartNIC for 8.47x P99 latency reduction
The build vs buy decision for AI in 2026: my framework after advising startups
SDSL: speculative decoding finally has scaling laws, and they predict your draft model is too big
Content created with the assistance of large language models and reviewed for technical accuracy.
From raw data to production predictions: building a classification pipeline that handles millions of requests with 99.9% uptime.
How to build production-grade pipelines that clean, transform, and validate billions of data points before training.
How to design experimentation platforms that enable rapid iteration while maintaining statistical rigor at scale.
How to choose between batch and real-time inference, the architectural decision that shapes your entire ML serving infrastructure.
How to measure whether your ML model is actually good: choosing the right metrics is as important as building the model itself.
Feature engineering makes or breaks ML models: learn how to build scalable, production-ready feature pipelines that power real-world systems.
Design production-grade model serving systems that deliver predictions at scale with low latency and high reliability.
Design systems that learn continuously from streaming data, adapting to changing patterns without full retraining.
Design efficient caching layers for ML systems to reduce latency, save compute costs, and improve user experience at scale.
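A minimal sketch of the idea, not the post's implementation: an in-process prediction cache with LRU eviction and a TTL. The size, TTL, and class name are illustrative assumptions; a production system would typically put something like this in front of Redis or a similar shared store:

```python
# Sketch: LRU + TTL cache for model predictions, keyed on hashed features.
import time
from collections import OrderedDict

class PredictionCache:
    def __init__(self, max_size=10_000, ttl_s=60.0):
        self.max_size, self.ttl_s = max_size, ttl_s
        self._store = OrderedDict()              # key -> (timestamp, prediction)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or time.time() - entry[0] > self.ttl_s:
            return None                          # miss, or stale entry
        self._store.move_to_end(key)             # mark as recently used
        return entry[1]

    def put(self, key, prediction):
        self._store[key] = (time.time(), prediction)
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)      # evict least recently used
```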
Design a global CDN for ML systems: Edge caching reduces latency from 500ms to 50ms. Critical for real-time predictions worldwide.
Design distributed ML systems that scale to billions of predictions: Master replication, sharding, consensus, and fault tolerance for production ML.
Build production ML infrastructure that dynamically allocates resources using greedy optimization to maximize throughput and minimize costs.
Build production ensemble systems that combine multiple models using backtracking strategies to explore optimal combinations.
Design production clustering systems that group similar items using hash-based and distance-based approaches for recommendations, search, and analytics.
Build production event stream processing systems that handle millions of events per second using windowing and temporal aggregation, applying the same interval-merging logic from classic algorithm problems to live event streams.
Design distributed training architectures that can efficiently process massive sequential datasets and train billion-parameter models across thousands of GPUs.
Design a robust data augmentation pipeline that applies rich transformations to large-scale datasets without becoming the training bottleneck.
Design robust experiment tracking systems that enable systematic exploration, reproducibility, and collaboration across large ML teams.
Design online learning systems that adapt models in real-time using greedy updates, the same adaptive decision-making pattern from Jump Game applied to streaming data.
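A minimal sketch of that streaming pattern, assuming logistic regression updated with one SGD step per event (the dimension and learning rate are illustrative):

```python
# Sketch: online logistic regression, one greedy update per event.
import numpy as np

class OnlineLogReg:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, x):
        return 1.0 / (1.0 + np.exp(-self.w @ x))

    def update(self, x, y):                      # y in {0, 1}, called per event
        grad = (self.predict(x) - y) * x         # gradient of the log-loss
        self.w -= self.lr * grad

model = OnlineLogReg(dim=4)
model.update(np.array([1.0, 0.5, -0.2, 0.3]), y=1)   # one streaming event
```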
Design neural architecture search systems that automatically discover optimal model architectures using dynamic programming and path optimization, the same pathfinding pattern applied to the space of candidate architectures.
A comprehensive guide to FinOps for Machine Learning: reducing TCO without compromising accuracy or latency.
The industry-standard algorithm for converting probabilistic model outputs into coherent text sequences.
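Assuming the algorithm in question is beam search (a common reading of that description), here is a compact sketch; `next_log_probs` is a hypothetical stand-in for a real language model:

```python
# Sketch: beam search over a toy vocabulary with a random stand-in model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EOS = 20, 0

def next_log_probs(seq):                         # stand-in for a real LM
    logits = rng.standard_normal(VOCAB)
    return logits - np.log(np.exp(logits).sum())

def beam_search(beam_width=3, max_len=8):
    beams = [((), 0.0)]                          # (tokens, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == EOS:           # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            lp = next_log_probs(seq)
            for tok in np.argsort(lp)[-beam_width:]:         # top-k expansions
                candidates.append((seq + (int(tok),), score + float(lp[tok])))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0]

print(beam_search())
```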
The critical preprocessing step that defines the vocabulary and capabilities of Large Language Models.
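Assuming the step referred to is byte-pair encoding (the most common vocabulary-construction algorithm for LLMs), a toy version of the merge loop looks like this; the corpus and merge count are illustrative:

```python
# Sketch: BPE vocabulary construction by repeatedly fusing the most
# frequent adjacent symbol pair.
from collections import Counter

def merge_word(word, pair, new_sym):
    out, i = [], 0
    while i < len(word):
        if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
            out.append(new_sym); i += 2          # fuse the chosen pair
        else:
            out.append(word[i]); i += 1
    return tuple(out)

def bpe_merges(words, num_merges=5):
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))          # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent pair wins
        merges.append(best)
        words = [merge_word(w, best, best[0] + best[1]) for w in words]
    return merges, words

corpus = [tuple("lower"), tuple("lowest"), tuple("newer"), tuple("wider")]
print(bpe_merges(corpus))
```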
The silent killer of ML models is not a bug in the code, but a change in the world.
Not everything needs to be real-time. Sometimes, “tomorrow morning” is fast enough.
Architecture is destiny. The difference between 50% accuracy and 90% accuracy is often just a skip connection.
How does Google search 50 billion pages in 0.1 seconds? The answer is the “Ranking Funnel”.
“Organizing the world’s information into a structured hierarchy.”
“Leveraging the connection structure to predict what users will love.”
“Managing complex ML workflows with thousands of interdependent tasks.”
“Moving beyond keywords to understand the meaning of a query.”
“Ensuring your ML models are available everywhere, all the time.”
“Structuring the world’s information into connected entities and relationships.”
“Defining where one object ends and another begins.”
“How to share a supercomputer without stepping on each other’s toes.”
“Predicting the next word, the next stock price, the next frame.”
“Finding the perfect knobs to turn.”
“Trust, but verify. Why did the model say No?”
“Scaling from one GPU to thousands.”
“Fitting billion-parameter models into megabytes.”
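To make the squeeze concrete, here is a minimal symmetric int8 quantization round-trip; per-tensor scaling is the simplest possible scheme, not necessarily the one the post describes:

```python
# Sketch: symmetric per-tensor int8 quantization and its reconstruction error.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0              # map max magnitude to int8 range
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.randn(1000).astype(np.float32)
q, s = quantize_int8(x)
print(np.abs(dequantize(q, s) - x).max())        # worst-case rounding error
```

Int8 alone buys a 4x reduction over fp32; fitting billion-parameter models into megabytes takes far more aggressive schemes, like the sub-1-bit methods covered elsewhere in this list.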
“The centralized truth for machine learning features.”
“The infrastructure for semantic search and AI-native applications.”
“Serving models that think at human scale.”
“Grounding LLMs in facts, not hallucinations.”
“Standing on the shoulders of giants isn’t just a metaphor, it’s an engineering requirement.”
“Training is Art. Serialization is Logistics. Wars are won on logistics.”
“The user knows what they want. Your job is to tell them before they finish typing.”
“Cron is not an orchestrator. A script is not a pipeline.”
“Before machines could write essays, they had to learn to spell.”
“If data can’t move, move the model, and design the system so the server never sees what matters.”
“Anomaly detection is trapping rain water for metrics: find the boundaries of ‘normal’ and measure what overflows.”
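To make the analogy concrete, a minimal sketch: learn rolling boundaries of "normal" and flag whatever overflows them. The window size and threshold are illustrative assumptions:

```python
# Sketch: rolling-boundary anomaly detection for a metric series.
import numpy as np

def overflow_anomalies(series, window=50, k=3.0):
    series = np.asarray(series, dtype=float)
    flags = np.zeros(len(series), dtype=bool)
    for i in range(window, len(series)):
        hist = series[i - window:i]
        lo = hist.mean() - k * hist.std()        # lower wall of "normal"
        hi = hist.mean() + k * hist.std()        # upper wall of "normal"
        flags[i] = not (lo <= series[i] <= hi)   # overflow -> anomaly
    return flags
```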
“Most ML failures aren’t model bugs, they’re invalid data quietly passing through.”
“Most ML pipelines are quietly powered by pattern matching, rules, validators, and weak labels before the model ever trains.”
“The best algorithm is the one you didn’t have to tune by hand. AutoML is about moving the engineer from ‘writing code’ to ‘writing the objective function’.”
“Generalization is the goal of ML, but Personalization is the goal of Products. Real-time personalization is about capturing the intent of the ‘now’.”
“Capacity Planning is the art of predicting the future while paying for the present. In ML, it is the difference between a high-growth product and a bankrupt one.”
“An NLP pipeline is a factory for meaning. It takes raw, messy human dialogue and transforms it into a structured, machine-compatible stream of intent and entities.”
“The ultimate bottleneck in machine learning is not data or compute, it is the human engineer. AutoML Systems aim to automate the ‘grad student descent’, tur...
“In the world of high-scale machine learning, the fastest inference is the one you never had to compute. Caching is not just about saving time; it’s about ma...
“A single click can compromise a nation. In the battle for the web’s safety, your ML classifier is the only thing standing between a user and a digital catastrophe.”
“In the world of high-scale AI, the difference between a model that works in a sandbox and one that survives the real world is a mastery of the first principles.”
"A model in a Jupyter Notebook is a laboratory curiosity. A model in production is a liability until it is governed by a rigorous operations framework."
“Building a chatbot that responds is easy. Building a conversational system that remembers, reasons, and scales to millions of concurrent users without melti...
“2024 was the year we learned to talk to machines. 2025 was the year the machines learned to reason with us. This isn’t just a new set of weights; it is a fu...
"In the world of high-scale inference, 100 milliseconds isn’t just a delay; it’s a cost center. When serving millions of users, every nanosecond shaved off a...
“The draft model sits idle for half the wall-clock time. That’s the bottleneck nobody talks about.”
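To see where that idle time comes from, here is a minimal sketch of one speculative decoding step. The draft and target are random stand-ins for real models, and the correction sample normally drawn after a rejection is omitted for brevity:

```python
# Sketch: draft proposes k tokens (target idle), target verifies (draft idle).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50

def draft_logits(prefix):                        # stand-in: tiny draft model
    return rng.standard_normal(VOCAB)

def target_logits(prefix):                       # stand-in: large target model
    return rng.standard_normal(VOCAB)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def speculative_step(prefix, k=4):
    proposed, q_probs = [], []
    for _ in range(k):                           # phase 1: draft runs alone
        q = softmax(draft_logits(prefix + proposed))
        tok = int(rng.choice(VOCAB, p=q))
        proposed.append(tok); q_probs.append(q)
    accepted = []
    for i, tok in enumerate(proposed):           # phase 2: target runs alone
        p = softmax(target_logits(prefix + proposed[:i]))
        if rng.random() < min(1.0, p[tok] / q_probs[i][tok]):
            accepted.append(tok)                 # survives rejection sampling
        else:
            break                                # first rejection ends the step
    return accepted

print(speculative_step([1, 2, 3]))
```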
“Matrix multiplication without multiplication. That’s not a riddle — it’s how ternary weights work.”
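The trick in one sketch: when weights live in {-1, 0, +1}, every dot product reduces to additions and subtractions. This is illustrative NumPy, not a real ternary kernel:

```python
# Sketch: ternary "matmul" computed with adds and subtracts only.
import numpy as np

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))             # ternary weights in {-1, 0, +1}
x = rng.standard_normal(8)

def ternary_matvec(W, x):
    out = np.zeros(W.shape[0])
    for i, row in enumerate(W):
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # no multiplies
    return out

assert np.allclose(ternary_matvec(W, x), W @ x)  # agrees with a real matmul
```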
The first thing you notice when Flash-MoE loads Qwen3.5-397B is that it works. No caveats about reduced functionality. No warning to expect terrible throughput.
“MoE sparsifies the computation. The memory bill arrives in full.”
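A back-of-envelope sketch of why: KV cache size depends on layer count, KV heads, head dimension, context length, batch size, and dtype, but not on how many experts fire per token. All model numbers below are illustrative assumptions:

```python
# Sketch: KV cache size is independent of expert sparsity.
layers, kv_heads, head_dim = 61, 8, 128          # illustrative MoE-sized config
seq_len, batch, bytes_per_elem = 32_768, 16, 2   # fp16

kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")   # ~122 GiB with these numbers
```

The leading factor of 2 counts the separate K and V tensors; activating fewer experts changes none of these terms.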
“Four models, four deployments, four scaling policies, four monitoring dashboards. Or: one model with a dial.”
Every week, r/LocalLLaMA gets the same post: “I have X GB of VRAM and want to run Y model. What quantization should I use?” The replies converge on the same ...
Every team running LLM inference at scale has the same conversation. Someone opens a memory profiler, sees the KV cache consuming most of the GPU memory budget.
“Speculative decoding used to be a research paper. Now it is a checkbox in vLLM.”
Weight quantization gets all the attention. Quantize to INT8, maybe INT4, watch the benchmark score. But model weights are a one-time cost. The KV cache grows with every token of context and every concurrent request.
Load balancing assumes requests are interchangeable. They’re not.
TL;DR: Gemma 4 (Google DeepMind, April 2026, Apache 2.0) ships in four sizes — 2B, 4B multimodal, 26B MoE, 31B dense — with three architectural decisions that changed what a small model can do.
TL;DR: The March 2026 vision paper “The Workload-Router-Pool Architecture for LLM Inference Optimization” (arXiv:2603.21354, vLLM Semantic Router project, MB...
TL;DR — NanoQuant (arXiv 2602.06694) compresses a 70B model from 138GB to 5.35GB — 26x reduction — while staying competitive on language modeling benchmarks...
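A quick sanity check on the headline numbers, using only the figures quoted above:

```python
# Sanity check: is 5.35GB for 70B parameters really sub-1-bit?
params = 70e9
print(params * 2 / 1e9)                          # ~140 GB at fp16 (138 GB quoted)
print(138 / 5.35)                                # ~25.8x, matching the quoted 26x
print(5.35e9 * 8 / params)                       # ~0.61 bits/param: sub-1-bit
```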
TL;DR — QuantSpec (arXiv 2502.10424) fuses speculative decoding with hierarchical KV cache quantization. The model’s own quantized layers serve as the draft model.
“We’ve spent five years optimizing the GPU. The CPU was the bottleneck the entire time.”
TL;DR — Speculative decoding has been a tuning problem: pick a draft model, measure acceptance rate, iterate. SDSL (arXiv 2603.11053) turns it into a systematic one, with scaling laws that predict the right draft model size.