
“Anomaly detection is trapping rain water for metrics: find the boundaries of ‘normal’ and measure what overflows.”

1. Problem Statement

In production ML and data systems, anomalies are the early warning signals of:

  • data quality regressions
  • pipeline failures
  • model performance drift
  • fraud and abuse
  • infrastructure incidents

The goal is to design an anomaly detection system that:

  • ingests high-volume metrics/events
  • detects meaningful anomalies quickly
  • keeps false positives low (alert fatigue is real)
  • supports investigation and root cause analysis
  • is robust to seasonality and shifting baselines

Today’s shared theme is pattern recognition: in DSA, we detect the “boundary pattern” that determines trapped water. In anomaly detection, we detect deviations from the learned “normal pattern”.


2. Understanding the Requirements

2.1 Functional requirements

  • Inputs
    • time-series metrics (QPS, latency, error rates)
    • categorical counts (events by user type, country)
    • distributions (histograms, sketches)
    • model signals (loss, accuracy proxies, embedding stats)
  • Outputs
    • anomaly alerts with severity
    • explanation context (top contributing dimensions)
    • links to dashboards/runbooks
    • ability to suppress/acknowledge

2.2 Non-functional requirements

  • Low latency: detect within minutes (or seconds for fraud/infra).
  • Scalability: multi-tenant at 1M events/sec.
  • Precision: too many alerts kill trust.
  • Recall: missing important anomalies is costly.
  • Robustness: handle missing data, late arrivals, backfills.
  • Interpretability: support “why did this alert fire?”.
  • Governance: alert ownership, runbooks, on-call routing.

3. High-Level Architecture

3.1 Diagram

```
Producers (services / models / pipelines)
      |
      v
+-------------+        +-------------------+
| Kafka/PubSub| -----> | Stream Processor  |
+-------------+        |   (Flink/Spark)   |
      |                +---------+---------+
      |                          |
      |                          v
      |                +-------------------+
      |                |  Feature Builder  |
      |                | (windows, joins)  |
      |                +---------+---------+
      |                          |
      v                          v
+-------------------+  +-------------------+
|     TS Store      |  | Detector Service  |
|   (Prom/TSDB)     |  |    rules + ML     |
+---------+---------+  +---------+---------+
          |                      |
          v                      v
+-------------------+  +-------------------+
|    Dashboards     |  |   Alert Router    |
|    (Grafana)      |  |  (Pager/Slack)    |
+-------------------+  +---------+---------+
                                 |
                                 v
                       +-------------------+
                       |  Case Mgmt / RCA  |
                       | (tickets, notes)  |
                       +-------------------+
```

3.2 Core idea

You generally run a two-layer detector:

  • fast rules for clear anomalies (error rate > X, no events for Y minutes)
  • statistical/ML detectors for subtle drift (seasonality-aware, multivariate)

Rules are cheap and interpretable. ML handles complex patterns, but must be deployed carefully.


4. Component Deep-Dives

4.1 Data ingestion and schema discipline

The most common failure is not a “bad model”; it’s bad data:

  • missing metrics due to instrumentation bugs
  • unit changes (ms → s)
  • counter resets
  • duplicate events

So your system needs:

  • schema registry (event contracts)
  • validation (ranges, types, monotonicity for counters)
  • backfill handling (late arrivals)
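
As a sketch of what the validation layer checks, here is a minimal per-point validator for a counter metric (the argument names and default range are illustrative; in practice ranges, units, and monotonicity rules come from the schema registry entry):

```python
def validate_counter_point(value: float, prev_value: float,
                           expected_range: tuple = (0.0, float("inf"))) -> list:
    """Minimal validation for one point of a monotonically increasing counter.

    Returns a list of violation reasons (empty list = clean).
    """
    violations = []
    if value != value:  # NaN check
        violations.append("not_a_number")
        return violations
    lo, hi = expected_range
    if not (lo <= value <= hi):
        violations.append("out_of_range")
    if value < prev_value:
        violations.append("counter_reset")  # restart or instrumentation bug
    return violations
```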

4.2 Feature engineering for anomaly detection

Features depend on signal type:

Time series

  • moving averages, EWMA
  • seasonal decomposition (daily/weekly)
  • derivative and acceleration (change rates)

Categorical slices

  • per-dimension rates (country, device)
  • top-k contributors (heavy hitters)
  • ratio metrics (errors / requests)

Distributions

  • quantiles (p50/p95/p99)
  • histogram distance metrics (KL divergence, Wasserstein)

This is the “pattern recognition” layer: translate raw signals into something that makes deviations obvious.
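
As a minimal sketch of this layer, assuming a per-metric numpy array of 1-minute values (window sizes and feature names are illustrative, not a fixed schema):

```python
import numpy as np

WEEK_MINUTES = 7 * 24 * 60  # 10080 points for a 1-minute series

def build_features(values: np.ndarray, window: int = 60) -> dict:
    """Turn a raw series into deviation-friendly features.

    Real pipelines would key this by metric and segment and update state
    incrementally instead of recomputing over the full window.
    """
    recent = values[-window:]
    med = np.median(recent)
    mad = np.median(np.abs(recent - med)) + 1e-9
    return {
        "last": float(values[-1]),
        "rolling_median": float(med),
        "rolling_mad": float(mad),
        # first derivative: how fast the metric is moving
        "delta_1m": float(values[-1] - values[-2]) if len(values) > 1 else 0.0,
        # quantiles summarize the recent distribution
        "p95": float(np.quantile(recent, 0.95)),
        "p99": float(np.quantile(recent, 0.99)),
        # naive weekly seasonality feature: ratio vs same minute last week
        "wow_ratio": float(values[-1] / (values[-WEEK_MINUTES] + 1e-9))
        if len(values) > WEEK_MINUTES else 1.0,
    }
```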

4.3 Detectors: rules, statistics, ML

Common detector families:

  • Rules
    • simple thresholds, rate-of-change, “no data” checks
    • best for: clear SLO violations, missing pipelines

  • Statistical
    • z-score with robust estimators (median/MAD)
    • EWMA control charts
    • seasonal baselines (Holt-Winters)
    • best for: stable metrics with known periodicity

  • ML / Unsupervised
    • Isolation Forest, One-Class SVM
    • autoencoders
    • clustering distance to centroids
    • best for: high-dimensional signals, unknown patterns

The important production truth:

You don’t “pick one model”. You build a system that can run multiple detectors and fuse them.

4.4 Alerting and routing

Anomaly detection is useless without action. You need:

  • deduplication (same alert grouped)
  • suppression (maintenance windows)
  • routing by ownership (service/team)
  • escalation policies and runbooks
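
A minimal sketch of the deduplication piece (field names like metric/segment/detector are assumptions about the alert payload, not a fixed schema); suppression windows and routing can key off the same fingerprint:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AlertDeduper:
    """Group repeated alerts by fingerprint within a cooldown window."""
    cooldown_s: int = 1800
    _last_sent: dict = field(default_factory=dict)

    def fingerprint(self, alert: dict) -> str:
        # Illustrative fingerprint: same metric + segment + detector = same alert
        return f'{alert["metric"]}|{alert.get("segment", "global")}|{alert["detector"]}'

    def should_emit(self, alert: dict, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        fp = self.fingerprint(alert)
        last = self._last_sent.get(fp)
        if last is not None and now - last < self.cooldown_s:
            return False  # duplicate of a recently sent alert: suppress
        self._last_sent[fp] = now
        return True
```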

4.5 Explanation (why did this fire?)

For slice-based anomalies, a powerful technique is “top contributors”:

  • overall error rate rose
  • 80% of delta is from country=IN and device=android_low_end

This reduces MTTR dramatically.

4.6 Thresholding is a product decision (not just statistics)

A detector outputs a score. You still need to decide:

  • when do we page a human?
  • when do we log-only?
  • when do we auto-mitigate?

A practical severity ladder:

  • S0 (log-only): weak signal, used for trend analysis
  • S1 (notify): Slack alert, no pager
  • S2 (page): likely user impact or SLO breach
  • S3 (auto-mitigate + page): safety-critical systems

How to set thresholds:

  • start with conservative paging thresholds (avoid alert fatigue)
  • use historical replay to estimate expected alert volume
  • adjust per metric and per service criticality

The uncomfortable truth:

“Correct” thresholds depend on team capacity, business impact, and tolerance for noise.
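
To make the historical-replay step concrete, here is a minimal sketch that sweeps a candidate threshold over stored detector scores and estimates paging volume (the one-point-per-minute assumption and the cooldown are illustrative):

```python
import numpy as np

def expected_alert_volume(scores: np.ndarray, threshold: float, cooldown: int = 60) -> int:
    """Count how many pages a threshold would have produced on historical scores.

    `cooldown` models dedup: once fired, the same alert cannot re-fire for
    that many points (minutes here).
    """
    pages = 0
    last_fire = -cooldown
    for t, s in enumerate(scores):
        if s >= threshold and t - last_fire >= cooldown:
            pages += 1
            last_fire = t
    return pages

# Sweep candidate thresholds against the on-call budget:
# for th in (0.6, 0.7, 0.8, 0.9):
#     print(th, expected_alert_volume(historical_scores, th))
```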

4.7 Root cause analysis (RCA) requires drill-down, not just detection

Detection answers: “something is wrong”. RCA answers: “what is wrong, where, and why”.

A common production pattern:

  1. detect anomaly on an aggregate metric (e.g., error_rate)
  2. compute deltas by dimension to find top contributors:
    • region, cluster, device, app_version, request_type
  3. present ranked suspects with evidence

This is essentially “explainability via decomposition”. It often beats fancy ML models in reducing MTTR.
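
A minimal sketch of the decomposition step, assuming per-segment counts for the current and baseline windows (the dict shapes are illustrative; real systems run this per dimension):

```python
def top_contributors(current: dict, baseline: dict, k: int = 3) -> list:
    """Rank segments by how much of the aggregate delta they explain."""
    total_delta = sum(current.values()) - sum(baseline.values())
    if abs(total_delta) < 1e-9:
        return []
    rows = []
    for seg in set(current) | set(baseline):
        delta = current.get(seg, 0) - baseline.get(seg, 0)
        rows.append((seg, delta, delta / total_delta))
    rows.sort(key=lambda r: abs(r[1]), reverse=True)
    return rows[:k]

# Example: error counts by country, now vs the same window last week.
now = {"IN": 900, "US": 120, "BR": 80}
last_week = {"IN": 100, "US": 110, "BR": 90}
print(top_contributors(now, last_week))  # IN explains essentially all of the increase
```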

4.8 Handling seasonality and holidays (the hard baseline problem)

Many metrics have:

  • daily cycles (traffic peaks)
  • weekly cycles (weekday vs weekend)
  • event-driven spikes (launches, holidays)

If you ignore seasonality, you’ll page every day at 9am.

Baseline strategies (in increasing complexity):

  • compare to “same time yesterday / last week”
  • rolling window median and MAD
  • Holt-Winters / Prophet-like seasonal models
  • learned baselines per segment

For production, the best choice is often:

  • simple seasonal comparisons + robust statistics
  • plus explicit suppression windows for known events (launch windows)
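
A minimal sketch of the “same time last week” baseline combined with robust statistics, assuming a 1-minute series with a few weeks of history:

```python
import numpy as np

WEEK = 7 * 24 * 60  # points per week for a 1-minute series

def seasonal_anomaly(series: np.ndarray, n_weeks: int = 4, threshold: float = 4.0) -> bool:
    """Score the latest point against the same minute in previous weeks.

    The reference set is the same time slot over the last `n_weeks` weeks;
    the score is a robust (median/MAD) z-score over that set.
    """
    refs = [series[-1 - w * WEEK] for w in range(1, n_weeks + 1)
            if len(series) > w * WEEK]
    if len(refs) < 2:
        return False  # not enough history yet
    refs = np.asarray(refs, dtype=float)
    med = np.median(refs)
    mad = np.median(np.abs(refs - med)) + 1e-9
    z = 0.6745 * (series[-1] - med) / mad
    return abs(z) >= threshold
```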

5. Data Flow (Streaming + Batch)

5.1 Streaming path

  • ingest events into Kafka
  • compute rolling windows (1m, 5m, 1h)
  • compute features (rates, deltas)
  • run rule-based detectors for immediate signals
  • run lightweight statistical detectors (EWMA)
  • emit alerts

5.2 Batch path

  • nightly backfills, seasonality model fitting
  • compute baselines per metric and per segment
  • retrain unsupervised models if used
  • store parameters in a registry

This hybrid is important because:

  • streaming is low-latency but limited context
  • batch has rich history but higher latency

6. Implementation (Core Logic in Python)

The goal here is to show “detector primitives” that can be composed. In production you’d wrap these in streaming jobs backed by persistent state.

6.1 Robust z-score (median + MAD)

```python
import numpy as np

def robust_zscore(x: np.ndarray) -> np.ndarray:
    """
    Robust z-score using median and MAD (median absolute deviation).
    Less sensitive to outliers than mean/std.
    """
    med = np.median(x)
    mad = np.median(np.abs(x - med)) + 1e-9
    return 0.6745 * (x - med) / mad

def detect_anomaly_robust_z(x_window: np.ndarray, threshold: float = 4.0) -> bool:
    z = robust_zscore(x_window)
    return abs(z[-1]) >= threshold
```

6.2 EWMA control chart

```python
def ewma(series: np.ndarray, alpha: float = 0.2) -> np.ndarray:
    out = np.zeros_like(series, dtype=float)
    out[0] = series[0]
    for i in range(1, len(series)):
        out[i] = alpha * series[i] + (1 - alpha) * out[i - 1]
    return out

def detect_ewma_shift(series: np.ndarray, alpha: float = 0.2, k: float = 3.0) -> bool:
    """
    Simple EWMA shift detector: compare last value to EWMA +/- k * std.
    (In production, use rolling std and seasonality-aware baselines.)
    """
    smoothed = ewma(series, alpha=alpha)
    mu = np.mean(smoothed[:-1])
    sigma = np.std(smoothed[:-1]) + 1e-9
    return abs(smoothed[-1] - mu) > k * sigma
```

6.3 Fusing detectors (rule + stats)

```python
from dataclasses import dataclass

@dataclass
class DetectionResult:
    is_anomaly: bool
    reason: str

def fused_detector(series: np.ndarray) -> DetectionResult:
    # Rule: "no data" or impossible values
    if np.isnan(series[-1]):
        return DetectionResult(True, "missing_data")
    if series[-1] < 0:
        return DetectionResult(True, "negative_value")

    # Statistical: robust z-score
    if detect_anomaly_robust_z(series, threshold=4.0):
        return DetectionResult(True, "robust_zscore")

    # Statistical: EWMA shift
    if detect_ewma_shift(series, alpha=0.2, k=3.0):
        return DetectionResult(True, "ewma_shift")

    return DetectionResult(False, "normal")
```

This illustrates a production principle: anomaly detection is a system of small components, not a single model.

6.4 Multivariate anomaly detection (when one metric isn’t enough)

Many real anomalies only show up when you look at multiple signals together:

  • latency rises while QPS is stable (infra saturation)
  • error rate rises only for one endpoint (partial outage)
  • embedding stats shift while overall accuracy proxy is stable (silent drift)

Multivariate approaches:

  • Isolation Forest / One-Class SVM on feature vectors
  • reconstruction error from autoencoders
  • clustering distance to normal centroids

Here’s a small Isolation Forest example (conceptual; production needs careful feature scaling and segment baselines):

```python
from sklearn.ensemble import IsolationForest
import numpy as np

def fit_isolation_forest(X: np.ndarray) -> IsolationForest:
    """
    X shape: [n_samples, n_features]
    Features could include: latency_p95, error_rate, qps, cpu, mem, etc.
    """
    model = IsolationForest(
        n_estimators=200,
        contamination="auto",
        random_state=0,
    )
    model.fit(X)
    return model

def detect_multivariate(model: IsolationForest, x_latest: np.ndarray, score_threshold: float) -> bool:
    # sklearn: higher score = more normal; lower = more anomalous
    score = float(model.score_samples(x_latest.reshape(1, -1))[0])
    return score < score_threshold
```

Operational note:

  • multivariate models are powerful but harder to explain
  • pair them with decomposition-style explanations (top contributor slices) for RCA
  • gate paging on both “model score” and “impact metrics” (e.g., SLO risk)

7. Scaling Strategies

7.1 Cardinality management (the silent killer)

If you alert on metric x country x device x app_version, cardinality explodes. Strategies:

  • alert only on top-K segments (heavy hitters)
  • hierarchical detection:
    • detect anomaly globally
    • then drill down into segments to find top contributors
  • sketching (Count-Min Sketch for counts, t-digest/KLL for quantiles)
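
For illustration, here is an exact top-K heavy-hitter pass (the (segment, count) input shape is an assumption); at real scale you would replace the Counter with a streaming sketch such as Count-Min or Misra-Gries:

```python
from collections import Counter
from typing import Iterable, List, Tuple

def heavy_hitters(events: Iterable[Tuple[str, int]], k: int = 10,
                  min_share: float = 0.01) -> List[Tuple[str, int, float]]:
    """Return the top-K segments and their traffic share.

    Only segments above `min_share` of total volume are worth alerting on;
    everything else is folded into the global aggregate.
    """
    counts = Counter()
    for segment, count in events:
        counts[segment] += count
    total = sum(counts.values()) or 1
    return [(seg, c, c / total)
            for seg, c in counts.most_common(k) if c / total >= min_share]
```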

7.2 State management in streaming

Detectors need history:

  • rolling windows
  • seasonal baselines
  • suppression state

Store state in:

  • stream processor state (RocksDB in Flink)
  • or external state stores (Redis, Cassandra)

7.3 Multi-tenancy

Different teams want different sensitivity. You need:

  • per-metric configs
  • per-tenant budgets (alert quotas)
  • isolation (one tenant’s high-cardinality metrics shouldn’t DoS the system)

7.4 Change-log correlation (bridging detection to action)

The fastest way to reduce MTTR is to correlate anomalies with “what changed”:

  • deploys
  • feature flags
  • config pushes
  • schema changes
  • infrastructure events (autoscaling, failovers)

A pragmatic design:

  • ingest change events into the same stream as metrics
  • attach a “recent changes” panel to every alert
  • compute correlation candidates (“this alert fired within 10 minutes of deploy X in region Y”)

This does not require fancy causality to be useful. In practice, showing the top 3 recent changes near the anomaly often saves hours.
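
A minimal sketch of the correlation step, assuming change events carry a timestamp, a kind, and a free-form description (field names are illustrative):

```python
from datetime import datetime, timedelta
from typing import Dict, List

def correlate_changes(anomaly_start: datetime, changes: List[Dict],
                      lookback: timedelta = timedelta(minutes=30),
                      k: int = 3) -> List[Dict]:
    """Return the most recent change events preceding an anomaly.

    This is correlation, not causation: the goal is simply to surface
    the top few recent changes next to the alert.
    """
    window_start = anomaly_start - lookback
    candidates = [c for c in changes if window_start <= c["ts"] <= anomaly_start]
    candidates.sort(key=lambda c: c["ts"], reverse=True)  # most recent first
    return candidates[:k]
```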

7.5 Collective anomalies (the ones point detectors miss)

Not all anomalies are single spikes. Common anomaly types:

  • point anomaly: one bad point
  • contextual anomaly: normal value at the wrong time (seasonality)
  • collective anomaly: a sustained subtle shift (e.g., p95 latency +5% for 6 hours)

Collective anomalies are the hardest in production because:

  • they are “small enough” to evade thresholds
  • but “long enough” to cause real user impact

Detectors for collective anomalies:

  • EWMA / CUSUM-style change detection
  • burn-rate based alerting (SLO-focused)
  • rolling baseline comparisons (same time last week)
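
A minimal CUSUM-style sketch for sustained small shifts (the drift and threshold are in standard-deviation units and would be tuned per metric; standardizing against the series’ own history is an assumption made here for brevity):

```python
import numpy as np

def cusum_shift(series: np.ndarray, drift: float = 0.5, threshold: float = 5.0) -> bool:
    """One-sided CUSUM on standardized values.

    Small positive deviations accumulate; the detector fires when the
    cumulative sum crosses `threshold`, which catches "+5% for hours"
    shifts that point thresholds miss.
    """
    mu = np.mean(series[:-1])
    sigma = np.std(series[:-1]) + 1e-9
    z = (series - mu) / sigma
    s = 0.0
    for value in z:
        s = max(0.0, s + value - drift)  # reset when below the drift allowance
        if s >= threshold:
            return True
    return False
```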

8. Monitoring & Metrics (for the anomaly system itself)

You need to monitor your monitor:

  • alert volume per tenant
  • false positive rate proxies (how many alerts are auto-acked)
  • time-to-ack, time-to-resolve
  • detector latency and backlog
  • “missing data” alert rates (often instrumentation failures)

And you need a feedback loop: when humans label an alert as “noise”, it should improve the system.

8.1 Evaluating anomaly detection (without perfect labels)

Unlike classification, anomaly detection often lacks ground truth. Practical evaluation strategies:

  • Historical replay
    • replay a week/month of metrics
    • compare alerts to known incidents and change logs

  • Synthetic injections
    • inject controlled anomalies (drop QPS, spike errors, shift distributions)
    • ensure detectors catch them with acceptable delay

  • Proxy labels
    • incident tickets, rollbacks, on-call notes
    • not perfect, but useful for measuring recall on “real pain”

  • Human-in-the-loop tuning
    • collect feedback (noise vs real)
    • tune thresholds and suppression policies

You typically optimize for:

  • high precision for paging alerts
  • high recall for “notify” alerts (lower severity)
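
A minimal sketch of synthetic injection, assuming any detector callable that returns a boolean over a growing window (for example, wrapping fused_detector from section 6 as lambda w: fused_detector(w).is_anomaly):

```python
from typing import Callable, Optional
import numpy as np

def injection_delay(detector: Callable[[np.ndarray], bool], series: np.ndarray,
                    start: int, magnitude: float) -> Optional[int]:
    """Inject a step shift at `start` and report detection delay in points.

    Returns None if the detector never fires: a cheap recall/latency
    estimate that needs no ground-truth labels.
    """
    corrupted = series.astype(float).copy()
    corrupted[start:] += magnitude
    for t in range(start, len(corrupted)):
        if detector(corrupted[: t + 1]):
            return t - start
    return None
```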

8.2 “Impact gating” (avoid paging on harmless anomalies)

A classic production trick: don’t page just because a detector score is high. Page when:

  • anomaly score is high and
  • impact metrics indicate user risk

Examples:

  • error rate anomaly + SLO burn rate increasing
  • latency anomaly + p99 above a hard threshold

This reduces alert fatigue while still catching critical incidents.
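
A minimal sketch of an impact gate (all gate values below are placeholders; real values come from SLO definitions and per-service criticality):

```python
def should_page(anomaly_score: float, slo_burn_rate: float, p99_latency_ms: float,
                score_gate: float = 0.8, burn_gate: float = 2.0,
                p99_gate_ms: float = 1500.0) -> bool:
    """Page only when the detector is confident AND an impact metric agrees."""
    detector_confident = anomaly_score >= score_gate
    user_impact = slo_burn_rate >= burn_gate or p99_latency_ms >= p99_gate_ms
    return detector_confident and user_impact
```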


9. Failure Modes (and Mitigations)

9.1 Alert fatigue

Mitigation:

  • dedupe + grouping
  • severity levels
  • alert budgets and rate limits
  • better routing + suppression windows

9.2 Seasonality false positives

Mitigation:

  • baseline models (daily/weekly)
  • compare to same time last week
  • use robust estimators

9.3 Missing data vs true zero

Mitigation:

  • explicit “heartbeat” signals
  • separate pipelines for “metric missing” vs “value is zero”

9.4 Feedback loops and gaming

If teams learn to “silence alerts”, your system loses trust. Mitigation:

  • governance: who can suppress, for how long
  • audit logs
  • periodic review of suppressed alerts

9.5 High-cardinality explosions

A common production outage pattern:

  • someone adds a new label/dimension (e.g., user_id)
  • cardinality explodes
  • storage and stream processing costs spike
  • detectors become meaningless (each series has too few points)

Mitigations:

  • enforce label allowlists/denylists
  • use hierarchical detection (alert on aggregate first, then drill down)
  • cap top-k dimensions and use heavy-hitter sketches
  • charge back cost to teams for uncontrolled cardinality (governance matters)

9.6 “Anomaly” isn’t always “bad” (launches look like incidents)

Product launches and marketing events create legitimate shifts. If the system treats every spike as an incident, it loses credibility.

Mitigations:

  • change-log correlation: if a known launch happened, adjust baseline
  • planned maintenance windows (suppression)
  • “expected anomaly” annotations (Grafana-style)

This is another reason anomaly detection is a system: you need metadata and governance, not just math.


10. Real-World Case Study

Netflix/Streaming quality

Netflix-like systems track:

  • playback start failures
  • buffering ratio
  • CDN health

Anomalies matter because:

  • small regressions affect millions of users quickly
  • seasonality exists (prime time)

Common pattern:

  • global anomaly triggers
  • drill-down identifies region/device/CDN nodes as top contributors
  • routing goes to the owning team (CDN vs player vs backend)

10.1 A second case study: data pipeline anomaly in ML training

Consider a feature pipeline that writes training data daily. One day, an upstream service changes a field from “milliseconds” to “seconds”.

What happens:

  • models still train (no crash)
  • loss might even look “okay”
  • but downstream predictions degrade subtly

Anomaly detection can catch this early with:

  • schema validation: units and ranges
  • distribution shift: feature histograms vs baseline
  • training metrics drift: gradient norms, embedding stats

The key lesson: the most expensive ML failures are often “silent correctness failures”. You need anomaly detection not just for infra metrics, but for data and model health signals.
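
A minimal sketch of a distribution-shift check that would catch the ms → s change, using a Wasserstein distance against yesterday’s baseline plus a gross scale check (the thresholds are per-feature assumptions):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def feature_shift_check(baseline: np.ndarray, current: np.ndarray,
                        max_distance: float, max_scale_ratio: float = 5.0) -> dict:
    """Flag a feature whose distribution moved or whose scale changed grossly.

    A unit change like ms -> s shows up as a ~1000x drop in the robust
    location, which the scale check catches even if the shape looks similar.
    """
    dist = wasserstein_distance(baseline, current)
    scale_ratio = np.median(current) / (np.median(baseline) + 1e-9)
    return {
        "wasserstein": float(dist),
        "scale_ratio": float(scale_ratio),
        "shifted": bool(dist > max_distance
                        or scale_ratio > max_scale_ratio
                        or scale_ratio < 1.0 / max_scale_ratio),
    }
```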

10.2 An “RCA-first” UI pattern

When an alert fires, show:

  • the metric that triggered
  • the baseline comparison (last week, seasonal)
  • the top contributing dimensions (country/device/version)
  • links to relevant dashboards and runbooks
  • recent change log correlation (deploys, config changes, schema changes)

This turns anomaly detection from “pager noise” into a decision system.


11. Cost Analysis

Cost drivers:

  • stream processing state (memory/disk)
  • high-cardinality storage
  • alert routing and case management integration

Cost levers:

  • limit cardinality and use sketches
  • run heavy ML detectors only when cheaper detectors suspect an anomaly
  • compress histories and store only what’s needed for RCA windows

12. Key Takeaways

  1. Anomaly detection is a system, not a model: ingestion, features, detectors, alerting, RCA.
  2. Boundaries and invariants matter: define “normal” and detect boundary violations, like two-pointer reasoning.
  3. Operational success = low false positives + good explanations: otherwise no one trusts alerts.

12.1 Appendix: detector selection cheat sheet

Use this as a quick mapping from problem → detector:

| Signal type | Typical anomaly | Best first detector | Notes |
|---|---|---|---|
| Counters (QPS, errors) | drop to zero, spike | rules + burn-rate | also detect missing vs true zero |
| Latency percentiles | sustained shift | EWMA / baseline compare | seasonality is common |
| High-dimensional telemetry | weird combinations | Isolation Forest / AE | require careful explanation |
| Categorical slices | one segment dominates | hierarchical drill-down | “top contributors” is key |
| Data features (ML) | distribution shift | histogram distance + rules | unit changes are common |

Rule of thumb:

  • page on rules + high-impact metrics
  • use ML scores as supporting evidence or for “notify” tier alerts

12.2 Appendix: an alerting playbook (how to keep trust)

If your anomaly system is noisy, it dies. A practical playbook:

  • Start narrow
    • detect a small set of critical metrics
    • keep paging volume low

  • Add explainability
    • top contributors
    • baseline comparisons
    • recent change log correlation

  • Use severity tiers
    • paging only for high-confidence/high-impact signals
    • Slack-only for exploratory detectors

  • Build feedback
    • allow “noise” labeling
    • review suppressed alerts periodically

This is the same “pattern recognition” lesson as the DSA problem: don’t react to every fluctuation; define the boundary of “actionable abnormal”.

12.3 Appendix: baseline modeling options (from simplest to strongest)

When teams say “anomaly detection”, they often mean “baseline modeling”. Here’s a pragmatic ladder:

  • Static thresholds
    • best for: hard limits (error_rate must be < 1%)
    • worst for: seasonal metrics

  • Rolling robust baseline (median/MAD)
    • best for: noisy but mostly stationary signals
    • robust to spikes and outliers

  • Seasonal baseline (same time last week)
    • best for: strong weekly patterns
    • easy to explain and cheap to run

  • Holt-Winters / ETS
    • best for: smooth seasonality + trend
    • good interpretability

  • Learned baselines (per segment)
    • best for: complex multi-segment products
    • requires strong governance to avoid accidental bias

Production tip: start with seasonal comparisons + robust statistics. Only move to heavier models when simpler baselines fail.

12.4 Appendix: a minimal “config contract” for detectors

To run this system safely at scale, you need a configuration contract:

  • metric name
  • aggregation window (1m/5m/1h)
  • baseline method
  • sensitivity / threshold
  • severity tier (log/notify/page)
  • routing owner (team/on-call)
  • suppression windows (maintenance)

This turns anomaly detection into an operable platform: teams can onboard new signals without changing code, and you can audit changes.
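
One way to encode that contract is a typed config object that the platform validates at onboarding time; the field names and example values below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass(frozen=True)
class DetectorConfig:
    metric: str                      # e.g. "checkout.error_rate"
    window: str = "5m"               # aggregation window: 1m / 5m / 1h
    baseline: str = "wow_median"     # static / rolling_mad / wow_median / holt_winters
    sensitivity: float = 4.0         # robust z-score threshold or model score cutoff
    severity: str = "notify"         # log / notify / page
    owner: str = "team-unset"        # routing target (team / on-call rotation)
    suppression_windows: List[str] = field(default_factory=list)  # maintenance windows
    runbook_url: Optional[str] = None

# Example onboarding entry (no code change needed to add a new signal):
checkout_errors = DetectorConfig(
    metric="checkout.error_rate",
    window="1m",
    baseline="rolling_mad",
    sensitivity=5.0,
    severity="page",
    owner="payments-oncall",
    runbook_url="https://runbooks.example.com/checkout-errors",  # placeholder
)
```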

12.5 Appendix: anomaly vs drift vs outlier (how teams get confused)

These terms are often used interchangeably, but they’re different:

  • Outlier: a single unusual point relative to its neighbors.
    • Example: one batch has a huge spike in null values.

  • Anomaly: a point or segment that violates an expected pattern.
    • Example: error rate spike at 2am that is not part of normal seasonality.

  • Drift: the underlying distribution changes over time.
    • Example: feature distribution slowly shifts after a product change.

Operationally:

  • outliers are often “data quality” issues
  • anomalies are “incident” candidates
  • drift is “model health” and requires different response (retraining, feature changes)

The best platforms support all three, but keep the response paths distinct:

  • anomalies page humans
  • drift triggers investigation and model iteration
  • outliers often trigger data validation and pipeline fixes

12.6 Appendix: an on-call runbook for anomaly alerts

When an alert pages a human, the system should also provide a runbook-style checklist. A practical sequence:

  1. Confirm impact
    • are user-facing SLOs burning?
    • is this confined to a segment (region/device) or global?
  2. Check data validity
    • missing data vs real zeros
    • instrumentation changes (new labels, new units)
  3. Correlate with recent changes
    • deploys, feature flags, config changes
    • schema changes in pipelines
  4. Drill down
    • top contributors by dimension
    • compare against historical baselines (same time last week)
  5. Mitigate
    • rollback suspect deploys
    • fail over traffic if it’s infra-related
    • suppress alerts only with an explicit maintenance annotation

This checklist turns “anomaly detection” into operational reality: consistent response beats clever scoring.

12.7 Appendix: testing detectors safely before production

Before enabling paging:

  • run in shadow mode (log-only) for 1–2 weeks
  • replay historical incidents and ensure alerts would have fired
  • inject synthetic anomalies (spikes, drops, sustained shifts)
  • check alert volume budgets per team

If you do this, you avoid the most common failure: shipping a detector that immediately pages everyone and loses trust forever.

12.8 Appendix: cost and scaling intuition (why platforms exist)

Teams often start anomaly detection as ad-hoc scripts per service. At small scale, that works. At org scale, it collapses due to:

  • duplicated logic (everyone re-implements EWMA differently)
  • inconsistent thresholds and alert policies
  • uncontrolled cardinality costs
  • lack of auditability (“who changed this threshold?”)

That’s why anomaly detection becomes a platform:

  • shared ingestion and schema validation
  • shared baseline modeling primitives
  • shared routing/suppression/governance
  • shared RCA UI patterns (top contributors, change-log correlation)

If you can explain this evolution clearly in interviews, it signals strong “systems taste”.

12.9 Appendix: multi-tenancy and fairness (an under-discussed risk)

Anomaly systems can create “fairness” issues at the org level:

  • teams with noisier metrics consume disproportionate paging attention
  • teams with low traffic get ignored because their signals are sparse
  • some regions/devices get under-monitored because baselines are not segment-aware

Mitigations:

  • per-tenant alert budgets (rate limits and quotas)
  • segment-aware baselines for critical dimensions (region/device/app version)
  • “coverage” metrics: which segments have enough data to monitor reliably
  • explicit ownership: every alert route must map to a team/runbook

This is the same lesson as other large platforms: governance is part of the technical design.

12.10 Appendix: a minimal anomaly “triage packet” (what every alert should carry)

To make alerts actionable, every alert should include a small, standardized packet:

  • What: metric name, current value, baseline value, anomaly score
  • When: start time, duration, detection time
  • Where: top affected segments (region/device/version) with contribution deltas
  • Impact: SLO burn rate / user-impact proxy
  • Why likely: top recent change events correlated (deploy/flag/config/schema)
  • Next: runbook link + owning team + mitigation suggestions (rollback, failover)

This single design choice reduces MTTR more than most detector tweaks, because it turns “math output” into “operational context”.


Originally published at: arunbaby.com/ml-system-design/0052-anomaly-detection

If you found this helpful, consider sharing it with others who might benefit.