Model Replication Systems
“Ensuring your ML models are available everywhere, all the time.”
TL;DR
Model replication deploys multiple copies of ML models across servers and regions for high availability (99.99 percent uptime), low latency (serve from nearest region), and load balancing. Active-active replication maximizes throughput while canary deployments gradually shift traffic to reduce risk. Synchronization uses push-based, pull-based (with jitter to prevent thundering herds), or event-driven (Kafka) patterns. Kubernetes HPA auto-scales based on CPU or custom metrics, while Docker containerization and model registries ensure reproducible deployments. For monitoring these replicas in production, see model monitoring systems, and for the pipeline that feeds new models, see ML pipeline orchestration.

1. The Problem: Single Point of Failure
Deploying an ML model to a single server is a recipe for disaster:
- Server crashes → Service unavailable.
- Network partition → Users in Europe can’t reach a US-based server.
- Traffic spike → Server overloaded.
Model Replication solves this by deploying multiple copies of the model across different servers, regions, or data centers.
2. Why Replicate Models?
1. High Availability (HA):
- If one replica fails, traffic routes to healthy replicas.
- Target: 99.99% uptime (< 53 minutes downtime per year).
2. Low Latency:
- Serve users from the nearest replica.
- US users → US server, EU users → EU server.
- Can cut latency dramatically (e.g., from ~200ms cross-continent to ~20ms in-region).
3. Load Balancing:
- Distribute inference requests across replicas.
- Prevents any single server from being overwhelmed.
4. Disaster Recovery:
- If an entire data center goes down (fire, earthquake), other regions continue serving.
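The availability numbers above follow from simple probability: independent replicas multiply their failure probabilities. A quick sketch of the arithmetic:

```python
def combined_availability(a, n):
    """Probability that at least one of n independent replicas is up."""
    return 1 - (1 - a) ** n

def downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60

print(round(downtime_minutes_per_year(0.9999), 1))  # 52.6 -> the "< 53 minutes" target
print(combined_availability(0.99, 3))               # three 99% replicas -> ~99.9999%
```

This is why replication is so effective: even modestly reliable servers, combined, reach availability no single server can.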
3. Replication Strategies
Strategy 1: Active-Active (Multi-Master)
All replicas actively serve traffic simultaneously.
Architecture:
```
          Load Balancer
         /      |      \
  Replica1  Replica2  Replica3
  (US-West) (US-East) (EU-West)
```
Pros:
- Full traffic distribution.
- Maximum throughput.
Cons:
- Requires synchronization if models have state (rare for inference).
Strategy 2: Active-Passive (Primary-Standby)
One replica serves traffic. Others are on standby.
Architecture:
```
      Primary (Active)
       /           \
  Standby 1     Standby 2  (Passive)
```
Pros:
- Simple to implement.
- Standby can be used for testing new models.
Cons:
- Underutilized resources (standby idles).
- Failover delay (10-60 seconds).
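The failover logic for this pattern can be sketched as a router that promotes the first healthy standby when the primary's health check fails (a sketch; the `Replica` class and its `healthy()` method are illustrative stand-ins for real probes):

```python
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self._healthy = healthy

    def healthy(self):
        # In production this would be a real probe (HTTP health check, heartbeat)
        return self._healthy

def route(primary, standbys):
    """Serve from the primary; fail over to the first healthy standby."""
    if primary.healthy():
        return primary
    for standby in standbys:
        if standby.healthy():
            return standby
    raise RuntimeError("no healthy replica available")

# Usage
primary = Replica("primary", healthy=False)
standbys = [Replica("standby-1"), Replica("standby-2")]
print(route(primary, standbys).name)  # standby-1
```

The 10-60 second failover delay comes from how long it takes the health check to notice the primary is down, not from this routing step itself.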
Strategy 3: Geo-Replication
Replicas deployed in different geographic regions.
Architecture:
US-West Cluster <---> US-East Cluster <---> EU Cluster
Pros:
- Complies with data residency laws (GDPR: EU data stays in EU).
- Low latency for global users.
Cons:
- Higher cost (multi-region infrastructure).
- Complex network topology.
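Serving users from the nearest region usually reduces to a preference-ordered lookup with health-aware fallback (a sketch; the region names, endpoints, and country mapping are illustrative):

```python
# Regional endpoints and per-country preference order (illustrative names)
REGIONS = {
    "us-west": {"endpoint": "https://us-west.models.example.com", "healthy": True},
    "eu-west": {"endpoint": "https://eu-west.models.example.com", "healthy": True},
}
PREFERENCES = {"US": ["us-west", "eu-west"], "EU": ["eu-west", "us-west"]}

def pick_endpoint(user_country):
    """Return the nearest healthy regional endpoint, falling back in preference order."""
    for region in PREFERENCES.get(user_country, list(REGIONS)):
        if REGIONS[region]["healthy"]:
            return REGIONS[region]["endpoint"]
    raise RuntimeError("no healthy region available")

print(pick_endpoint("EU"))  # https://eu-west.models.example.com
REGIONS["us-west"]["healthy"] = False
print(pick_endpoint("US"))  # falls back to eu-west
```

In practice this lives in DNS (GeoDNS, latency-based routing) or a global load balancer rather than application code, but the decision logic is the same.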
4. Model Synchronization Patterns
Challenge: How do we ensure all replicas serve the same model version?
Pattern 1: Push-Based Deployment
A central controller pushes new models to all replicas.
Flow:
- Train new model.
- Upload to central storage (S3).
- Controller triggers deployment to all replicas.
- Replicas pull the model and reload.
```python
# Pseudo-code for the controller
def deploy_model(model_path, replicas):
    for replica in replicas:
        replica.pull_model(model_path)
        replica.reload()
```
Pros:
- Centralized control.
- Ensures consistency.
Cons:
- Single point of failure (controller).
- Deployment can be slow (sequential updates).
Pattern 2: Pull-Based Deployment (Polling)
Each replica periodically checks for new models.
Flow:
- Upload model to S3.
- Replicas poll S3 every 60 seconds.
- If new model detected, download and reload.
```python
import time
import hashlib  # used by get_model_hash to fingerprint the artifact

def poll_for_updates(model_url, current_hash):
    while True:
        new_hash = get_model_hash(model_url)  # e.g., SHA-256 of the artifact
        if new_hash != current_hash:
            download_model(model_url)
            reload_model()
            current_hash = new_hash
        time.sleep(60)
```
Pros:
- Decentralized (no controller).
- Replicas update independently.
Cons:
- Polling overhead.
- Inconsistent state (replicas update at different times).
Pattern 3: Event-Driven Deployment
Replicas subscribe to a message queue (Kafka, SNS). Controller publishes a “new model available” event.
Flow:
- Upload model to S3.
- Publish a message to the Kafka topic `model-updates`.
- Replicas consume the message and download the new model.
```python
from kafka import KafkaConsumer

consumer = KafkaConsumer('model-updates')
for message in consumer:
    model_url = message.value.decode('utf-8')  # message values are bytes by default
    download_model(model_url)
    reload_model()
```
Pros:
- Real-time updates.
- Decoupled controller and replicas.
Cons:
- Requires message queue infrastructure.
- Ordering guarantees needed.
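The ordering problem can be handled without broker-level guarantees by attaching a monotonically increasing version to each event and ignoring anything stale (a sketch; the event shape is an assumption):

```python
class ModelUpdater:
    def __init__(self):
        self.current_version = 0

    def on_event(self, event):
        """Apply an update only if it is strictly newer than the serving version."""
        if event["version"] <= self.current_version:
            return False  # Stale or duplicate event: ignore it
        # download_model(event["url"]) and reload_model() would run here
        self.current_version = event["version"]
        return True

updater = ModelUpdater()
print(updater.on_event({"version": 2, "url": "s3://models/v2"}))  # True
print(updater.on_event({"version": 1, "url": "s3://models/v1"}))  # False (arrived late)
```

This makes updates idempotent, so replayed or reordered Kafka messages cannot roll a replica back to an older model.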
5. Deep Dive: Canary Deployment
Problem: A new model might have bugs. Rolling out to all replicas at once is risky.
Solution: Canary Deployment
- Deploy new model to 1% of replicas (canaries).
- Monitor metrics (latency, error rate, accuracy).
- If healthy, gradually increase to 10%, 50%, 100%.
- If unhealthy, rollback immediately.
Architecture:
```
       Load Balancer
            |
  ├─ 99% traffic → Old Model (v1)
  └─  1% traffic → New Model (v2)  [Canary]
```
Implementation with Kubernetes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-v1
spec:
  replicas: 99
  selector:            # required by the Deployment API
    matchLabels:
      app: model
      version: v1
  template:
    metadata:
      labels:
        app: model
        version: v1
    spec:
      containers:
      - name: model
        image: model:v1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-v2-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model
      version: v2
  template:
    metadata:
      labels:
        app: model
        version: v2
    spec:
      containers:
      - name: model
        image: model:v2
```
Monitoring Canary:
```python
def monitor_canary(canary_metrics, baseline_metrics):
    if canary_metrics['error_rate'] > baseline_metrics['error_rate'] * 1.5:
        rollback()
    elif canary_metrics['p99_latency'] > baseline_metrics['p99_latency'] * 1.2:
        rollback()
    else:
        promote_canary()
```
6. Deep Dive: Blue-Green Deployment
Concept: Maintain two identical environments: Blue (current) and Green (new).
Flow:
- Blue is live (serving 100% traffic).
- Deploy new model to Green.
- Test Green (smoke tests, load tests).
- Switch traffic from Blue to Green (atomic switch).
- If issues arise, switch back to Blue instantly.
Pros:
- Zero-downtime deployment.
- Instant rollback.
Cons:
- Requires 2x resources (both environments running).
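The "atomic switch" is what makes Blue-Green attractive: routing flips in a single operation (in Kubernetes, typically by patching a Service's selector). A minimal sketch of the routing logic, with stand-in models since the real switch happens at the load balancer:

```python
class BlueGreenRouter:
    def __init__(self, blue_model, green_model):
        self.models = {"blue": blue_model, "green": green_model}
        self.live = "blue"  # Blue serves 100% of traffic initially

    def predict(self, features):
        return self.models[self.live].predict(features)

    def switch(self):
        # Atomic switch: a single assignment flips all traffic at once
        self.live = "green" if self.live == "blue" else "blue"

# Usage with stand-in models (real deployments point at two environments)
class _StubModel:
    def __init__(self, version):
        self.version = version
    def predict(self, features):
        return self.version

router = BlueGreenRouter(_StubModel("v1"), _StubModel("v2"))
print(router.predict({}))  # v1  (Blue is live)
router.switch()            # Green passed its smoke tests; flip traffic
print(router.predict({}))  # v2
```

Rollback is the same operation in reverse, which is why it is instant.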
7. Deep Dive: Model Versioning and Rollback
Challenge: How do we track which model version is deployed where?
Solution: Model Registry
- Central database of all model versions.
- Each model tagged with: version, timestamp, accuracy metrics, deployment status.
Schema:
```sql
CREATE TABLE models (
    model_id UUID PRIMARY KEY,
    version VARCHAR,
    created_at TIMESTAMP,
    accuracy FLOAT,
    deployed_replicas INTEGER,
    status VARCHAR  -- 'training', 'canary', 'production', 'deprecated'
);
```
Rollback Strategy:
- Detect issue (alert triggered).
- Query model registry for last known good version.
- Trigger redeployment to all replicas.
```python
def rollback():
    last_good_version = db.query(
        "SELECT version FROM models "
        "WHERE status = 'production' "
        "ORDER BY created_at DESC LIMIT 1"
    )
    deploy_model(last_good_version, replicas=all_replicas)
```
8. Deep Dive: Handling Stateful Models
Most inference models are stateless (same input → same output). But some models maintain state:
Examples:
- Recommendation Systems: User session history.
- Chatbots: Conversation context.
- Reinforcement Learning: Agent state.
Challenge with Replication: If a user’s first request goes to Replica 1, the second request might go to Replica 2 (which doesn’t have the session state).
Solution 1: Sticky Sessions Route all requests from the same user to the same replica.
```nginx
upstream backend {
    ip_hash;  # hash the client IP so a given user sticks to one server
    server replica1:8000;
    server replica2:8000;
}
```
Cons: If a replica dies, user sessions are lost.
Solution 2: Shared State Store (Redis) All replicas read/write state to a central Redis cluster.
```python
import redis

redis_client = redis.Redis()

def predict(user_id, features):
    # Load user state from the shared store
    state = redis_client.get(f"user:{user_id}:state")
    # Run inference
    result = model.predict(features, state)
    # Update state
    redis_client.set(f"user:{user_id}:state", result['new_state'])
    return result['prediction']
```
Pros: Stateless replicas, can route to any replica. Cons: Redis becomes a bottleneck.
9. Deep Dive: Cross-Region Replication
Scenario: Replicate a recommendation model across US, EU, and Asia.
Challenges:
- Model Artifact Size: 5 GB model → slow to transfer across continents.
- Network Latency: EU → Asia = 300ms.
- Cost: Cross-region data transfer is expensive ($0.02/GB in AWS).
Optimization 1: Delta Updates Don’t transfer the entire model. Only send the changed weights.
```python
def compute_delta(old_model, new_model):
    delta = {}
    for layer in new_model.layers:
        delta[layer.name] = new_model.weights[layer.name] - old_model.weights[layer.name]
    return delta

def apply_delta(model, delta):
    for layer_name, weight_diff in delta.items():
        model.weights[layer_name] += weight_diff
```
Optimization 2: Model Compression Compress the model before transfer (gzip, quantization).
```python
import gzip
import shutil

# Compress the serialized model before cross-region transfer
with open('model.pkl', 'rb') as f_in:
    with gzip.open('model.pkl.gz', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams in chunks; safe for binary files
# Transfer the compressed file (e.g., 5 GB → 1 GB)
```
Optimization 3: Regional Model Stores Replicate the model to regional S3 buckets.
- US replicas pull from `s3://models-us-west/`
- EU replicas pull from `s3://models-eu-west/`

```python
def get_model_url(region):
    base_url = f"s3://models-{region}"
    return f"{base_url}/model-v123.pkl"
```
10. Deep Dive: Health Checks and Auto-Scaling
Health Check Types:
- Liveness Probe: Is the server process running?
  - HTTP GET `/health` → 200 OK.
- Readiness Probe: Is the server ready to serve traffic?
  - Checks: Is the model loaded? Is the database connected?
```python
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
model = None

@app.on_event("startup")
def load_model():
    global model
    model = load_model_from_s3()

@app.get("/health")
def liveness():
    return {"status": "alive"}

@app.get("/ready")
def readiness():
    if model is None:
        # FastAPI needs an explicit response object to return a non-200 status
        return JSONResponse(status_code=503, content={"status": "not ready"})
    return {"status": "ready"}
```
Auto-Scaling: Scale replicas based on traffic.
Kubernetes Horizontal Pod Autoscaler (HPA):
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
When CPU > 70%, Kubernetes spawns more pods. When CPU < 70%, it scales down.
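The HPA's scaling decision follows the documented proportional rule, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A quick sketch of that arithmetic:

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization=70):
    """Kubernetes HPA core formula:
    desired = ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

print(desired_replicas(10, 90))  # CPU at 90% → scale 10 pods up to 13
print(desired_replicas(10, 35))  # CPU at 35% → scale down to 5
```

Note the real controller adds tolerances and stabilization windows on top of this formula to avoid flapping.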
11. Real-World Case Studies
Case Study 1: Netflix Model Replication
Netflix uses A/B testing at scale across global replicas.
Architecture:
- Deploy Model A to US-West.
- Deploy Model B to US-East.
- Compare engagement metrics (watch time).
- Winner gets deployed globally.
Case Study 2: Uber’s Michelangelo
Uber’s ML platform deploys models to hundreds of cities.
Challenge: Each city has different demand patterns. Solution: City-specific models.
`model-san-francisco-v10`, `model-new-york-v12`
Replication:
- Each city’s model is replicated 10x within that region’s data center.
Case Study 3: Spotify Recommendations
Spotify serves personalized playlists to 500M users.
Architecture:
- 50+ replicas per region.
- Models deployed via Kubernetes.
- Canary deployment for new models (1% → 100%).
Implementation: Model Replication with Docker + Kubernetes
Step 1: Dockerize the Model
```dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY model.pkl model.pkl
COPY serve.py serve.py
CMD ["python", "serve.py"]
```
Step 2: Deploy to Kubernetes
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-inference
spec:
  replicas: 10
  selector:
    matchLabels:
      app: model
  template:
    metadata:
      labels:
        app: model
    spec:
      containers:
      - name: model
        image: my-registry/model:v1
        ports:
        - containerPort: 8000
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
```
Step 3: Expose via Load Balancer
```yaml
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  type: LoadBalancer
  selector:
    app: model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8000
```
Top Interview Questions
Q1: How do you ensure all replicas serve the same model version? Answer:
- Use a model registry with version tracking.
- Deploy atomically (all replicas pull the same version).
- Use checksums/hashes to verify model integrity.
Q2: What happens if a replica is serving an old model version? Answer:
- Monitoring: Track model version per replica.
- Alert: If version mismatch detected, trigger alert.
- Force Update: Controller sends a “reload model” command.
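Those three steps amount to a controller-side reconciliation loop (a sketch; replicas are modeled as dicts and the alert/forced-reload calls are simulated):

```python
def check_versions(replicas, expected_version):
    """Find replicas serving a stale model and force them to reload."""
    stale = [r for r in replicas if r["version"] != expected_version]
    for replica in stale:
        # In production: fire an alert, then send a "reload model" command
        replica["version"] = expected_version  # simulate the forced reload
    return [r["id"] for r in stale]

fleet = [
    {"id": 1, "version": "v2"},
    {"id": 2, "version": "v1"},  # lagging replica
    {"id": 3, "version": "v2"},
]
print(check_versions(fleet, "v2"))  # [2]
```

Running this on a schedule turns version drift from a silent failure into a self-healing event.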
Q3: How do you handle model deployment in a multi-region setup? Answer:
- Regional Rollout: Deploy to one region first (canary).
- Monitor: Check error rates, latency.
- Progressive Rollout: If healthy, deploy to other regions.
Q4: What’s the difference between horizontal and vertical scaling for model replicas? Answer:
- Horizontal: Add more replicas (more servers). Better for handling high traffic.
- Vertical: Increase resources per replica (more CPU/RAM). Better for larger models.
Q5: How do you minimize downtime during model updates? Answer:
- Rolling Update: Update replicas one at a time.
- Blue-Green: Maintain two environments, switch traffic atomically.
- Canary: Gradually shift traffic to the new model.
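The rolling-update option from Q5 can be sketched as a loop that updates one replica at a time and halts on the first health-check failure (a sketch; replicas are dicts and `is_healthy` stands in for a real readiness probe):

```python
def rolling_update(replicas, new_version, is_healthy):
    """Update replicas one at a time; halt and roll back on the first failure."""
    for replica in replicas:
        previous = replica["version"]
        replica["version"] = new_version  # in production: drain, deploy, warm up
        if not is_healthy(replica):
            replica["version"] = previous  # undo this replica and stop the rollout
            return False
    return True

fleet = [{"id": i, "version": "v1"} for i in range(3)]
ok = rolling_update(fleet, "v2", is_healthy=lambda r: True)
print(ok, [r["version"] for r in fleet])  # True ['v2', 'v2', 'v2']
```

Because only one replica is out of rotation at any moment, capacity never drops by more than one replica's worth during the rollout.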
12. Deep Dive: Shadow Deployment (Dark Launch)
Concept: Deploy a new model alongside the old model, but don’t serve its predictions to users. Instead, log predictions for comparison.
Architecture:
```
User Request
     ↓
Load Balancer
     ↓
┌───────────────────────┐
│ Primary Model (v1)    │ → Response to user
│ Shadow Model (v2)     │ → Predictions logged (not served)
└───────────────────────┘
```
Implementation:
```python
import logging

class ShadowPredictor:
    def __init__(self, primary_model, shadow_model):
        self.primary = primary_model
        self.shadow = shadow_model

    def predict(self, features):
        # Primary prediction (served to the user)
        primary_pred = self.primary.predict(features)
        # Shadow prediction (logged, never served; in production, run this
        # asynchronously so it cannot add latency to the user path)
        try:
            shadow_pred = self.shadow.predict(features)
            logging.info(f"Shadow pred: {shadow_pred}, Primary: {primary_pred}")
        except Exception as e:
            logging.error(f"Shadow model failed: {e}")
        return primary_pred
```
Benefits:
- Zero Risk: Users never see shadow model predictions.
- Real Production Data: Validate on actual user queries.
- Performance Comparison: Compare latency, accuracy in real-time.
Use Case: Testing a radically different model architecture without risk.
13. Deep Dive: Gradual Traffic Shifting
More sophisticated than binary canary deployment, gradually shift traffic over hours/days.
Strategy:
Hour 0: 0% v2, 100% v1
Hour 1: 5% v2, 95% v1
Hour 2: 10% v2, 90% v1
Hour 4: 25% v2, 75% v1
Hour 8: 50% v2, 50% v1
Hour 12: 100% v2, 0% v1
Implementation with Feature Flags:
```python
import hashlib

class FeatureFlag:
    def __init__(self):
        self.rollout_percentage = 0  # Start at 0%

    def should_use_new_model(self, user_id):
        # Stable hashing: the same user always lands in the same bucket.
        # (Python's built-in hash() is salted per process, so use hashlib.)
        user_hash = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % 100
        return user_hash < self.rollout_percentage

    def set_rollout(self, percentage):
        self.rollout_percentage = percentage

# Usage
flag = FeatureFlag()
flag.set_rollout(25)  # 25% of users see the new model

def predict(user_id, features):
    if flag.should_use_new_model(user_id):
        return model_v2.predict(features)
    return model_v1.predict(features)
```
14. Deep Dive: Model Performance Monitoring
Metrics to Track:
1. Latency Metrics:
- P50 (Median): 50% of requests complete in X ms.
- P95: 95% of requests complete in X ms.
- P99: 99% of requests (worst 1%) complete in X ms.
Why P99 matters: Even if P50 is 10ms, P99 of 5000ms means some users have terrible experience.
2. Throughput Metrics:
- Requests Per Second (RPS): How many requests the replica can handle.
- Saturation: % of capacity used. If > 80%, scale up.
3. Error Metrics:
- Error Rate: % of requests that fail.
- Error Types: Timeout, Model Error, OOM, Network Error.
4. Business Metrics:
- Click-Through Rate (CTR): For recommendation models.
- Conversion Rate: For ranking models.
- Revenue Per User: For personalization models.
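The P50/P95/P99 figures above are typically computed by rank over a window of samples. A minimal nearest-rank sketch (production systems instead estimate percentiles from histogram buckets, as the Prometheus example below the metrics does):

```python
import math

def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

samples = list(range(1, 101))  # latencies of 1..100 ms
print(percentile(samples, 50))  # 50
print(percentile(samples, 99))  # 99
```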
Prometheus Example:
```python
from prometheus_client import Counter, Histogram, Gauge

# Counter: total predictions served
prediction_counter = Counter(
    'model_predictions_total',
    'Total predictions',
    ['model_version', 'replica_id']
)

# Histogram: prediction latency distribution
prediction_latency = Histogram(
    'model_prediction_latency_seconds',
    'Prediction latency',
    ['model_version'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

# Gauge: current number of active replicas
active_replicas = Gauge(
    'model_active_replicas',
    'Number of active replicas',
    ['model_version']
)

# Usage
@app.post("/predict")
def predict(features):
    with prediction_latency.labels(model_version='v2').time():
        result = model.predict(features)
    prediction_counter.labels(
        model_version='v2',
        replica_id=get_replica_id()
    ).inc()
    return result
```
15. Deep Dive: Cost Optimization
Replicating models is expensive. How do we minimize cost without sacrificing reliability?
Strategy 1: Right-Sizing Replicas Don’t over-provision. Use actual traffic data to determine replica count.
Formula: \[ \text{Required Replicas} = \frac{\text{Peak RPS}}{\text{RPS per Replica}} \times \text{Safety Factor} \]
Example:
- Peak traffic: 10,000 RPS
- Each replica handles: 500 RPS
- Safety factor: 1.5 (for spikes)
- Required: (10,000 / 500) × 1.5 = 30 replicas
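The formula rounds up, since a fractional replica must become a whole one. A one-liner makes it reusable:

```python
import math

def required_replicas(peak_rps, rps_per_replica, safety_factor=1.5):
    """Replicas needed to absorb peak traffic with headroom for spikes."""
    return math.ceil(peak_rps / rps_per_replica * safety_factor)

print(required_replicas(10_000, 500))  # 30, matching the example above
```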
Strategy 2: Spot Instances for Non-Critical Replicas AWS Spot Instances are 70% cheaper but can be terminated.
```yaml
# Kubernetes pod spec targeting spot instance nodes
spec:
  nodeSelector:
    instance-type: spot
  tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
Use Case: Use spot instances for 50% of replicas. If terminated, route to on-demand replicas.
Strategy 3: Model Quantization Reduce model size from FP32 to INT8.
- Size: 4 GB → 1 GB (4x reduction)
- Inference Speed: 2-3x faster
- Cost: Fit 4x more replicas per server
```python
import os
import torch

# Quantize weights from FP32 to INT8
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Size comparison
original_size = os.path.getsize('model_fp32.pt')
quantized_size = os.path.getsize('model_int8.pt')
print(f"Compression ratio: {original_size / quantized_size:.2f}x")
```
Strategy 4: Auto-Scaling with Predictive Scaling Don’t wait for load to spike. Predict it.
```python
from datetime import datetime

def predict_traffic(hour_of_day, day_of_week):
    # Historical average for this time slot
    baseline = historical_traffic[day_of_week][hour_of_day]
    # Provision 20% headroom, scaled up ahead of the predicted spike
    return int(baseline * 1.2)

# Cron job runs every 5 minutes
now = datetime.now()
predicted_rps = predict_traffic(now.hour, now.weekday())
required_replicas = calculate_replicas(predicted_rps)
scale_to(required_replicas)
```
16. Deep Dive: Security Considerations
1. Model Theft Prevention: Large models (GPT-4, Stable Diffusion) are valuable IP. Prevent extraction.
Attack Vector: Adversary queries model repeatedly to reverse-engineer weights. Defense: Rate limiting, query auditing.
```python
from collections import defaultdict
import time

class RateLimiter:
    def __init__(self, max_requests_per_minute=100):
        self.requests = defaultdict(list)
        self.limit = max_requests_per_minute

    def allow_request(self, user_id):
        now = time.time()
        # Drop request timestamps older than 1 minute
        self.requests[user_id] = [
            ts for ts in self.requests[user_id]
            if now - ts < 60
        ]
        if len(self.requests[user_id]) >= self.limit:
            return False
        self.requests[user_id].append(now)
        return True
```
2. Data Privacy: Ensure replicas don’t log sensitive user data.
```python
import re
import logging

def sanitize_logs(log_message):
    # Redact emails and SSNs before logging
    log_message = re.sub(
        r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b',
        '[REDACTED_EMAIL]', log_message, flags=re.I
    )
    log_message = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[REDACTED_SSN]', log_message)
    return log_message

logging.info(sanitize_logs(f"User query: {user_input}"))
```
3. Model Integrity: Verify that the deployed model hasn’t been tampered with.
```python
import hashlib

class SecurityError(Exception):
    pass

def verify_model_integrity(model_path, expected_hash):
    with open(model_path, 'rb') as f:
        model_bytes = f.read()
    computed_hash = hashlib.sha256(model_bytes).hexdigest()
    if computed_hash != expected_hash:
        raise SecurityError("Model file has been tampered with!")
    return True

# On deployment
expected_hash = "abc123...def"  # From the model registry
verify_model_integrity('/models/model.pt', expected_hash)
```
17. Deep Dive: Disaster Recovery Drills
Problem: You have multi-region replication. But have you tested failover?
DR Drill Strategy:
- Planned Outage: Intentionally take down a region.
- Monitor: Verify traffic shifts to healthy regions.
- Measure: Recovery time, user impact.
- Document: Update runbooks.
Chaos Engineering with Netflix’s Chaos Monkey:
```python
import logging
import random

def chaos_monkey():
    # Called periodically; each run has a 10% chance of killing one random replica
    if random.random() < 0.1:
        replica_id = random.choice(get_all_replicas())
        logging.warning(f"Chaos Monkey terminating replica {replica_id}")
        terminate_replica(replica_id)
```
Result: Forces teams to build resilient systems.
18. Production War Stories
War Story 1: The Silent Canary Failure A team deployed a canary model. Metrics looked good (latency, error rate). But revenue dropped 15%. Root Cause: New model was serving lower quality recommendations. Users clicked less. Lesson: Track business metrics, not just technical metrics.
War Story 2: The Cross-Region Sync Disaster A model was deployed to EU-West. The EU-East replica was 2 hours behind (polling delay). Result: Users in Paris saw different recommendations than users in Berlin. Lesson: Use event-driven deployment for consistency guarantees.
War Story 3: The Thundering Herd 1000 replicas all polled S3 at the same time (every 60 seconds, on the minute). Result: S3 rate limit errors. Models failed to update. Lesson: Add jitter to polling intervals.
```python
import time
import random

def poll_with_jitter(base_interval=60):
    while True:
        # Sleep 60s ± 10s so replicas don't all hit S3 at the same instant
        jitter = random.uniform(-10, 10)
        time.sleep(base_interval + jitter)
        check_for_updates()
```
19. Deep Dive: A/B Testing Infrastructure for Model Replicas
Problem: You have model v1 and v2. Which one is better?
Solution: A/B test them on real users.
Architecture:
User Request
↓
Feature Flag Service (reads user_id)
↓
├─ 50% → Model v1 (Control)
└─ 50% → Model v2 (Treatment)
Implementation:
```python
import hashlib

class ABTestRouter:
    def __init__(self):
        self.experiments = {}

    def create_experiment(self, exp_id, control_model, treatment_model, traffic_split=0.5):
        self.experiments[exp_id] = {
            'control': control_model,
            'treatment': treatment_model,
            'split': traffic_split
        }

    def get_model(self, exp_id, user_id):
        exp = self.experiments[exp_id]
        # Stable hashing: the same user always gets the same variant
        user_hash = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
        bucket = (user_hash % 100) / 100.0
        if bucket < exp['split']:
            return exp['treatment']
        return exp['control']

# Usage
router = ABTestRouter()
router.create_experiment(
    exp_id='model_v2_test',
    control_model=model_v1,
    treatment_model=model_v2,
    traffic_split=0.1  # 10% treatment, 90% control
)

def predict(user_id, features):
    model = router.get_model('model_v2_test', user_id)
    return model.predict(features)
```
Statistical Significance: After 1 week, analyze results:
```python
from scipy import stats

def analyze_ab_test(control_metrics, treatment_metrics):
    # Two-sample t-test for statistical significance
    t_stat, p_value = stats.ttest_ind(
        control_metrics['ctr'],
        treatment_metrics['ctr']
    )
    if p_value < 0.05:
        improvement = (
            treatment_metrics['ctr'].mean() - control_metrics['ctr'].mean()
        ) / control_metrics['ctr'].mean()
        print(f"Treatment is {improvement*100:.1f}% better (p={p_value:.4f})")
    else:
        print("No significant difference")
```
20. Deep Dive: Multi-Armed Bandits for Dynamic Model Selection
Problem: A/B tests are slow (weeks to reach significance). Can we adapt faster?
Solution: Multi-Armed Bandits (Thompson Sampling).
Concept:
- Start with equal traffic to all models.
- Gradually shift traffic to the best-performing model.
- Result: Maximize reward while exploring.
Implementation:
```python
import numpy as np

class ThompsonSamplingBandit:
    def __init__(self, n_models=3):
        # Beta distribution parameters (successes, failures)
        self.alpha = np.ones(n_models)  # Prior: 1 success
        self.beta = np.ones(n_models)   # Prior: 1 failure

    def select_model(self):
        # Draw one sample from each model's Beta posterior; serve the best draw
        samples = [
            np.random.beta(self.alpha[i], self.beta[i])
            for i in range(len(self.alpha))
        ]
        return int(np.argmax(samples))

    def update(self, model_id, reward):
        if reward == 1:  # Success (user clicked)
            self.alpha[model_id] += 1
        else:            # Failure
            self.beta[model_id] += 1

# Usage
bandit = ThompsonSamplingBandit(n_models=3)
for _ in range(1000):  # 1000 requests
    model_id = bandit.select_model()
    prediction = models[model_id].predict(features)
    reward = get_user_feedback()  # click = 1, no click = 0
    bandit.update(model_id, reward)

print("Estimated success rate per model:")
for i in range(3):
    print(f"Model {i + 1}: {bandit.alpha[i] / (bandit.alpha[i] + bandit.beta[i]):.3f}")
```
Benefit: Converges to the best model in days instead of weeks.
Key Takeaways
- Replication is Essential: Single-server deployments are fragile.
- Active-Active is Preferred: Maximizes resource utilization and throughput.
- Canary Deployments: Reduce risk when rolling out new models.
- Stateless is Simpler: Stateful models require sticky sessions or shared state stores.
- Kubernetes is King: Industry standard for orchestrating model replicas.
Summary
| Aspect | Insight |
|---|---|
| Goal | High availability, low latency, fault tolerance |
| Strategies | Active-Active, Active-Passive, Geo-Replication |
| Deployment | Canary, Blue-Green, Rolling Updates |
| Challenges | State management, version sync, cross-region costs |
FAQ
What is the difference between active-active and active-passive model replication?
Active-active replication has all replicas serving traffic simultaneously, maximizing throughput and resource utilization. Active-passive has one primary serving traffic while standbys remain idle, providing simpler implementation but wasting resources. Active-active is preferred for production ML systems because inference is typically stateless, making synchronization trivial.
How does canary deployment reduce risk when updating ML models?
Canary deployment routes only 1 percent of traffic to the new model while 99 percent continues on the old model. If the canary shows healthy metrics (error rate, latency, and business metrics like CTR), traffic is gradually increased to 10, 50, then 100 percent. If any metric degrades, an automatic rollback routes all traffic back to the previous version immediately.
How do you ensure all model replicas serve the same version?
Use a model registry that tracks version, timestamp, accuracy metrics, and deployment status for every model. Deploy atomically by having all replicas pull the same version from a central store like S3, verified by SHA-256 checksums. Event-driven deployment via Kafka provides real-time update notifications, while polling with jitter prevents thundering herd problems when many replicas check for updates simultaneously.
How do you handle stateful models like chatbots across replicas?
Two approaches exist. Sticky sessions route all requests from the same user to the same replica using IP hashing, but sessions are lost if a replica dies. The preferred approach is a shared state store like Redis where all replicas read and write session state, keeping the replicas themselves stateless. This allows routing any request to any replica while maintaining conversation context.
Originally published at: arunbaby.com/ml-system-design/0033-model-replication-systems