Model Evaluation Metrics
How to measure whether your ML model is actually good: choosing the right metric is as important as building the model itself.
Introduction
Model evaluation metrics are quantitative measures of model performance. Choosing the wrong metric can lead to models that optimize for the wrong objective.
Why metrics matter:
- Define success: What does “good” mean for your model?
- Compare models: Which of 10 models should you deploy?
- Monitor production: Detect when a model's performance degrades
- Align with business: ML metrics must connect to business KPIs
What you’ll learn:
- Classification metrics (accuracy, precision, recall, F1, ROC-AUC)
- Regression metrics (MSE, MAE, R²)
- Ranking metrics (NDCG, MAP, MRR)
- Choosing the right metric for your problem
- Production monitoring strategies
Classification Metrics
Binary Classification
Confusion Matrix: Foundation of all classification metrics.
                    Predicted
                    Pos     Neg
Actual     Pos      TP      FN
           Neg      FP      TN
TP: True Positive - Correctly predicted positive
TN: True Negative - Correctly predicted negative
FP: False Positive - Incorrectly predicted positive (Type I error)
FN: False Negative - Incorrectly predicted negative (Type II error)
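To make these four quantities concrete, here is a quick sketch that counts them directly with NumPy, using the same toy labels as the accuracy example below:
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # correctly predicted positives
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # correctly predicted negatives
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # Type I errors
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # Type II errors

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1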
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
When to use: Balanced datasets
When NOT to use: Imbalanced datasets
Example:
from sklearn.metrics import accuracy_score
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy:.2%}") # 75.00%
Accuracy Paradox:
# Dataset: 95% negative, 5% positive (highly imbalanced)
# Model always predicts negative → 95% accurate!
# But useless for detecting positive class
Precision
Precision = TP / (TP + FP)
Interpretation: Of all positive predictions, how many were actually positive?
When to use: Cost of false positives is high
Example: Email spam detection (don’t mark legitimate emails as spam)
Recall (Sensitivity, True Positive Rate)
Recall = TP / (TP + FN)
Interpretation: Of all actual positives, how many did we detect?
When to use: Cost of false negatives is high
Example: Cancer detection (don’t miss actual cases)
F1 Score
F1 = 2 * (Precision * Recall) / (Precision + Recall)
Interpretation: Harmonic mean of precision and recall
When to use: Need balance between precision and recall
Implementation:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
# Compute metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1 Score: {f1:.2%}")
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"Confusion Matrix:\n{cm}")
ROC Curve & AUC
ROC (Receiver Operating Characteristic): Plot of True Positive Rate vs False Positive Rate at different thresholds.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
import numpy as np
# Predicted probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
# Compute AUC
auc = roc_auc_score(y_true, y_scores)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
print(f"AUC: {auc:.3f}")
AUC Interpretation:
- 1.0: Perfect classifier
- 0.5: Random classifier
- < 0.5: Worse than random (inverted predictions)
When to use AUC: When you want threshold-independent performance measure
Precision-Recall Curve
Better than ROC for imbalanced datasets.
from sklearn.metrics import precision_recall_curve, average_precision_score
import numpy as np
# Compute precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# Average precision
avg_precision = average_precision_score(y_true, y_scores)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'PR Curve (AP = {avg_precision:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()
Multi-Class Classification
Macro vs Micro Averaging:
from sklearn.metrics import classification_report
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 1, 2, 0, 2, 2]
# Classification report
report = classification_report(y_true, y_pred, target_names=['Class A', 'Class B', 'Class C'])
print(report)
Macro Average: Average of per-class metrics (treats all classes equally)
Micro Average: Aggregate TP, FP, FN across all classes (favors frequent classes)
Weighted Average: Weighted by class frequency
When to use which:
- Macro: All classes equally important
- Micro: Overall performance across all predictions
- Weighted: Account for class imbalance
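To see the difference concretely, here is a short sketch computing all three averages with f1_score on the same toy labels as above:
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 1, 2, 0, 2, 2]

# Macro: unweighted mean of per-class F1 scores
print(f"Macro F1:    {f1_score(y_true, y_pred, average='macro'):.3f}")
# Micro: global F1 computed from pooled TP/FP/FN
print(f"Micro F1:    {f1_score(y_true, y_pred, average='micro'):.3f}")
# Weighted: per-class F1 weighted by class support
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.3f}")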
Regression Metrics
Mean Squared Error (MSE)
MSE = (1/n) * Σ(y_true - y_pred)²
Properties:
- Penalizes large errors heavily (squared term)
- Always non-negative
- Same units as y²
from sklearn.metrics import mean_squared_error
import numpy as np
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:.4f}")
Root Mean Squared Error (RMSE)
RMSE = √MSE
Properties:
- Same units as y (interpretable)
- Sensitive to outliers
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.4f}")
Mean Absolute Error (MAE)
MAE = (1/n) * Σ|y_true - y_pred|
Properties:
- Linear penalty (all errors weighted equally)
- More robust to outliers than MSE
- Same units as y
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: {mae:.4f}")
MSE vs MAE:
- Use MSE when large errors are especially bad
- Use MAE when all errors have equal weight
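A quick illustration of this difference: a single large error inflates MSE far more than MAE (toy numbers chosen for illustration):
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0, 5.0]
y_good = [2.5, 0.0, 2.0, 8.0, 5.5]      # small errors everywhere
y_outlier = [2.5, 0.0, 2.0, 8.0, 15.0]  # one large error on the last point

for name, y_pred in [("small errors", y_good), ("one outlier", y_outlier)]:
    mse = mean_squared_error(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name:>12}: MSE={mse:.2f}, MAE={mae:.2f}")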
R² Score (Coefficient of Determination)
R² = 1 - (SS_res / SS_tot)
where:
SS_res = Σ(y_true - y_pred)² (residual sum of squares)
SS_tot = Σ(y_true - y_mean)² (total sum of squares)
Interpretation:
- 1.0: Perfect predictions
- 0.0: Model performs no better than always predicting the mean
- < 0.0: Model performs worse than always predicting the mean
from sklearn.metrics import r2_score
import numpy as np
r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.4f}")
Mean Absolute Percentage Error (MAPE)
MAPE = (100/n) * Σ|((y_true - y_pred) / y_true)|
When to use: When relative error matters more than absolute error
Caveat: Undefined when y_true = 0
def mean_absolute_percentage_error(y_true, y_pred):
"""
MAPE implementation
Warning: Undefined when y_true contains zeros
"""
y_true, y_pred = np.array(y_true), np.array(y_pred)
# Avoid division by zero
non_zero_mask = y_true != 0
if not np.any(non_zero_mask):
return np.inf
return np.mean(np.abs((y_true[non_zero_mask] - y_pred[non_zero_mask]) / y_true[non_zero_mask])) * 100
y_true = [100, 200, 150, 300]
y_pred = [110, 190, 160, 280]
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE: {mape:.2f}%")
Ranking Metrics
For recommendation systems, search engines, etc.
Normalized Discounted Cumulative Gain (NDCG)
Measures quality of ranking where position matters.
from sklearn.metrics import ndcg_score
# Relevance scores for each item (higher = more relevant)
# Order matters: first item is ranked first, etc.
y_true = [[3, 2, 3, 0, 1, 2]] # True relevance
y_pred = [[2.8, 1.9, 2.5, 0.1, 1.2, 1.8]] # Predicted scores
# NDCG@k for different k values
for k in [3, 5, None]: # None means all items
ndcg = ndcg_score(y_true, y_pred, k=k)
label = f"NDCG@{k if k else 'all'}"
print(f"{label}: {ndcg:.4f}")
Interpretation:
- 1.0: Perfect ranking
- 0.0: Worst possible ranking
When to use: Position-aware ranking (search, recommendations)
Mean Average Precision (MAP)
def average_precision(y_true, y_scores):
"""
Compute Average Precision
Args:
y_true: Binary relevance (1 = relevant, 0 = not relevant)
y_scores: Predicted scores
Returns:
Average precision
"""
# Sort by scores (descending)
sorted_indices = np.argsort(y_scores)[::-1]
y_true_sorted = np.array(y_true)[sorted_indices]
# Compute precision at each relevant item
precisions = []
num_relevant = 0
for i, is_relevant in enumerate(y_true_sorted, 1):
if is_relevant:
num_relevant += 1
precision_at_i = num_relevant / i
precisions.append(precision_at_i)
if not precisions:
return 0.0
return np.mean(precisions)
# Example
y_true = [1, 0, 1, 0, 1, 0]
y_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
ap = average_precision(y_true, y_scores)
print(f"Average Precision: {ap:.4f}")
Mean Reciprocal Rank (MRR)
Measures where the first relevant item appears.
MRR = (1/|Q|) * Σ(1 / rank_i)
where rank_i is the rank of first relevant item for query i
def mean_reciprocal_rank(y_true_queries, y_pred_queries):
"""
Compute MRR across multiple queries
Args:
y_true_queries: List of relevance lists (one per query)
y_pred_queries: List of score lists (one per query)
Returns:
MRR score
"""
reciprocal_ranks = []
for y_true, y_scores in zip(y_true_queries, y_pred_queries):
# Sort by scores
sorted_indices = np.argsort(y_scores)[::-1]
y_true_sorted = np.array(y_true)[sorted_indices]
# Find first relevant item
for rank, is_relevant in enumerate(y_true_sorted, 1):
if is_relevant:
reciprocal_ranks.append(1.0 / rank)
break
else:
# No relevant item found
reciprocal_ranks.append(0.0)
return np.mean(reciprocal_ranks)
# Example: 3 queries
y_true_queries = [
[0, 1, 0, 1], # Query 1: first relevant at position 2
[1, 0, 0, 0], # Query 2: first relevant at position 1
[0, 0, 1, 0], # Query 3: first relevant at position 3
]
y_pred_queries = [
[0.2, 0.8, 0.3, 0.9],
[0.9, 0.1, 0.2, 0.3],
[0.1, 0.2, 0.9, 0.3],
]
mrr = mean_reciprocal_rank(y_true_queries, y_pred_queries)
print(f"MRR: {mrr:.4f}")
Choosing the Right Metric
Decision Framework
class MetricSelector:
"""
Help choose appropriate metric based on problem characteristics
"""
def recommend_metric(
self,
task_type: str,
class_balance: str = 'balanced',
business_priority: str = None
) -> list[str]:
"""
Recommend metrics based on problem characteristics
Args:
task_type: 'binary_classification', 'multiclass', 'regression', 'ranking'
class_balance: 'balanced', 'imbalanced'
business_priority: 'precision', 'recall', 'both', None
Returns:
List of recommended metrics
"""
recommendations = []
if task_type == 'binary_classification':
if class_balance == 'balanced':
recommendations.append('Accuracy')
recommendations.append('ROC-AUC')
else:
recommendations.append('Precision-Recall AUC')
recommendations.append('F1 Score')
if business_priority == 'precision':
recommendations.append('Precision (optimize threshold)')
elif business_priority == 'recall':
recommendations.append('Recall (optimize threshold)')
elif business_priority == 'both':
recommendations.append('F1 Score')
elif task_type == 'multiclass':
recommendations.append('Macro F1 (if classes equally important)')
recommendations.append('Weighted F1 (if accounting for imbalance)')
recommendations.append('Confusion Matrix (for detailed analysis)')
elif task_type == 'regression':
recommendations.append('RMSE (if penalizing large errors)')
recommendations.append('MAE (if robust to outliers)')
recommendations.append('R² (for explained variance)')
elif task_type == 'ranking':
recommendations.append('NDCG (for position-aware ranking)')
recommendations.append('MAP (for information retrieval)')
recommendations.append('MRR (for first relevant item)')
return recommendations
# Usage
selector = MetricSelector()
# Example 1: Fraud detection (imbalanced, recall critical)
metrics = selector.recommend_metric(
task_type='binary_classification',
class_balance='imbalanced',
business_priority='recall'
)
print("Fraud detection metrics:", metrics)
# Example 2: Search ranking
metrics = selector.recommend_metric(
task_type='ranking'
)
print("Search ranking metrics:", metrics)
Production Monitoring
Metric Tracking System
import time
import numpy as np
from collections import deque
from typing import Dict
class MetricTracker:
"""
Track metrics over time in production
Use case: Monitor model performance degradation
"""
def __init__(self, window_size=1000):
self.window_size = window_size
# Sliding windows for predictions and actuals
self.predictions = deque(maxlen=window_size)
self.actuals = deque(maxlen=window_size)
self.timestamps = deque(maxlen=window_size)
# Historical metrics
self.metric_history = {
'accuracy': [],
'precision': [],
'recall': [],
'f1': [],
'timestamp': []
}
def log_prediction(self, y_true, y_pred):
"""
Log a prediction and its actual outcome
"""
self.predictions.append(y_pred)
self.actuals.append(y_true)
self.timestamps.append(time.time())
def compute_current_metrics(self) -> Dict:
"""
Compute metrics over current window
"""
if len(self.predictions) < 10:
return {}
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
try:
metrics = {
'accuracy': accuracy_score(self.actuals, self.predictions),
'precision': precision_score(self.actuals, self.predictions, zero_division=0),
'recall': recall_score(self.actuals, self.predictions, zero_division=0),
'f1': f1_score(self.actuals, self.predictions, zero_division=0),
'sample_count': len(self.predictions)
}
# Save to history
for metric_name, value in metrics.items():
if metric_name != 'sample_count':
self.metric_history[metric_name].append(value)
self.metric_history['timestamp'].append(time.time())
return metrics
except Exception as e:
print(f"Error computing metrics: {e}")
return {}
def detect_degradation(self, baseline_metric: str = 'f1', threshold: float = 0.05) -> bool:
"""
Detect if model performance has degraded
Args:
baseline_metric: Metric to monitor
threshold: Alert if metric drops by this much from baseline
Returns:
True if degradation detected
"""
history = self.metric_history.get(baseline_metric, [])
if len(history) < 10:
return False
# Compare recent average to baseline (first 10% of history)
baseline_size = max(10, len(history) // 10)
baseline_avg = np.mean(history[:baseline_size])
recent_avg = np.mean(history[-baseline_size:])
degradation = baseline_avg - recent_avg
return degradation > threshold
# Usage
tracker = MetricTracker(window_size=1000)
# Simulate predictions over time
for i in range(1500):
# Simulate ground truth and prediction
y_true = np.random.choice([0, 1], p=[0.7, 0.3])
# Simulate model getting worse over time
accuracy_degradation = min(0.1, i / 10000)
if np.random.random() < (0.8 - accuracy_degradation):
y_pred = y_true
else:
y_pred = 1 - y_true
tracker.log_prediction(y_true, y_pred)
# Compute metrics every 100 predictions
if i % 100 == 0 and i > 0:
metrics = tracker.compute_current_metrics()
if metrics:
print(f"Step {i}: F1 = {metrics['f1']:.3f}")
if tracker.detect_degradation():
print(f"⚠️ WARNING: Model degradation detected at step {i}")
Model Calibration
Calibration: How well predicted probabilities match actual outcomes.
Example of poor calibration:
# Model predicts 80% probability for 100 samples
# Only 40 of them are actually positive
# Model is overconfident! (80% predicted vs 40% actual)
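This check can be run directly: compare the mean predicted probability against the observed positive rate (a minimal sketch with simulated scores mirroring the overconfident scenario above):
import numpy as np

rng = np.random.default_rng(0)

# Simulate 100 samples where the model predicts ~0.8 probability,
# but only ~40% of them are actually positive (overconfident model)
y_prob = np.full(100, 0.8)
y_true = rng.binomial(1, 0.4, size=100)

mean_predicted = y_prob.mean()
observed_rate = y_true.mean()
print(f"Mean predicted probability: {mean_predicted:.2f}")
print(f"Observed positive rate:     {observed_rate:.2f}")
print(f"Calibration gap:            {mean_predicted - observed_rate:+.2f}")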
Calibration Plot
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
def plot_calibration_curve(y_true, y_prob, n_bins=10):
"""
Plot calibration curve
A well-calibrated model's curve follows the diagonal
"""
prob_true, prob_pred = calibration_curve(
y_true,
y_prob,
n_bins=n_bins,
strategy='uniform'
)
plt.figure(figsize=(8, 6))
plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect Calibration')
plt.xlabel('Mean Predicted Probability')
plt.ylabel('Fraction of Positives')
plt.title('Calibration Plot')
plt.legend()
plt.grid(True)
plt.show()
# Example
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1] * 10 # 100 samples
y_prob = [0.2, 0.7, 0.8, 0.3, 0.9, 0.1, 0.6, 0.85, 0.15, 0.75] * 10
plot_calibration_curve(y_true, y_prob)
Calibrating Models
Some models (e.g., SVMs, tree ensembles) output poorly calibrated probabilities.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
# Train base model
base_model = RandomForestClassifier()
base_model.fit(X_train, y_train)
# Calibrate predictions (with cv=5, the base model is cloned and refit internally)
calibrated_model = CalibratedClassifierCV(
base_model,
method='sigmoid', # or 'isotonic'
cv=5
)
calibrated_model.fit(X_train, y_train)
# Now probabilities are better calibrated
y_prob_calibrated = calibrated_model.predict_proba(X_test)[:, 1]
Calibration methods:
- Platt scaling (sigmoid): Fits logistic regression on predictions
- Isotonic regression: Non-parametric, more flexible but needs more data
Threshold Tuning
Classification models output probabilities; choosing the decision threshold determines the precision/recall trade-off.
Finding Optimal Threshold
import numpy as np
from sklearn.metrics import precision_recall_curve, f1_score
def find_optimal_threshold(y_true, y_prob, metric='f1'):
"""
Find threshold that maximizes a metric
Args:
y_true: True labels
y_prob: Predicted probabilities
metric: 'f1', 'precision', 'recall', or custom function
Returns:
optimal_threshold, best_score
"""
if metric == 'f1':
# Compute F1 at different thresholds
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
f1_scores = 2 * (precision * recall) / (precision + recall + 1e-10)
best_idx = np.argmax(f1_scores)
return thresholds[best_idx] if best_idx < len(thresholds) else 0.5, f1_scores[best_idx]
    elif metric == 'precision':
        precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
        # precision/recall have one more entry than thresholds; drop the last point so masks align
        precision, recall = precision[:-1], recall[:-1]
        # Find threshold for minimum acceptable recall (e.g., 0.8)
        min_recall = 0.8
        valid_idx = recall >= min_recall
        if not np.any(valid_idx):
            return None, 0
        best_idx = np.argmax(precision[valid_idx])
        return thresholds[valid_idx][best_idx], precision[valid_idx][best_idx]
    elif metric == 'recall':
        precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
        precision, recall = precision[:-1], recall[:-1]
        # Find threshold for minimum acceptable precision (e.g., 0.9)
        min_precision = 0.9
        valid_idx = precision >= min_precision
        if not np.any(valid_idx):
            return None, 0
        best_idx = np.argmax(recall[valid_idx])
        return thresholds[valid_idx][best_idx], recall[valid_idx][best_idx]
# Example
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.7, 0.8, 0.3, 0.9, 0.1, 0.6, 0.85, 0.15, 0.75])
optimal_threshold, best_f1 = find_optimal_threshold(y_true, y_prob, metric='f1')
print(f"Optimal threshold: {optimal_threshold:.3f}, Best F1: {best_f1:.3f}")
Threshold Selection Strategies
1. Maximize F1 Score
- Balanced precision and recall
- Good default choice
2. Business-Driven
# Example: Fraud detection
# False negative (missed fraud) costs $500
# False positive (declined legit transaction) costs $10
def business_value_threshold(y_true, y_prob, fn_cost=500, fp_cost=10):
"""
Find threshold that maximizes business value
"""
best_threshold = 0.5
best_value = float('-inf')
for threshold in np.arange(0.1, 0.9, 0.01):
y_pred = (y_prob >= threshold).astype(int)
# Compute confusion matrix
tn = ((y_true == 0) & (y_pred == 0)).sum()
fp = ((y_true == 0) & (y_pred == 1)).sum()
fn = ((y_true == 1) & (y_pred == 0)).sum()
tp = ((y_true == 1) & (y_pred == 1)).sum()
# Business value = savings from catching fraud - cost of false alarms
value = tp * fn_cost - fp * fp_cost
if value > best_value:
best_value = value
best_threshold = threshold
return best_threshold, best_value
threshold, value = business_value_threshold(y_true, y_prob)
print(f"Best threshold: {threshold:.2f}, Business value: ${value:.2f}")
3. Operating Point Selection
# Healthcare: Prioritize recall (don't miss diseases)
# Set minimum recall = 0.95, maximize precision subject to that
def threshold_for_min_recall(y_true, y_prob, min_recall=0.95):
"""Find threshold that achieves minimum recall while maximizing precision"""
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
valid_indices = recall >= min_recall
if not any(valid_indices):
return None
best_precision_idx = np.argmax(precision[valid_indices])
threshold_idx = np.where(valid_indices)[0][best_precision_idx]
return thresholds[threshold_idx] if threshold_idx < len(thresholds) else 0.0
Handling Imbalanced Datasets
Why Standard Metrics Fail
# Dataset: 99% negative, 1% positive
y_true = [0] * 990 + [1] * 10
y_pred_dummy = [0] * 1000 # Always predict negative
from sklearn.metrics import accuracy_score, precision_score, recall_score
print(f"Accuracy: {accuracy_score(y_true, y_pred_dummy):.1%}") # 99%!
print(f"Precision: {precision_score(y_true, y_pred_dummy, zero_division=0):.1%}") # Undefined (0/0)
print(f"Recall: {recall_score(y_true, y_pred_dummy):.1%}") # 0%
Accuracy is 99% but model is useless!
Better Metrics for Imbalanced Data
1. Precision-Recall AUC
Better than ROC-AUC for imbalanced data because it doesn’t include TN (which dominates in imbalanced datasets).
from sklearn.metrics import average_precision_score
# y_scores: the model's predicted probabilities for the same samples as y_true
ap = average_precision_score(y_true, y_scores)
print(f"Average Precision: {ap:.3f}")
2. Cohen’s Kappa
Measures agreement between predicted and actual, adjusted for chance.
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(y_true, y_pred)
print(f"Cohen's Kappa: {kappa:.3f}")
# Interpretation:
# < 0: No agreement
# 0-0.20: Slight
# 0.21-0.40: Fair
# 0.41-0.60: Moderate
# 0.61-0.80: Substantial
# 0.81-1.0: Almost perfect
3. Matthews Correlation Coefficient (MCC)
Takes all four confusion matrix values into account. Ranges from -1 to +1.
from sklearn.metrics import matthews_corrcoef
mcc = matthews_corrcoef(y_true, y_pred)
print(f"MCC: {mcc:.3f}")
# Interpretation:
# +1: Perfect prediction
# 0: Random prediction
# -1: Perfect inverse prediction
4. Class-Weighted Metrics
from sklearn.metrics import fbeta_score
# Emphasize recall (beta > 1) for imbalanced positive class
f2 = fbeta_score(y_true, y_pred, beta=2) # Recall weighted 2x more than precision
print(f"F2 Score: {f2:.3f}")
Sampling Strategies
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as ImbPipeline
# Combine over-sampling and under-sampling
pipeline = ImbPipeline([
('oversample', SMOTE(sampling_strategy=0.5)), # Increase minority to 50% of majority
('undersample', RandomUnderSampler(sampling_strategy=1.0)) # Balance classes
])
X_resampled, y_resampled = pipeline.fit_resample(X_train, y_train)
Aligning ML Metrics with Business KPIs
Example 1: E-commerce Recommendation System
ML Metrics:
- Precision@10: 0.65
- Recall@10: 0.45
- NDCG@10: 0.72
Business KPIs:
- Click-through rate (CTR): 3.5%
- Conversion rate: 1.2%
- Revenue per user: $45
Alignment:
class BusinessMetricTracker:
"""
Track both ML and business metrics
Use case: Connect model performance to business impact
"""
def __init__(self):
self.ml_metrics = {}
self.business_metrics = {}
self.correlations = {}
def log_session(
self,
ml_metrics: dict,
business_metrics: dict
):
"""Log metrics for a user session"""
for metric, value in ml_metrics.items():
if metric not in self.ml_metrics:
self.ml_metrics[metric] = []
self.ml_metrics[metric].append(value)
for metric, value in business_metrics.items():
if metric not in self.business_metrics:
self.business_metrics[metric] = []
self.business_metrics[metric].append(value)
def compute_correlations(self):
"""Compute correlation between ML and business metrics"""
import numpy as np
from scipy.stats import pearsonr
for ml_metric in self.ml_metrics:
for biz_metric in self.business_metrics:
ml_values = np.array(self.ml_metrics[ml_metric])
biz_values = np.array(self.business_metrics[biz_metric])
if len(ml_values) == len(biz_values):
corr, p_value = pearsonr(ml_values, biz_values)
self.correlations[(ml_metric, biz_metric)] = {
'correlation': corr,
'p_value': p_value
}
return self.correlations
# Usage
tracker = BusinessMetricTracker()
# Log multiple sessions
for _ in range(100):
tracker.log_session(
ml_metrics={'precision': np.random.uniform(0.6, 0.7)},
business_metrics={'ctr': np.random.uniform(0.03, 0.04)}
)
correlations = tracker.compute_correlations()
print("ML Metric ↔ Business KPI Correlations:")
for (ml, biz), stats in correlations.items():
print(f"{ml} ↔ {biz}: r={stats['correlation']:.3f}, p={stats['p_value']:.3f}")
Example 2: Content Moderation
ML Metrics:
- Precision: 0.92 (92% of flagged content is actually bad)
- Recall: 0.78 (catch 78% of bad content)
Business KPIs:
- User reports: How many users still report bad content?
- User retention: Are false positives causing users to leave?
- Moderator workload: Hours spent reviewing flagged content
Trade-off:
- High recall → More bad content caught → Fewer user reports ✓
- But also → More false positives → Higher moderator workload ✗
def estimate_moderator_cost(precision, recall, daily_content, hourly_rate=50):
"""
Estimate cost of content moderation
Args:
precision: Model precision
recall: Model recall
daily_content: Number of content items per day
hourly_rate: Cost per moderator hour
Returns:
Daily moderation cost
"""
# Assume 1% of content is actually bad
bad_content = daily_content * 0.01
# Content flagged by model
flagged = (bad_content * recall) / precision
# Time to review (assume 30 seconds per item)
review_hours = (flagged * 30) / 3600
# Cost
cost = review_hours * hourly_rate
return cost, review_hours
# Compare different models
models = [
{'name': 'Conservative', 'precision': 0.95, 'recall': 0.70},
{'name': 'Balanced', 'precision': 0.90, 'recall': 0.80},
{'name': 'Aggressive', 'precision': 0.85, 'recall': 0.90}
]
for model in models:
cost, hours = estimate_moderator_cost(
model['precision'],
model['recall'],
daily_content=100000
)
print(f"{model['name']}: ${cost:.2f}/day, {hours:.1f} hours/day")
Common Pitfalls
Pitfall 1: Data Leakage in Evaluation
# WRONG: Fit preprocessing on entire dataset
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Leakage! Test data info leaks into training
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# CORRECT: Fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on train only
X_test_scaled = scaler.transform(X_test) # Transform test
Pitfall 2: Using Wrong Metric for Problem
# Wrong: Using accuracy for imbalanced fraud detection
# Fraud rate: 0.1%, model always predicts "not fraud"
# Accuracy: 99.9% ✓ (misleading!)
# Recall: 0% ✗ (useless!)
# Right: Use precision-recall, F1, or PR-AUC
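A runnable version of this pitfall, using simulated data with a 0.1% positive rate, shows how a threshold-free metric such as average precision exposes what accuracy hides:
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(1)

# Simulated fraud data: 0.1% positive class
y_true = np.zeros(10000, dtype=int)
y_true[:10] = 1
rng.shuffle(y_true)

# A "model" that always predicts not-fraud, with uninformative low scores
y_pred = np.zeros(10000, dtype=int)
y_scores = rng.uniform(0.0, 0.1, size=10000)

print(f"Accuracy:          {accuracy_score(y_true, y_pred):.3%}")             # looks great
print(f"Average precision: {average_precision_score(y_true, y_scores):.3f}")  # exposes the problem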
Pitfall 3: Ignoring Confidence Intervals
# Model A: Accuracy = 85.2%
# Model B: Accuracy = 85.5%
# Is B really better? Need confidence intervals!
from scipy import stats
def accuracy_confidence_interval(y_true, y_pred, confidence=0.95):
"""Compute confidence interval for accuracy"""
n = len(y_true)
y_true = np.array(y_true)
y_pred = np.array(y_pred)
accuracy = (y_true == y_pred).sum() / n
# Wilson score interval
z = stats.norm.ppf((1 + confidence) / 2)
denominator = 1 + z**2 / n
center = (accuracy + z**2 / (2*n)) / denominator
margin = z * np.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denominator
return center - margin, center + margin
import numpy as np
# Example toy predictions for illustration
y_true_a = np.random.randint(0, 2, size=1000)
y_pred_a = np.random.randint(0, 2, size=1000)
y_true_b = np.random.randint(0, 2, size=1000)
y_pred_b = np.random.randint(0, 2, size=1000)
ci_a = accuracy_confidence_interval(y_true_a, y_pred_a)
acc_a = (y_true_a == y_pred_a).mean() * 100
print(f"Model A: {acc_a:.1f}% [{ci_a[0]*100:.1f}%, {ci_a[1]*100:.1f}%]")
ci_b = accuracy_confidence_interval(y_true_b, y_pred_b)
acc_b = (y_true_b == y_pred_b).mean() * 100
print(f"Model B: {acc_b:.1f}% [{ci_b[0]*100:.1f}%, {ci_b[1]*100:.1f}%]")
# If intervals overlap significantly, difference may not be meaningful
Pitfall 4: Overfitting to Validation Set
# WRONG: Repeatedly tuning on same validation set
for _ in range(100): # Many iterations
model = train_model(X_train, y_train, hyperparams)
val_score = evaluate(model, X_val, y_val)
hyperparams = adjust_based_on_score(val_score) # Overfitting to val!
# CORRECT: Use nested cross-validation or holdout test set
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2)
# Tune on train_full (with inner CV)
best_model = grid_search_cv(X_train_full, y_train_full)
# Evaluate ONCE on test set
final_score = evaluate(best_model, X_test, y_test)
Connection to Speech Systems
Model evaluation principles apply directly to speech/audio ML systems:
TTS Quality Metrics
Objective Metrics:
- Mel Cepstral Distortion (MCD): Similar to MSE for regression
- F0 RMSE: Pitch prediction error
- Duration Accuracy: Similar to classification metrics for boundary detection
Subjective Metrics:
- Mean Opinion Score (MOS): Like human evaluation for content moderation
- Must have confidence intervals: Just like accuracy CIs above
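For MOS, the confidence interval is typically a simple t-interval over the listener ratings; a minimal sketch with made-up ratings:
import numpy as np
from scipy import stats

# Hypothetical 1-5 listener ratings for one TTS system
ratings = np.array([4, 5, 3, 4, 4, 5, 3, 4, 4, 3, 5, 4, 4, 3, 4])

mos = ratings.mean()
sem = stats.sem(ratings)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(ratings) - 1, loc=mos, scale=sem)

print(f"MOS: {mos:.2f} (95% CI: [{ci_low:.2f}, {ci_high:.2f}])")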
ASR Error Metrics
Word Error Rate (WER):
WER = (S + D + I) / N
S: Substitutions
D: Deletions
I: Insertions
N: Total words in reference
Similar to precision/recall trade-off:
- High substitutions → Low precision (predicting wrong words)
- High deletions → Low recall (missing words)
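A minimal WER sketch using word-level edit distance (a standard dynamic-programming implementation, not tied to any particular ASR toolkit):
import numpy as np

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)  # all deletions
    d[0, :] = np.arange(len(hyp) + 1)  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost) # match / substitution
    return d[len(ref), len(hyp)] / len(ref)

reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # one deletion out of six words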
Speaker Verification
Uses same binary classification metrics:
- EER (Equal Error Rate): Point where FPR = FNR
- DCF (Detection Cost Function): Business-driven threshold (like threshold tuning above)
def compute_eer(y_true, y_scores):
"""
Compute Equal Error Rate for speaker verification
Similar to finding optimal threshold
"""
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
fnr = 1 - tpr
# Find where FPR ≈ FNR
eer_idx = np.argmin(np.abs(fpr - fnr))
eer = (fpr[eer_idx] + fnr[eer_idx]) / 2
return eer, thresholds[eer_idx]
# Example: Speaker verification scores
y_true = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
y_scores = [0.9, 0.85, 0.7, 0.4, 0.3, 0.2, 0.8, 0.75, 0.5, 0.35]
eer, eer_threshold = compute_eer(y_true, y_scores)
print(f"EER: {eer:.2%} at threshold {eer_threshold:.3f}")
Key Takeaways
✅ No single best metric - choice depends on problem and business context
✅ Accuracy misleading for imbalanced datasets - use precision/recall/F1
✅ ROC-AUC good for threshold-independent evaluation
✅ Precision-Recall better than ROC for imbalanced data
✅ Regression metrics - MSE for outlier sensitivity, MAE for robustness
✅ Ranking metrics - NDCG for position-aware, MRR for first relevant item
✅ Production monitoring - track metrics over time to detect degradation
✅ Align with business - metrics must connect to business KPIs
Originally published at: arunbaby.com/ml-system-design/0006-model-evaluation-metrics
If you found this helpful, consider sharing it with others who might benefit.