
Not everything needs to be real-time. Sometimes, “tomorrow morning” is fast enough.

TL;DR

Batch processing handles 90 percent of ML workloads including training, feature engineering, reporting, and backfilling. The modern data stack combines Airflow for orchestration, Spark for distributed computation, and Parquet for columnar storage. Key engineering challenges include the small files problem (fix with compaction), data skew in joins (fix with salting or broadcast joins), late-arriving data (fix with watermarks or Delta Lake upserts), and ensuring idempotency for safe retries. See ML pipeline orchestration for the dependency management layer that sits on top of batch processing, and cost optimization for reducing compute spend.

[Image: A massive industrial pipeline system with valves]

The Case for Batch

In the age of Real-Time Streaming (Kafka, Flink), Batch Processing feels archaic. But 90% of ML workloads are still Batch.

  • Training: You train on a dataset (Batch), not a stream.
  • Reporting: “Daily Active Users” is a batch metric.
  • Backfilling: Re-computing history requires batch.

Batch vs. Stream:

  • Batch: High Throughput, High Latency. (Process 1TB in 1 hour).
  • Stream: Low Throughput, Low Latency. (Process 1 event in 10ms).

Architecture: The Modern Data Stack

  1. Ingestion: Fivetran / Airbyte. Pull data from Postgres/Salesforce into the Warehouse.
  2. Storage: Data Lake (S3/GCS) or Data Warehouse (Snowflake/BigQuery).
  3. Transformation: dbt (SQL) or Spark (Python).
  4. Orchestration: Airflow / Dagster / Prefect.

High-Level Architecture: The Modern Data Stack

+-------------+     +-------------+     +-------------+     +--------------+
|   Sources   | --> |  Ingestion  | --> |  Data Lake  | --> |  Warehouse   |
+-------------+     +-------------+     +-------------+     +--------------+
  (Postgres)          (Fivetran)          (S3 / GCS)          (Snowflake)
                                                                   |
                                                                   v
+-------------+     +-------------+     +-------------+     +--------------+
|  Dashboard  | <-- |   Serving   | <-- |  Transform  | <-- | Orchestrator |
+-------------+     +-------------+     +-------------+     +--------------+
   (Tableau)          (Redis/API)        (dbt / Spark)         (Airflow)

Deep Dive: Apache Airflow (Orchestration)

Airflow lets you define pipelines as code (Python). A DAG (Directed Acyclic Graph) is a collection of tasks with explicit dependencies.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def extract():
    print("Extracting data from S3...")

def transform():
    print("Running Spark Job...")

def load():
    print("Loading into Feature Store...")

    with DAG("daily_training_pipeline", start_date=datetime(2023, 1, 1)) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)

        t1 >> t2 >> t3 # Define dependencies

Key Concepts:

  • Scheduler: Monitors time and triggers DAGs.
  • Executor: Runs the tasks (Local, Celery, Kubernetes).
  • Backfill: Rerunning the DAG for past dates (e.g., “Run for all of 2022”).

Deep Dive: Apache Spark (Processing)

When data doesn’t fit in RAM (Pandas), you need Spark. Spark is a Distributed Computing Engine.

RDD (Resilient Distributed Dataset):

  • Distributed: Data is split into partitions across nodes.
  • Resilient: If a node fails, Spark rebuilds the partition using the lineage graph.
  • Lazy Evaluation: Transformations (map, filter) are not executed until an Action (count, save) is called.

PySpark Example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FeatureEng").getOrCreate()

# Read 1TB of logs
df = spark.read.json("s3://bucket/logs/*.json")

# Group By User and Count Clicks
user_features = df.groupBy("user_id").count()

# Write to Parquet
user_features.write.parquet("s3://bucket/features/")

System Design: Retraining Pipeline

Scenario: Design a system to retrain the Recommendation Model every week.

Components:

  1. Trigger: Airflow DAG runs every Sunday at 00:00.
  2. Data Prep: Spark job reads last 30 days of clicks from Data Lake. Joins with User Table. Saves training_data.parquet.
  3. Validation: Great Expectations checks for nulls/outliers.
  4. Training: SageMaker Training Job launches a GPU instance. Loads parquet. Trains XGBoost. Saves model.tar.gz.
  5. Evaluation: Load model. Predict on Holdout Set. If AUC > 0.8, tag as Production.
  6. Deployment: Update the SageMaker Endpoint to point to the new model artifact.
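
A minimal sketch of this flow as a weekly Airflow DAG (task bodies are stubs; the Spark, Great Expectations, and SageMaker calls are only indicated in comments):

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def data_prep():
    print("Spark: read last 30 days of clicks, join with users, write parquet")

def validate():
    print("Great Expectations: check nulls/outliers on training_data.parquet")

def train():
    print("SageMaker: launch XGBoost training job, save model.tar.gz")

def evaluate():
    print("Score holdout set; raise an exception if AUC <= 0.8 to block deploy")

def deploy():
    print("Point the SageMaker endpoint at the new model artifact")

with DAG("weekly_recsys_retraining",
         start_date=datetime(2023, 1, 1),
         schedule_interval="0 0 * * 0",  # every Sunday at 00:00
         catchup=False) as dag:
    prep = PythonOperator(task_id="data_prep", python_callable=data_prep)
    check = PythonOperator(task_id="validate", python_callable=validate)
    fit = PythonOperator(task_id="train", python_callable=train)
    score = PythonOperator(task_id="evaluate", python_callable=evaluate)
    ship = PythonOperator(task_id="deploy", python_callable=deploy)

    prep >> check >> fit >> score >> ship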

Engineering: The “Small Files” Problem

A common mistake in Data Lakes: Saving millions of tiny files (1KB each). Why is it bad?

  • S3 API Costs: You pay per PUT/GET request.
  • Spark Slowness: Listing millions of files takes forever. Opening a file has overhead.

Solution: Compaction. Run a nightly job to merge small files into larger files (128MB - 1GB). df.repartition(10).write...
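
A compaction pass might look like this in PySpark (paths and the target file count are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Compaction").getOrCreate()

# Read one fragmented partition (thousands of tiny files)
df = spark.read.parquet("s3://bucket/events/date=2023-01-01/")

# Rewrite it as ~10 larger files, then swap the compacted path into place
(df.repartition(10)
   .write
   .mode("overwrite")
   .parquet("s3://bucket/events_compacted/date=2023-01-01/"))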

Deep Dive: Lambda vs. Kappa Architecture

How do you combine Batch (Accuracy) and Stream (Speed)?

1. Lambda Architecture (The Old Way):

  • Speed Layer: Kafka + Flink. Provides low-latency, approximate results.
  • Batch Layer: Hadoop/Spark. Provides high-latency, accurate results.
  • Serving Layer: Merges the two.
  • Pros: Robust. If Stream fails, Batch fixes it.
  • Cons: Maintenance Nightmare. You write logic twice (Java for Flink, Python for Spark).

2. Kappa Architecture (The New Way):

  • Everything is a Stream.
  • Batch is just a stream of bounded data.
  • Use Flink for both.
  • Pros: Single codebase.
  • Cons: Reprocessing history is harder (requires replaying the Kafka topic).

Engineering: File Formats (Parquet vs. Avro vs. ORC)

CSV/JSON are terrible for Big Data (Slow parsing, no schema, large size).

1. Parquet (Columnar):

  • Best for: Analytics (OLAP). “SELECT AVG(age) FROM users”.
  • Why: It only reads the “age” column from disk. Skips the rest.
  • Compression: Snappy/Gzip. Very efficient.

2. Avro (Row-based):

  • Best for: Write-heavy workloads, Kafka messages.
  • Why: Schema evolution is first-class. Good for appending data.

3. ORC (Optimized Row Columnar):

  • Best for: Hive/Presto. Similar to Parquet but optimized for the Hadoop ecosystem.

Deep Dive: Partitioning and Bucketing

How do you make SELECT * FROM logs WHERE date = '2023-01-01' fast?

1. Partitioning:

  • Organize folders by key.
  • s3://bucket/logs/date=2023-01-01/part-001.parquet
  • Spark automatically “prunes” partitions. It only scans the relevant folder.
  • Warning: Don’t partition by high-cardinality columns (UserID). You’ll get millions of tiny folders.

2. Bucketing:

  • Hash the key to N buckets.
  • hash(user_id) % 100.
  • Useful for Joins. If two tables are bucketed by user_id, the join is a “Sort-Merge Join” (no shuffle needed).
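
A PySpark sketch of both (df is assumed to be a DataFrame of click logs; paths and bucket counts are illustrative):

# Partitioning: one folder per date, so a "WHERE date = ..." query scans only that folder
df.write.partitionBy("date").mode("overwrite").parquet("s3://bucket/logs/")

# Bucketing: hash user_id into 100 buckets; requires saving as a table
(df.write
   .bucketBy(100, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("clicks_bucketed"))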

System Design: Handling Late Data

In Batch, “Daily” doesn’t mean “Midnight to Midnight”. Data arrives late (mobile device offline).

Strategies:

  1. Watermark: Wait for X hours (e.g., process “Yesterday” at 2 AM today).
  2. Lookback: When processing “Today”, also re-process “Yesterday” to catch late arrivals.
  3. Delta Lake / Hudi: These “Lakehouse” formats allow Upserts. You can update yesterday’s partition without rewriting the whole table.
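
A sketch of the Delta Lake upsert (uses the delta-spark package; the table path, late_events DataFrame, and join key are illustrative):

from delta.tables import DeltaTable

# late_events: events that arrived after the daily run was already processed
target = DeltaTable.forPath(spark, "s3://bucket/events_delta/")

(target.alias("t")
   .merge(late_events.alias("s"), "t.event_id = s.event_id")
   .whenMatchedUpdateAll()
   .whenNotMatchedInsertAll()
   .execute())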

Deep Dive: Idempotency

Definition: f(f(x)) = f(x). Running the job twice produces the same result. Why: Airflow will retry your job if it fails.

Anti-Pattern: INSERT INTO table SELECT ... (Running twice duplicates data).

Pattern: INSERT OVERWRITE table PARTITION (date='2023-01-01') SELECT ... (Running twice overwrites the partition).
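
The DataFrame-API equivalent is dynamic partition overwrite (daily_df and the output path are illustrative):

# Only the partitions present in daily_df are replaced, so retries are safe
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(daily_df.write
   .mode("overwrite")
   .partitionBy("date")
   .parquet("s3://bucket/features/"))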

Engineering: Data Quality with Great Expectations

The Nightmare: The upstream team changes age from “Years” (Int) to “Birthdate” (String). Your Spark job crashes at 3 AM.

Solution: Circuit Breakers. Use Great Expectations to validate data before processing.

import great_expectations as ge

df = ge.read_parquet("s3://bucket/input/")

# Define expectations
df.expect_column_values_to_be_not_null("user_id")
df.expect_column_values_to_be_between("age", 0, 120)
df.expect_table_row_count_to_be_between(10000, 100000)

# Validate
results = df.validate()
if not results["success"]:
    raise ValueError("Data Quality Check Failed!")

Appendix B: Workflow Orchestration Wars

Airflow:

  • Pros: Industry standard, huge community, Python.
  • Cons: Scheduling latency, complex setup, “The Scheduler Loop”.

Prefect:

  • Pros: “Negative Engineering” (handles retries/failures elegantly), hybrid execution model.
  • Cons: Smaller ecosystem.

Dagster:

  • Pros: Data-aware (knows about assets, not just tasks), strong typing.
  • Cons: Steep learning curve.

Appendix C: Interview Questions

  1. Q: “What is the difference between Transformation and Action in Spark?” A: Transformations (Map, Filter) are lazy (build the DAG). Actions (Count, Collect) trigger execution.

  2. Q: “How do you handle Skewed Data in a Join?” A:
    • Salting: Add a random number (salt) to the skewed key to split it.
    • Broadcast Join: If one table is small, broadcast it to all nodes.
  3. Q: “Explain the difference between Data Warehouse and Data Lake.” A:
    • Warehouse (Snowflake): Structured, Schema-on-Write, SQL, Expensive.
    • Lake (S3): Unstructured/Semi-structured, Schema-on-Read, Files, Cheap.
    • Lakehouse: Best of both (ACID on S3).
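
A quick sketch of the broadcast join from question 2 above (table names are illustrative):

from pyspark.sql.functions import broadcast

# countries is a small lookup table; broadcasting it to every executor
# turns the join into a map-side join and avoids the shuffle entirely
joined = clicks.join(broadcast(countries), on="country_code", how="left")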

Deep Dive: Spark Catalyst Optimizer

Why is Spark SQL faster than raw RDDs? Catalyst. It’s an extensible query optimizer.

Phases:

  1. Analysis: Resolve column names (SELECT name FROM users).
  2. Logical Optimization:
    • Predicate Pushdown: Move FILTER before JOIN.
    • Column Pruning: Read only used columns.
  3. Physical Planning: Choose join strategy (Broadcast vs. Sort-Merge).
  4. Code Generation: Generate Java bytecode on the fly (Whole-Stage Code Gen).
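
You can watch Catalyst work with explain() (Spark 3+; path and columns are illustrative):

df = spark.read.parquet("s3://bucket/users/")

# Prints the parsed, analyzed, optimized, and physical plans.
# Look for "PushedFilters" in the scan node to confirm predicate pushdown.
df.filter(df.age > 30).select("name").explain(mode="formatted")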

Deep Dive: The Shuffle (Timsort)

The bottleneck of any distributed system is the Shuffle. Moving data from Mapper to Reducer over the network.

Sort-Based Shuffle: Spark sorts data on the Mapper side before sending it. It uses Timsort (Hybrid of Merge Sort and Insertion Sort).

  • Complexity: O(N log N).
  • Memory: Efficient. Spills to disk if RAM is full.

Tuning: spark.sql.shuffle.partitions (Default 200).

  • Too low: OOM (Out of Memory).
  • Too high: Too many small files/tasks.
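
Typical knobs (values are illustrative; Spark 3's Adaptive Query Execution can coalesce shuffle partitions for you):

spark.conf.set("spark.sql.shuffle.partitions", 400)                    # default is 200
spark.conf.set("spark.sql.adaptive.enabled", True)                     # AQE (Spark 3+)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)  # merge tiny shuffle partitions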

System Design: Data Lineage (OpenLineage)

Problem: “The revenue number is wrong. Where did it come from?” Solution: Data Lineage.

OpenLineage: Standard spec for lineage.

  • Job: “Daily Revenue ETL”.
  • Input: s3://bucket/orders.
  • Output: s3://bucket/revenue.

Marquez: An Open Source lineage server. It visualizes the graph: Postgres -> Spark -> S3 -> Snowflake -> Tableau. If the dashboard breaks, you trace it back to the source.

Engineering: CI/CD for Data Pipelines

Software Engineers have CI/CD. Data Engineers usually test in production. Don’t.

Pipeline:

  1. Unit Test: Test individual Python functions (PyTest).
  2. Integration Test: Run the DAG on a small sample dataset (Dockerized Airflow).
  3. Staging: Deploy to Staging environment. Run on full data (or 10%).
  4. Production: Deploy.

Tools:

  • DataOps: The philosophy of applying DevOps to Data.
  • dbt test: Validates SQL logic.
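
A minimal sketch of the unit-test step (the add_click_rate function and module are hypothetical):

# test_transforms.py
import pandas as pd
from my_pipeline.transforms import add_click_rate  # hypothetical module

def test_add_click_rate():
    df = pd.DataFrame({"clicks": [10, 0], "impressions": [100, 50]})
    out = add_click_rate(df)
    assert list(out["click_rate"]) == [0.1, 0.0]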

FinOps: Autoscaling Strategies

Batch jobs are bursty. You need 100 nodes for 1 hour, then 0.

1. Cluster Autoscaling:

  • If pending tasks > 0, add nodes.
  • If CPU utilization < 50%, remove nodes.

2. Spot Instances:

  • Use AWS Spot Instances (up to 90% cheaper than On-Demand).
  • Risk: AWS can reclaim them with 2-minute warning.
  • Mitigation: Spark is fault-tolerant. If a node dies, the driver reschedules the task on another node.

Case Study: Netflix Data Mesh

Netflix moved from a Monolithic Data Lake to a Data Mesh. Principles:

  1. Domain-Oriented Ownership: The “Content Team” owns the “Content Data Product”.
  2. Data as a Product: Data must have SLAs, Documentation, and Quality Checks.
  3. Self-Serve Infrastructure: Platform team provides the tools (Spark/Airflow), Domain teams build the pipelines.
  4. Federated Governance: Global standards (GDPR), local implementation.

Appendix D: The “Small Files” Problem (Revisited)

Compaction Strategies:

  1. Coalesce: df.coalesce(10). Moves data to fewer partitions. No shuffle.
  2. Repartition: df.repartition(10). Full shuffle. Balances data perfectly.
  3. Bin-Packing: Combine small files into a single task during reading (spark.sql.files.maxPartitionBytes).

Appendix E: Advanced Interview Questions

  1. Q: “What is the difference between repartition() and coalesce()?” A: repartition does a full shuffle (network I/O). coalesce just merges local partitions (no shuffle). Use coalesce to reduce file count.

  2. Q: “How do you handle a Hot Key in a Join?” A: If “Justin Bieber” has 100M clicks, the node processing him will OOM.
    • Solution: Salt the key. key = key + random(1, 100). Explode the other table 100 times.
  3. Q: “What is a Broadcast Variable?” A: A read-only variable cached on every machine. Used to send a small lookup table (Country Codes) to all workers to avoid a Shuffle Join.
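
A sketch of the salting answer from question 2 above (DataFrames and the salt count are illustrative):

import pyspark.sql.functions as F

SALT_BUCKETS = 100

# Large, skewed side: append a random salt so the hot key spreads over 100 partitions
clicks_salted = clicks.withColumn(
    "salted_key",
    F.concat_ws("_", "user_id", (F.rand() * SALT_BUCKETS).cast("int")))

# Small side: replicate each row once per salt value so every salted key finds a match
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
users_salted = (users.crossJoin(salts)
    .withColumn("salted_key", F.concat_ws("_", "user_id", F.col("salt"))))

joined = clicks_salted.join(users_salted, on="salted_key")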

Deep Dive: Bloom Filters (Probabilistic Data Structures)

Problem: You have 1 Billion URLs. You want to check if a new URL is already in the set. Naive: Store all URLs in a HashSet. (Requires 100GB RAM). Solution: Bloom Filter. (Requires 1GB RAM).

Mechanism:

  1. Bit array of size M.
  2. K hash functions.
  3. Add(item): Hash item K times. Set bits at those indices to 1.
  4. Check(item): Hash item K times. If all bits are 1, return “Maybe Present”. If any bit is 0, return “Definitely Not Present”.

False Positives: Possible. False Negatives: Impossible.
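
A toy implementation to make the mechanism concrete (bit-array size and hash count are illustrative; production systems use tuned libraries):

import hashlib

class BloomFilter:
    def __init__(self, m=1_000_000, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _indices(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def check(self, item):
        # All k bits set -> "maybe present"; any bit unset -> "definitely not present"
        return all(self.bits[idx // 8] & (1 << (idx % 8)) for idx in self._indices(item))

bf = BloomFilter()
bf.add("https://example.com")
print(bf.check("https://example.com"))      # True  (maybe present)
print(bf.check("https://never-added.com"))  # False (definitely not present)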

Deep Dive: HyperLogLog (Cardinality Estimation)

Problem: Count unique visitors (DAU) for Facebook (2 Billion users). Naive: SELECT COUNT(DISTINCT user_id). Requires storing all IDs. Slow. Solution: HyperLogLog (HLL).

Mechanism:

  1. Hash the user ID.
  2. Count the number of leading zeros in the binary hash.
  3. If you see a hash with 10 leading zeros, you probably saw 2^10 items.
  4. Average this across many buckets (Harmonic Mean).

Accuracy: Roughly 99%, using only about 1.5KB of memory.
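
You rarely implement HLL yourself; Spark exposes it as approx_count_distinct (the events DataFrame is illustrative):

from pyspark.sql.functions import approx_count_distinct

# Backed by HyperLogLog++; rsd is the target relative standard deviation
dau = events.agg(
    approx_count_distinct("user_id", rsd=0.01).alias("daily_active_users"))
dau.show()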

Deep Dive: Count-Min Sketch (Frequency Estimation)

Problem: Find the “Top 10” most popular songs. Solution: Count-Min Sketch.

Mechanism:

  1. 2D Array [Depth][Width].
  2. Add(item): Hash item Depth times. Increment the counter at [d][hash(item)].
  3. Query(item): Return min(counters) for that item.

Why Min? Because collisions only increase the count. The minimum is the closest to the truth.
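
A toy Count-Min Sketch (depth and width are illustrative):

import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=1000):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, d):
        h = hashlib.sha256(f"{d}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.width

    def add(self, item, count=1):
        for d in range(self.depth):
            self.table[d][self._hash(item, d)] += count

    def query(self, item):
        # Collisions only inflate counters, so the minimum row is closest to the truth
        return min(self.table[d][self._hash(item, d)] for d in range(self.depth))

cms = CountMinSketch()
for song in ["song_a", "song_a", "song_b"]:
    cms.add(song)
print(cms.query("song_a"))  # ~2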

Engineering: Handling PII with Hashing/Salting

Requirement: GDPR “Right to be Forgotten”. Problem: If you delete a user from the DB, their ID is still in the logs/backups.

Solution: Crypto-Shredding.

  1. Don’t store user_id. Store HMAC(user_id, key).
  2. Store the key in a separate Key Management Service (KMS).
  3. To “forget” a user, delete their key.
  4. Now the logs contain opaque tokens that can never be linked back to the user.
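
A sketch of the pseudonymization step (real key management lives in a KMS; the key literal here is illustrative):

import hmac
import hashlib

def pseudonymize(user_id: str, key: bytes) -> str:
    # Store this token in logs instead of the raw user_id
    return hmac.new(key, user_id.encode(), hashlib.sha256).hexdigest()

key = b"per-user-key-fetched-from-kms"  # in reality: one key per user, held only in the KMS
token = pseudonymize("user_12345", key)
# Deleting the key from the KMS makes the token permanently unlinkable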

Case Study: Uber’s Michelangelo (Batch Training)

Uber built an internal ML-as-a-Service platform. Workflow:

  1. Feature Store: Hive tables containing pre-computed features (avg_ride_cost_7d).
  2. Selection: Data Scientist selects features in UI.
  3. Join: Spark job joins features with labels (Point-in-Time correct).
  4. Train: Distributed XGBoost / Horovod (Deep Learning).
  5. Model Store: Versioned artifact saved to S3.
  6. Serving: Model deployed to a Docker container.

Appendix F: The “Thundering Herd” Problem

Scenario: 10,000 Airflow tasks are scheduled for 00:00. Result: The Scheduler crashes. The Database CPU spikes to 100%.

Solution:

  1. Jitter: Add a random delay (0-60s) to the start time.
  2. Pools: Limit concurrency. pool='heavy_sql', slots=10.
  3. Sensor Deferral: Use deferrable (async) sensors instead of blocking worker slots (Airflow's SmartSensors were deprecated in favor of deferrable operators).
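
A sketch of jitter plus a pool limit on a task (the pool must already exist, e.g. created via the CLI with 10 slots; names are illustrative):

import random
import time
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_with_jitter():
    time.sleep(random.uniform(0, 60))  # spread out the midnight stampede
    print("Extracting...")

with DAG("jittered_job", start_date=datetime(2023, 1, 1),
         schedule_interval="0 0 * * *", catchup=False) as dag:
    extract = PythonOperator(
        task_id="extract",
        python_callable=extract_with_jitter,
        pool="heavy_sql",  # limits how many of these run at once
    )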

Conclusion

Batch processing is the workhorse of ML. While real-time is appealing, batch is reliable, replayable, and easy to debug.

FAQ

When should I use batch processing instead of stream processing for ML?

Use batch processing for model training, daily feature computation, reporting, and backfilling historical data. Batch provides high throughput and high latency (process 1TB in 1 hour), while streaming provides low throughput and low latency (process 1 event in 10ms). About 90 percent of ML workloads are batch, and batch is simpler to debug, replay, and maintain.

What is the small files problem in data lakes and how do you fix it?

Saving millions of tiny files to S3 causes high API costs (per PUT/GET request) and slow Spark performance due to file listing and opening overhead. The fix is compaction: run a nightly job to merge small files into 128MB-1GB files using repartition or coalesce. Use coalesce to reduce file count without a full shuffle, and repartition when you need balanced partitions.

What is the difference between Lambda and Kappa architecture?

Lambda architecture runs separate batch (Spark) and speed (Flink) layers that merge results, providing robustness but requiring duplicate logic. Kappa architecture treats everything as a stream, using Flink for both real-time and historical reprocessing from a single codebase. Kappa is simpler to maintain but harder for historical reprocessing.

How do you ensure idempotency in batch pipelines?

Idempotency means running a job twice produces the same result. Use INSERT OVERWRITE instead of INSERT INTO to avoid duplicating data on retries. Partition data by date and overwrite entire partitions. Airflow will retry failed tasks automatically, so every task must be safe to re-run without side effects.


Originally published at: arunbaby.com/ml-system-design/0026-batch-processing-pipelines

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch