Batch Serving Pattern

What It Is

A machine learning inference pattern where predictions are generated for large datasets at scheduled intervals rather than in response to individual real-time requests. The model processes accumulated data in bulk batches, writing results to storage for later consumption by downstream applications. This pattern decouples prediction generation from prediction consumption, optimizing for throughput over latency.

When to Use It

Predictions don't require immediate (sub-second) responses
Processing millions of predictions daily (recommendation feeds, email campaigns, report generation)
Compute costs matter more than latency (batch jobs run during off-peak hours)
Input data naturally accumulates in batches (daily transactions, overnight data pipelines)
High-throughput requirements where parallelization across many records is possible
Predictions consumed by scheduled processes (daily dashboards, weekly reports, nightly ETL jobs)

Execution Steps

1. Define Batch Cadence and Scope

Determine how frequently batches run and what data each batch processes. Cadence balances freshness requirements against computational efficiency.

Action: Answer: How fresh do predictions need to be? (Real-time, hourly, daily, weekly?) What triggers a batch? (Schedule, data threshold, manual trigger?) What's the input data source? (Database table, data warehouse, object storage?) Document: "Run batch {cadence} processing {data_volume} records from {source} writing to {destination}."

2. Design Data Partitioning Strategy

Split large datasets into processable chunks to enable parallel processing and failure recovery. Partitioning prevents memory overflow and allows retrying failed partitions without reprocessing everything.

Action: Choose partitioning scheme: Time-based (process records by date/hour), ID-based (partition by user_id ranges), Size-based (fixed chunks of N records), or Hybrid (time + size). Implement partition tracking to record which partitions completed successfully. Write partition metadata to enable idempotent retries.

3. Implement Batch Inference Pipeline

Build the ETL pipeline that loads data, applies the model, and writes predictions. Optimize for throughput using vectorization and parallelization.

Action: Load batch data efficiently (use columnar formats like Parquet, query only needed columns). Apply model inference using batch prediction APIs (model.predict(batch) not model.predict(record) in loop). Vectorize operations using NumPy/Pandas for 10-100x speedup. Write predictions with batch writes (bulk inserts, not row-by-row).

4. Configure Resource Allocation

Size compute resources for batch workload. Batch jobs typically use large instances for short durations rather than small instances continuously.

Action: Estimate memory needs: model_size + batch_size * record_size + overhead. Choose instance type: CPU for tree-based models, GPU for deep learning. Implement parallelization: multiple workers processing different partitions simultaneously. Monitor resource utilization to right-size instances (target 70-90% CPU/GPU usage).

5. Handle Failures and Monitoring

Build robustness for multi-hour batch jobs where transient failures are inevitable. Track progress to enable resumption and debugging.

Action: Implement checkpointing: save progress after each partition completes. Design retry logic: exponential backoff for transient errors, skip and log for data errors. Log metrics per batch: records processed, predictions generated, processing time, error count. Set up alerts: batch didn't complete within expected window, error rate exceeds threshold.

6. Optimize Prediction Storage and Access

Structure prediction output for efficient consumption by downstream systems. Consider storage format, indexing, and access patterns.

Action: Choose storage: Database (PostgreSQL, MySQL) for relational queries, Data warehouse (BigQuery, Redshift) for analytics, Object storage (S3, GCS) for large-scale dumps, Cache (Redis) for high-frequency access. Add indexes on query patterns (user_id, timestamp). Include metadata: prediction timestamp, model version, confidence scores. Implement retention policy: delete predictions older than N days.

Real-World Applications

Recommendation Systems

Netflix generates personalized video recommendations nightly for 200M+ users
Spotify creates "Discover Weekly" playlists every Monday using batch inference
Amazon pre-computes product recommendations for browsing pages

Marketing and Communication

Email service providers score engagement probability for millions of recipients before campaign send
Ad platforms pre-generate audience targeting scores overnight for next-day campaign optimization
CRM systems batch-score lead quality for sales team prioritization

Risk and Fraud

Credit card companies generate daily fraud risk scores for all active accounts
Insurance providers batch-calculate claim fraud likelihood for overnight review queues
Banks compute credit risk scores monthly for portfolio monitoring

Business Intelligence

Retailers forecast demand for thousands of SKUs across hundreds of locations daily
Healthcare systems predict patient readmission risk for care management programs
Financial services generate customer churn predictions weekly for retention campaigns

Anti-Patterns

Using batch serving for time-sensitive decisions → User experiences stale predictions; if decision impacts immediate user experience (fraud detection, search ranking), use real-time inference.

Processing entire dataset when only subset changed → Wastes compute on unchanged records; implement incremental processing to predict only new/updated records.

No partition tolerance → Single record failure kills entire batch; implement partition-level error handling to skip bad records without reprocessing millions.

Ignoring model versioning → Can't reproduce predictions or diagnose issues; always tag predictions with model version and timestamp.

Writing predictions without indexes → Downstream queries are slow; add indexes on access patterns before batch completes.

Running batch during peak hours → Competes with user-facing workloads for resources; schedule during off-peak windows (nights, weekends).

Success Metrics

Batch completes within scheduled window (e.g., 4-hour job finishes in 3 hours with buffer)
Cost per prediction decreases vs real-time serving (typically 10-100x cheaper)
Throughput meets business requirements (millions of predictions per hour)
Error rate under threshold (<0.1% of predictions fail)
Downstream systems consume predictions successfully (no data quality issues)
Resource utilization efficient (70-90% CPU/GPU usage during processing)

Related Frameworks

Streaming Inference Pattern: Real-time alternative when low latency required
Online Learning Pattern: Continuous model updates as new data arrives
Lambda Architecture: Combining batch and streaming for comprehensive processing
Feature Store Pattern: Managing feature computation for batch inference

Common Pitfalls

Not accounting for data growth leading to batch duration exceeding window
Lack of idempotency causing duplicate predictions on retries
Insufficient monitoring making debugging failures difficult
No gradual rollout when deploying new model versions
Overwriting previous predictions without audit trail
Not partitioning large batches leading to memory issues
Forgetting to version control prediction schemas causing breaking changes

Tools & Resources

Batch Processing Frameworks: Apache Spark, Dask, Ray for distributed processing
Workflow Orchestration: Airflow, Prefect, Kubeflow for scheduling and monitoring
Model Serving: TensorFlow Batch Prediction, PyTorch inference, Scikit-learn joblib
Storage Solutions: PostgreSQL, BigQuery, Redshift, S3, Parquet files
Monitoring: Datadog, Prometheus, CloudWatch for job observability
References: "Machine Learning Design Patterns" (Lakshmanan et al.), Google's "Rules of Machine Learning", Databricks ML guides

Batch vs Real-Time Decision Matrix

Choose Batch Serving When:

Latency tolerance > 1 hour
Processing millions of predictions per job
Predictions consumed by scheduled processes
Cost optimization is priority
Input data arrives in natural batches

Choose Real-Time Serving When:

Latency requirement < 100ms
Predictions drive immediate user actions
Request-response pattern required
Input data arrives as individual events
Freshness critical (fraud, content ranking)

Hybrid Approach: Many production systems use both patterns - batch for base recommendations/scores, real-time for personalization adjustments.

Framework Type: ML System Design Pattern Domain: Machine Learning Operations, System Design Practitioner Score: 9/10 - Foundational pattern for production ML, used by all major platforms Complexity: Medium - Requires understanding of distributed systems, ETL pipelines, model deployment Prerequisites: ML model deployment basics, batch processing frameworks, data pipeline design