ml-system-design

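End-to-end design of production-grade ML systems. Use when designing ML pipelines, feature stores, model training infrastructure, or serving systems. Covers the full lifecycle from data ingestion to model deployment and monitoring.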

Author: jakexiaohub (GitHub)

ML System Design

This skill provides frameworks for designing production machine learning systems, from data pipelines to model serving.

When to Use This Skill

Keywords: ML pipeline, machine learning system, feature store, model training, model serving, ML infrastructure, MLOps, A/B testing ML, feature engineering, model deployment

Use this skill when:

  • Designing end-to-end ML systems for production
  • Planning feature store architecture
  • Designing model training pipelines
  • Planning model serving infrastructure
  • Preparing for ML system design interviews
  • Evaluating ML platform tools and frameworks

ML System Architecture Overview

The ML System Lifecycle

┌─────────────────────────────────────────────────────────────────────────┐
│                        ML SYSTEM LIFECYCLE                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐ │
│  │  Data    │──▶│ Feature  │──▶│  Model   │──▶│  Model   │──▶│ Monitor│ │
│  │ Ingestion│   │ Pipeline │   │ Training │   │ Serving  │   │ & Eval │ │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └────────┘ │
│       │              │              │              │              │      │
│       ▼              ▼              ▼              ▼              ▼      │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────┐ │
│  │  Data    │   │ Feature  │   │  Model   │   │ Inference│   │ Metrics│ │
│  │  Lake    │   │  Store   │   │ Registry │   │  Cache   │   │  Store │ │
│  └──────────┘   └──────────┘   └──────────┘   └──────────┘   └────────┘ │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Key Components

| Component | Purpose | Examples |
| --------- | ------- | -------- |
| Data Ingestion | Collect raw data from sources | Kafka, Kinesis, Pub/Sub |
| Feature Pipeline | Transform raw data to features | Spark, Flink, dbt |
| Feature Store | Store and serve features | Feast, Tecton, Vertex AI |
| Model Training | Train and validate models | SageMaker, Vertex AI, Kubeflow |
| Model Registry | Version and track models | MLflow, Weights & Biases |
| Model Serving | Serve predictions | TensorFlow Serving, Triton, vLLM |
| Monitoring | Track model performance | Evidently, WhyLabs, Arize |

Feature Store Architecture

Why Feature Stores?

Problems without a feature store:

  • Training-serving skew (features computed differently)
  • Duplicate feature computation across teams
  • No feature versioning or lineage
  • Slow feature experimentation

Feature Store Components

┌─────────────────────────────────────────────────────────────────┐
│                      FEATURE STORE                               │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────────┐       ┌─────────────────────┐          │
│  │   OFFLINE STORE     │       │   ONLINE STORE      │          │
│  │                     │       │                     │          │
│  │  - Historical data  │       │  - Low-latency      │          │
│  │  - Training queries │ ────▶ │  - Point lookups    │          │
│  │  - Batch features   │ sync  │  - Real-time serving│          │
│  │                     │       │                     │          │
│  │  (Data Warehouse)   │       │  (Redis, DynamoDB)  │          │
│  └─────────────────────┘       └─────────────────────┘          │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                   FEATURE REGISTRY                          ││
│  │  - Feature definitions    - Version control                 ││
│  │  - Data lineage          - Access control                   ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘

Feature Types

| Type | Computation | Storage | Example |
| ---- | ----------- | ------- | ------- |
| Batch | Scheduled (hourly/daily) | Offline → Online | User purchase count (30 days) |
| Streaming | Real-time event processing | Direct to online | Items in cart (current) |
| On-demand | Request-time computation | Not stored | Distance to nearest store |

Training-Serving Consistency

TRAINING (Historical):
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Historical   │───▶│ Point-in-Time│───▶│  Training    │
│ Events       │    │ Join         │    │  Dataset     │
└──────────────┘    └──────────────┘    └──────────────┘
                          │
                    Uses feature
                    definitions
                          │
SERVING (Real-time):      ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Online       │───▶│ Same Feature │───▶│  Prediction  │
│ Store        │    │ Definitions  │    │  Request     │
└──────────────┘    └──────────────┘    └──────────────┘
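
The point-in-time join in the training path can be sketched in plain Python. This is a minimal illustration, not how any particular feature store implements it: the tuple layouts, the `point_in_time_join` name, and the use of ISO date strings as timestamps are all assumptions made for the example. The idea is that each training row only sees the latest feature value recorded at or before its event time, which mirrors what the online store would have returned at serving time.

```python
from bisect import bisect_right
from collections import defaultdict


def point_in_time_join(label_events, feature_rows):
    """Attach to each label event the latest feature value whose timestamp
    is <= the event time, so no future data leaks into training.

    label_events: iterable of (entity_id, event_time, label)
    feature_rows: iterable of (entity_id, feature_time, value)
    Timestamps only need to be mutually comparable (ISO strings here).
    """
    history = defaultdict(list)
    for entity, ts, value in sorted(feature_rows, key=lambda r: (r[0], r[1])):
        history[entity].append((ts, value))

    joined = []
    for entity, event_time, label in label_events:
        rows = history.get(entity, [])
        times = [ts for ts, _ in rows]
        idx = bisect_right(times, event_time)  # rows[:idx] are at or before the event
        value = rows[idx - 1][1] if idx > 0 else None  # None: no feature known yet
        joined.append((entity, event_time, value, label))
    return joined


# Hypothetical data: purchase_count_30d snapshots and labeled events.
features = [("u1", "2025-01-01", 3), ("u1", "2025-01-15", 7), ("u2", "2025-01-12", 4)]
labels = [("u1", "2025-01-05", 0), ("u1", "2025-01-20", 1), ("u2", "2025-01-10", 0)]
training_rows = point_in_time_join(labels, features)
```

Note that the `u2` event on 2025-01-10 gets no feature value, because its only snapshot was written two days later. A naive "latest value" join would have leaked it.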

Model Training Infrastructure

Training Pipeline Components

┌───────────────────────────────────────────────────────────────────────┐
│                     TRAINING PIPELINE                                  │
├───────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐   │
│  │   Data     │──▶│   Feature  │──▶│   Model    │──▶│  Model     │   │
│  │   Loader   │   │   Transform│   │   Train    │   │  Validate  │   │
│  └────────────┘   └────────────┘   └────────────┘   └────────────┘   │
│        │               │                │                 │           │
│        ▼               ▼                ▼                 ▼           │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐   ┌────────────┐   │
│  │ Experiment │   │ Hyperparam │   │ Checkpoint │   │   Model    │   │
│  │  Tracking  │   │   Tuning   │   │  Storage   │   │  Registry  │   │
│  └────────────┘   └────────────┘   └────────────┘   └────────────┘   │
│                                                                        │
└───────────────────────────────────────────────────────────────────────┘

Training Infrastructure Patterns

| Pattern | Use Case | Tools |
| ------- | -------- | ----- |
| Single-node | Small datasets, quick experiments | Jupyter, local GPU |
| Distributed data-parallel | Large datasets, same model | Horovod, PyTorch DDP |
| Model-parallel | Large models that don't fit in memory | DeepSpeed, FSDP, Megatron |
| Hyperparameter tuning | Automated model optimization | Optuna, Ray Tune |

Experiment Tracking

Track for reproducibility:

| What to Track | Why |
| ------------- | --- |
| Hyperparameters | Reproduce training runs |
| Metrics | Compare model performance |
| Artifacts | Model files, datasets |
| Code version | Git commit hash |
| Environment | Docker image, dependencies |
| Data version | Dataset hash or snapshot |
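
One way to make the tracked items concrete is a small record type. The sketch below is illustrative only: `RunRecord`, `dataset_fingerprint`, and the field names are hypothetical and do not correspond to the API of MLflow, Weights & Biases, or any other tracker. It derives a run ID from the inputs that determine the result (hyperparameters, code, environment, data), so two runs with identical inputs are recognisably the same experiment.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical experiment record covering the items in the table above."""
    hyperparameters: dict
    code_version: str          # e.g. a Git commit hash
    environment: str           # e.g. a Docker image tag
    data_version: str          # e.g. a dataset hash or snapshot ID
    metrics: dict = field(default_factory=dict)
    artifacts: list = field(default_factory=list)

    def run_id(self) -> str:
        """Stable ID built only from inputs, not from metrics or artifacts."""
        inputs = {
            "hyperparameters": self.hyperparameters,
            "code_version": self.code_version,
            "environment": self.environment,
            "data_version": self.data_version,
        }
        payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


def dataset_fingerprint(rows) -> str:
    """Hash JSON-serialisable rows in order to get a short data version string."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()[:12]


run = RunRecord(
    hyperparameters={"lr": 0.001, "batch_size": 32},
    code_version="abc1234",               # placeholder commit hash
    environment="trainer:example-tag",    # placeholder image tag
    data_version=dataset_fingerprint([{"x": 1, "y": 0}, {"x": 2, "y": 1}]),
)
```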

Model Serving Architecture

Serving Patterns

| Pattern | Latency | Throughput | Use Case |
| ------- | ------- | ---------- | -------- |
| Online (REST/gRPC) | Low (<100ms) | Medium | Real-time predictions |
| Batch | High (hours) | Very high | Bulk scoring |
| Streaming | Medium | High | Event-driven predictions |
| Embedded | Very low | Varies | Edge/mobile inference |

Online Serving Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                     MODEL SERVING SYSTEM                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌──────────────┐                                                  │
│   │   Clients    │                                                  │
│   └──────┬───────┘                                                  │
│          │                                                          │
│          ▼                                                          │
│   ┌──────────────┐                                                  │
│   │ Load Balancer│                                                  │
│   └──────┬───────┘                                                  │
│          │                                                          │
│          ▼                                                          │
│   ┌──────────────────────────────────────────────────────────────┐  │
│   │                    API Gateway                                │  │
│   │  - Authentication   - Rate limiting   - Request validation   │  │
│   └──────────────────────────────┬───────────────────────────────┘  │
│                                  │                                  │
│          ┌───────────────────────┼───────────────────────┐         │
│          ▼                       ▼                       ▼         │
│   ┌────────────┐          ┌────────────┐          ┌────────────┐  │
│   │  Model A   │          │  Model B   │          │  Model C   │  │
│   │  (v1.2)    │          │  (v2.0)    │          │  (v1.0)    │  │
│   └────────────┘          └────────────┘          └────────────┘  │
│          │                       │                       │         │
│          └───────────────────────┼───────────────────────┘         │
│                                  ▼                                  │
│                         ┌────────────────┐                         │
│                         │ Feature Store  │                         │
│                         │ (Online)       │                         │
│                         └────────────────┘                         │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Latency Optimization

| Technique | Latency Impact | Trade-off |
| --------- | -------------- | --------- |
| Batching | Lowers amortized per-request compute (higher throughput) | Requests wait for the batch to fill, adding queueing delay |
| Caching | 10-100x faster on cache hits | May serve stale predictions |
| Quantization | 2-4x faster | Slight accuracy loss |
| Distillation | Variable | Training overhead |
| GPU inference | 10-100x faster | Cost increase |
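
The caching trade-off (fast hits, possibly stale answers) can be illustrated with a small time-to-live cache. This is a sketch under stated assumptions: `TTLPredictionCache` and its parameters are hypothetical names, the cache is unbounded and single-threaded, and a production system would more likely use a shared store such as Redis with eviction.

```python
import time


class TTLPredictionCache:
    """Cache predictions per key and recompute once an entry is older than the TTL."""

    def __init__(self, predict_fn, ttl_seconds, clock=time.monotonic):
        self._predict_fn = predict_fn
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so the TTL logic can be tested
        self._entries = {}             # key -> (stored_at, prediction)

    def predict(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # hit: result may be up to ttl_seconds stale
        prediction = self._predict_fn(key)   # miss or expired: call the model
        self._entries[key] = (now, prediction)
        return prediction
```

The TTL is the knob that trades freshness for latency: a longer TTL raises the hit rate but widens the window in which a changed feature or a newly deployed model is not reflected in the response.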

A/B Testing ML Models

Experiment Design

┌─────────────────────────────────────────────────────────────────────┐
│                      A/B TESTING ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│   ┌──────────────┐                                                  │
│   │   Traffic    │                                                  │
│   └──────┬───────┘                                                  │
│          │                                                          │
│          ▼                                                          │
│   ┌──────────────────────┐                                          │
│   │ Experiment Assignment │ ◀─────── Experiment Config              │
│   │ - User bucketing      │          - Allocation %                 │
│   │ - Feature flags       │          - Target segments              │
│   └──────────┬───────────┘          - Guardrails                   │
│              │                                                       │
│     ┌────────┴────────┐                                             │
│     ▼                 ▼                                             │
│ ┌────────┐       ┌─────────┐                                        │
│ │Control │       │Treatment│                                        │
│ │Model A │       │Model B  │                                        │
│ └────┬───┘       └────┬────┘                                        │
│      │                │                                              │
│      └────────┬───────┘                                             │
│               ▼                                                      │
│      ┌────────────────┐                                             │
│      │ Metrics Logger │                                             │
│      └────────┬───────┘                                             │
│               ▼                                                      │
│      ┌────────────────┐                                             │
│      │ Statistical    │ ─────▶ Decision: Ship / Iterate / Kill     │
│      │ Analysis       │                                             │
│      └────────────────┘                                             │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
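
The "User bucketing" step in the diagram is commonly done by hashing, so that assignment is deterministic and needs no lookup table. The following is a minimal sketch: the function name, the 10,000-bucket granularity, and the choice of SHA-256 are illustrative assumptions, not a prescribed design.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_pct: float) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    treatment_pct is the allocation in percent (0-100). Salting the hash with
    the experiment name keeps bucket assignments independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999, 0.01% steps
    return "treatment" if bucket < treatment_pct * 100 else "control"
```

Because the assignment depends only on the user ID and experiment name, a user sees the same model on every request, and the metrics logger can recompute the variant offline instead of trusting a logged field.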

Metrics to Track

| Metric Type | Examples | Purpose |
| ----------- | -------- | ------- |
| Model metrics | AUC, RMSE, precision/recall | Model quality |
| Business metrics | CTR, conversion, revenue | Business impact |
| Guardrail metrics | Latency, error rate, engagement | Prevent regressions |
| Segment metrics | Metrics by user segment | Detect heterogeneous effects |

Statistical Considerations

  • Sample size: Run a power analysis before the experiment starts
  • Duration: Run long enough to cover novelty effects and weekly or seasonal patterns
  • Multiple testing: Correct for multiple metrics (Bonferroni, FDR)
  • Early stopping: Use sequential testing methods instead of repeatedly checking a fixed-horizon test
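
As a rough sketch of the sample-size point, the standard normal-approximation formula for comparing two proportions (for example, conversion rates) can be computed with the standard library. The function name and defaults are assumptions for illustration. The `n_metrics` parameter applies a Bonferroni correction by dividing alpha across metrics, which is the simplest and most conservative of the adjustments listed above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8, n_metrics=1):
    """Users needed in each arm for a two-sided two-proportion z-test.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    With n_metrics > 1, alpha is Bonferroni-corrected to alpha / n_metrics.
    """
    corrected_alpha = alpha / n_metrics
    z_alpha = NormalDist().inv_cdf(1 - corrected_alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Hypothetical scenario: detect a lift from 10% to 11% conversion.
n_single_metric = sample_size_per_arm(0.10, 0.11)
n_five_metrics = sample_size_per_arm(0.10, 0.11, n_metrics=5)
```

For this scenario the formula gives on the order of fifteen thousand users per arm, and correcting for five metrics raises the requirement further. Halving the detectable effect roughly quadruples the sample size, which is usually the dominant factor when planning duration.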

Model Monitoring

What to Monitor

| Category | Metrics | Alert Threshold |
| -------- | ------- | --------------- |
| Data quality | Missing values, schema drift | >1% change |
| Feature drift | Distribution shift (PSI, KL) | PSI >0.2 |
| Prediction drift | Output distribution shift | Depends on use case |
| Model performance | Accuracy, AUC (when labels available) | >5% degradation |
| Operational | Latency, throughput, errors | SLO violations |

Drift Detection

┌─────────────────────────────────────────────────────────────────────┐
│                      DRIFT DETECTION PIPELINE                        │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Training Data                 Production Data                       │
│  ┌──────────────┐              ┌──────────────┐                     │
│  │ Reference    │              │   Current    │                     │
│  │ Distribution │              │ Distribution │                     │
│  └──────┬───────┘              └──────┬───────┘                     │
│         │                             │                              │
│         └──────────────┬──────────────┘                              │
│                        ▼                                             │
│              ┌──────────────────┐                                   │
│              │ Statistical Test │                                   │
│              │ - PSI (Population Stability Index)                   │
│              │ - KS Test                                            │
│              │ - Chi-squared                                        │
│              └────────┬─────────┘                                   │
│                       ▼                                              │
│              ┌──────────────────┐                                   │
│              │  Drift Score     │                                   │
│              └────────┬─────────┘                                   │
│                       │                                              │
│           ┌───────────┼───────────┐                                 │
│           ▼           ▼           ▼                                 │
│      No Drift    Warning     Critical                               │
│      (< 0.1)    (0.1-0.2)    (> 0.2)                               │
│         │           │           │                                   │
│         ▼           ▼           ▼                                   │
│      Continue    Investigate   Retrain                              │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘
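
The PSI branch of this pipeline can be sketched in a few lines. This is one possible implementation, with several assumptions called out: bins are equal-width over the reference range (quantile bins are also common), empty bins are floored at a small `eps` to keep the logarithm finite, and the function names are hypothetical. The thresholds are the 0.1 and 0.2 cut-offs shown in the diagram.

```python
from math import log


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of a numeric feature.

    PSI = sum over bins of (cur_share - ref_share) * ln(cur_share / ref_share)
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0      # avoid zero width for constant data

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1   # clamp out-of-range values
        return [max(c / len(values), eps) for c in counts]

    ref, cur = bin_shares(reference), bin_shares(current)
    return sum((c - r) * log(c / r) for r, c in zip(ref, cur))


def drift_status(score):
    """Map a PSI score to the three outcomes in the diagram."""
    if score < 0.1:
        return "no_drift"      # continue
    if score <= 0.2:
        return "warning"       # investigate
    return "critical"          # retrain


# Hypothetical feature values: production is shifted upward by 3 units.
reference_sample = [i / 100 for i in range(1000)]
shifted_sample = [v + 3 for v in reference_sample]
status = drift_status(psi(reference_sample, shifted_sample))
```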

Common ML System Design Patterns

Pattern 1: Recommendation System

Components needed:
- Candidate Generation (retrieve 100s-1000s)
- Ranking Model (score and sort)
- Feature Store (user features, item features)
- Real-time personalization (recent behavior)
- A/B testing infrastructure
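
The split between candidate generation and ranking can be sketched as a two-stage function. Everything here is illustrative: the embeddings, the `rank_score` callable standing in for a ranking model, and the brute-force dot-product retrieval (a real system would typically use an approximate nearest-neighbour index).

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recommend(user_vec, item_vecs, rank_score, n_candidates=100, n_results=10):
    """Two-stage recommendation: cheap retrieval, then ranking on the shortlist.

    item_vecs: dict of item_id -> embedding
    rank_score: callable item_id -> score, standing in for the ranking model
    """
    # Stage 1: candidate generation by embedding similarity over all items.
    candidates = heapq.nlargest(
        n_candidates, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )
    # Stage 2: the expensive scorer only sees the retrieved candidates.
    ranked = sorted(candidates, key=rank_score, reverse=True)
    return ranked[:n_results]


# Hypothetical catalogue and ranking scores.
items = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.8, 0.2]}
scores = {"a": 0.1, "b": 0.9, "c": 1.0, "d": 0.5}
top = recommend([1.0, 0.0], items, scores.get, n_candidates=3, n_results=2)
```

Item `c` has the highest ranking score but never reaches the ranker because retrieval filters it out, which shows why candidate recall bounds the quality of the whole system.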

Pattern 2: Fraud Detection

Components needed:
- Real-time feature computation
- Low-latency model serving (<50ms)
- High recall focus (can't miss fraud)
- Explainability for compliance
- Human-in-the-loop review
- Feedback loop for labels

Pattern 3: Search Ranking

Components needed:
- Two-stage ranking (retrieval + ranking)
- Feature store for query/document features
- Low latency (<200ms end-to-end)
- Learning to rank models
- Click-through rate prediction
- A/B testing with interleaving

Estimation for ML Systems

Training Infrastructure

Training time estimation:
- Dataset size: 100M examples
- Model: Transformer (100M params)
- GPU: A100 (80GB, 312 TFLOPS peak BF16)
- Batch size: 32
- Training steps: Dataset / batch = 3.1M steps (one epoch)
- Time per step: ~100ms
- Total time: 3.1M × 100ms ≈ 87 hours on a single GPU (one epoch)
- With 8 GPUs (data parallel, assuming near-linear scaling): ~11 hours
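
The arithmetic above can be written as a small helper so the inputs are easy to vary. The function and its `scaling_efficiency` and `epochs` parameters are assumptions added for illustration. The estimate in the text corresponds to one epoch with perfect linear scaling.

```python
def training_hours(n_examples, batch_size, seconds_per_step,
                   n_gpus=1, scaling_efficiency=1.0, epochs=1):
    """Back-of-envelope wall-clock training time in hours.

    scaling_efficiency is the fraction of ideal data-parallel speedup achieved
    (1.0 = perfectly linear); real clusters are usually below that.
    """
    steps = epochs * n_examples / batch_size
    single_gpu_hours = steps * seconds_per_step / 3600
    return single_gpu_hours / (n_gpus * scaling_efficiency)


single_gpu = training_hours(100_000_000, 32, 0.1)              # about 87 hours
eight_gpus = training_hours(100_000_000, 32, 0.1, n_gpus=8)    # about 11 hours
```

Lowering `scaling_efficiency` to, say, 0.8 shows how quickly communication overhead erodes the ideal 8x speedup.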

Serving Infrastructure

Inference estimation:
- QPS: 10,000
- Model latency: 20ms
- Batch size: 1 (real-time)
- GPU utilization: 50% (latency constraint)
- Requests per GPU/sec: (1000ms / 20ms) × 50% = 25
- GPUs needed: 10,000 / 25 = 400 GPUs
- With batching (batch 8): 100 GPUs (4x reduction, which assumes a batch of 8 takes ~40ms rather than 20ms)
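
The same capacity calculation as a helper function. The names are hypothetical, and the 40 ms latency used for a batch of 8 is an assumption chosen because it reproduces the 4x reduction stated above. If batching added no latency at all, the reduction would be 8x.

```python
from math import ceil


def gpus_needed(qps, batch_latency_ms, batch_size=1, utilization=0.5):
    """GPUs required to sustain a target QPS.

    Each GPU finishes batch_size requests every batch_latency_ms, and is only
    kept at `utilization` of that peak to leave latency headroom.
    """
    requests_per_gpu_per_sec = batch_size * 1000 / batch_latency_ms * utilization
    return ceil(qps / requests_per_gpu_per_sec)


unbatched = gpus_needed(10_000, batch_latency_ms=20)                # batch size 1
batched = gpus_needed(10_000, batch_latency_ms=40, batch_size=8)    # assumed 40 ms per batch
```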

Related Skills

  • llm-serving-patterns - LLM-specific serving and optimization
  • rag-architecture - Retrieval-Augmented Generation patterns
  • vector-databases - Vector search and embeddings
  • ml-inference-optimization - Latency and cost optimization
  • estimation-techniques - Back-of-envelope calculations
  • quality-attributes-taxonomy - NFR definitions

Related Commands

  • /sd:ml-pipeline <problem> - Design ML system interactively
  • /sd:estimate <scenario> - Capacity calculations

Related Agents

  • ml-systems-designer - Design ML architectures
  • ml-interviewer - Mock ML system design interviews

Version History

  • v1.0.0 (2025-12-26): Initial release

Last Updated

Date: 2025-12-26
Model: claude-opus-4-5-20251101