System Design Analysis

Analyze distributed system designs for scalability, reliability, performance, and security. Produce structured review documents with gaps and actionable recommendations.

Core Principle

System design is about trade-offs, not perfect answers. Every recommendation must consider context: access patterns, scale requirements, consistency needs, and operational constraints.

Workflow

Phase 1: Information Gathering

Ask focused questions to understand the system. Prioritize these areas:

Functional scope: What does the system do? Core operations?
Scale: Expected QPS, data volume, user count?
Access patterns: Read-heavy vs write-heavy? Hot spots?
Consistency requirements: Strong vs eventual? Where?
Availability targets: SLA requirements? Acceptable downtime?
Current architecture: Existing components, databases, services?
Known pain points: What's broken or struggling today?

Limit to 3-5 questions per message. When information is missing, make reasonable assumptions and document each one explicitly.
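The scale answers can be turned into rough capacity numbers early, which makes the assumptions visible. The following is a minimal sketch: the function name, its parameters, the flat peak factor, and the sample inputs are all illustrative assumptions, not values from any reviewed system.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class ScaleEstimate:
    avg_qps: float
    peak_qps: float
    storage_gb_per_year: float


def estimate_scale(daily_active_users: int,
                   requests_per_user_per_day: float,
                   write_fraction: float,
                   bytes_per_write: int,
                   peak_factor: float = 3.0) -> ScaleEstimate:
    """Back-of-envelope numbers; every input is an assumption to document."""
    daily_requests = daily_active_users * requests_per_user_per_day
    avg_qps = daily_requests / SECONDS_PER_DAY
    # Peak traffic is modeled as a flat multiple of the average (assumed).
    peak_qps = avg_qps * peak_factor
    daily_write_bytes = daily_requests * write_fraction * bytes_per_write
    storage_gb_per_year = daily_write_bytes * 365 / 1e9
    return ScaleEstimate(avg_qps, peak_qps, storage_gb_per_year)


# Hypothetical inputs chosen only to illustrate the arithmetic.
estimate = estimate_scale(
    daily_active_users=1_000_000,
    requests_per_user_per_day=86.4,
    write_fraction=0.1,
    bytes_per_write=1_000,
)
```

Each argument maps to an entry in the "Assumed Requirements" section of the analysis document, so a reader can challenge any input and re-run the numbers.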

Phase 2: Topic Analysis

Based on gathered information, analyze relevant system design topics. Load reference files as needed:

| Topic | Reference File | When to Load |
|-------|----------------|--------------|
| Load balancing | load-balancing.md | Traffic distribution, L4/L7 decisions |
| Caching | caching.md | Latency optimization, read scaling |
| Databases | databases.md | Data modeling, SQL vs NoSQL choices |
| CAP & Consistency | cap-consistency.md | Consistency model decisions |
| Sharding | sharding-partitioning.md | Write/storage scaling |
| Replication | replication.md | Availability, read scaling |
| Message queues | message-queues.md | Async processing, decoupling |
| Rate limiting | rate-limiting.md | Traffic protection, abuse prevention |
| Auth | auth.md | Security, identity management |
| Resilience | resilience-patterns.md | Failure handling, fault tolerance |
| Monitoring | monitoring-observability.md | Observability, debugging |

Load only topics relevant to the specific system under review.

Phase 3: Document Generation

Produce a structured analysis document with these sections:

# System Design Analysis: [System Name]

## 1. Abstract
Brief summary of the system and analysis scope (2-3 paragraphs).

## 2. Requirements

### 2.1 Stated Requirements
Requirements explicitly provided by the user.

### 2.2 Assumed Requirements
Reasonable assumptions with rationale. Format:
- **Assumption**: [what was assumed]
- **Rationale**: [why this is reasonable]

## 3. Current System Review
Analysis of existing architecture against requirements. Organize by topic area.

## 4. Gaps
Identified issues, risks, or missing capabilities. Prioritize by impact:
- **Critical**: System failures, data loss risks
- **High**: Performance bottlenecks, scalability limits
- **Medium**: Operational inefficiencies, maintainability issues
- **Low**: Nice-to-have improvements

## 5. Recommendations
Actionable improvements with:
- **Problem addressed**: Which gap(s) this solves
- **Recommendation**: Specific technical approach
- **Example**: Concrete implementation guidance
- **Trade-offs**: What you gain vs what you sacrifice
- **Impact**: Expected improvement if implemented

Analysis Checklist

For each relevant topic, evaluate:

Load Balancing

[ ] Algorithm appropriate for workload (round robin, least connections, consistent hashing)?
[ ] L4 vs L7 appropriate for use case?
[ ] LB itself highly available?
[ ] Health checks configured?
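When consistent hashing comes up in a review, a small model helps explain why it limits key movement when a backend leaves. This is a minimal sketch, assuming SHA-256 as the hash and 100 virtual nodes per backend; the class name and node names are hypothetical and do not describe any particular load balancer's implementation.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes (illustrative, not production-ready)."""

    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) pairs
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node: str) -> None:
        # Each node owns many points on the ring to smooth the distribution.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node]

    def get(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

Removing a node only remaps the keys that node owned; keys on other nodes stay where they were, which is the property to check for when session affinity or cache locality matters.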

Caching

[ ] Cache strategy defined (cache-aside, write-through)?
[ ] Eviction policy appropriate (LRU, TTL)?
[ ] Cache invalidation strategy?
[ ] Hot key and cache stampede handling?
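The cache-aside, TTL, invalidation, and stampede items can be illustrated together. The sketch below is an in-process stand-in under stated assumptions: `loader` is a hypothetical function that reads from the source of truth, and per-key locks approximate single-flight loading. A shared cache such as Redis would need a different locking approach.

```python
import threading
import time


class CacheAside:
    """In-process cache-aside with TTL and per-key locks (single-flight)."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # hypothetical: key -> value from the source of truth
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}            # key -> (value, expires_at)
        self._locks = {}           # key -> lock guarding the loader call
        self._guard = threading.Lock()

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the entry while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            self._data[key] = (value, self._clock() + self._ttl)
            return value

    def invalidate(self, key):
        # Call after writing to the source of truth.
        self._data.pop(key, None)
```

With this shape, concurrent misses on the same hot key trigger one loader call instead of one per caller. The sketch never evicts on size, so a real deployment would still need an eviction policy such as LRU.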

Databases

[ ] Data model matches access patterns?
[ ] Indexes support critical queries?
[ ] Read-heavy vs write-heavy considered?
[ ] Appropriate SQL vs NoSQL choice?

CAP & Consistency

[ ] Consistency model matches business requirements?
[ ] Trade-offs between C and A explicit?
[ ] Read-your-writes where needed?

Sharding

[ ] Shard key distributes load evenly?
[ ] Hot partitions addressed?
[ ] Cross-shard operations minimized?
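Whether a shard key distributes load evenly can be checked against a sample of real keys before committing to it. A minimal sketch, assuming hash-based sharding with SHA-256; `shard_for` and `skew` are hypothetical helper names, and the key sample would come from production access logs.

```python
import hashlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard by hashing (assumed scheme, for illustration)."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def skew(keys, shard_count: int) -> float:
    """Busiest shard's load divided by the mean load; 1.0 is perfectly even."""
    counts = Counter(shard_for(k, shard_count) for k in keys)
    mean = len(keys) / shard_count
    return max(counts.values()) / mean
```

Feeding `skew` one entry per request rather than one per distinct key exposes hot partitions: if a single tenant accounts for most requests, sharding by tenant ID pushes the ratio toward `shard_count`.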

Replication

[ ] Sync vs async replication appropriate?
[ ] Replica lag acceptable?
[ ] Leader election mechanism defined?
[ ] Split-brain prevention?

Message Queues

[ ] Delivery guarantees appropriate?
[ ] Consumer idempotency?
[ ] Dead-letter queue for failures?
[ ] Backpressure handling?
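Consumer idempotency and dead-letter handling are easier to review against a concrete shape. The sketch below assumes at-least-once delivery and a unique `message_id` per message; the class, the in-memory dedup set, and the in-memory dead-letter queue are hypothetical stand-ins for a durable store and a real broker feature.

```python
from collections import deque


class IdempotentConsumer:
    """Deduplicates by message ID and dead-letters after repeated failures."""

    def __init__(self, handler, max_attempts=3):
        self._handler = handler       # hypothetical business logic: payload -> None
        self._max_attempts = max_attempts
        self._processed = set()       # stand-in for a durable dedup store
        self.dead_letters = deque()   # stand-in for a broker-managed DLQ

    def consume(self, message_id, payload):
        """Return 'processed', 'duplicate', or 'dead-lettered'."""
        if message_id in self._processed:
            return "duplicate"        # redelivery: safe to acknowledge
        last_error = None
        for _ in range(self._max_attempts):
            try:
                self._handler(payload)
            except Exception as exc:
                last_error = exc
                continue
            self._processed.add(message_id)
            return "processed"
        self.dead_letters.append((message_id, payload, repr(last_error)))
        return "dead-lettered"
```

The handler's side effects and the dedup record are not written atomically here. In a review, ask how the real system closes that gap, for example by committing both in one database transaction.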

Rate Limiting

[ ] Algorithm chosen (token bucket recommended)?
[ ] Limits appropriate for different tiers?
[ ] Distributed enforcement for multi-node?
[ ] Graceful handling of limit breaches?
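For reference when discussing the token bucket, here is a minimal single-process sketch. The class name and parameters are illustrative assumptions. It is not thread-safe and does not cover distributed enforcement, which would need shared state such as a central counter store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained rate of `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)   # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

Different tiers map to different `rate_per_sec` and `capacity` values per client. When `allow` returns `False`, the caller decides how to handle the breach gracefully, for example by rejecting with a retry hint rather than dropping the connection.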

Authentication & Authorization

[ ] AuthN mechanism appropriate (JWT, sessions)?
[ ] Token lifecycle managed (expiry, refresh)?
[ ] AuthZ model defined (RBAC, ABAC)?
[ ] Service-to-service auth?

Resilience

[ ] Timeouts on all external calls?
[ ] Retry strategy with backoff?
[ ] Circuit breakers for unstable dependencies?
[ ] Graceful degradation paths?
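Retries with backoff and circuit breakers interact, so it helps to see them side by side. This is a simplified sketch under stated assumptions: the thresholds, delays, and names are hypothetical, the breaker counts consecutive failures only, and it is not thread-safe.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Otherwise half-open: let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures in a row, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(fn, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry fn with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has isolated
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

A typical composition is `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is a hypothetical call that already enforces its own timeout. Catching `CircuitOpenError` at the call site is where a graceful degradation path, such as serving a cached or default response, would go.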

Monitoring

[ ] Golden signals tracked (latency, traffic, errors, saturation)?
[ ] Distributed tracing for request flows?
[ ] Structured logging?
[ ] Alerts tied to SLOs, not raw metrics?
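Tying alerts to SLOs usually means alerting on how fast the error budget is being spent rather than on raw error counts. A minimal sketch, assuming an availability SLO and two evaluation windows; the function names, the default target, and the 14.4 threshold are illustrative assumptions to tune per system.

```python
def burn_rate(error_count, request_count, slo_target):
    """Rate at which the error budget is consumed; 1.0 means exactly on budget."""
    if request_count == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (error_count / request_count) / error_budget


def should_page(short_window, long_window, slo_target=0.999, threshold=14.4):
    """Page only if both windows burn fast, which filters out brief blips.

    Each window is an (error_count, request_count) pair. A threshold of 14.4
    corresponds to spending 2% of a 30-day budget in one hour (0.02 * 720).
    """
    return (burn_rate(*short_window, slo_target) >= threshold
            and burn_rate(*long_window, slo_target) >= threshold)
```

Requiring both windows to exceed the threshold means a short spike that has already recovered does not page anyone, which addresses the alert fatigue anti-pattern.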

Common Anti-Patterns to Flag

No caching strategy: "Just add Redis" without an invalidation plan
Wrong database choice: Forcing SQL for graph data or NoSQL for transactions
Ignoring partition tolerance: Designing as if the network never fails
Naive sharding: Choosing shard key without considering access patterns
Synchronous everything: No async processing for non-critical paths
Alert fatigue: Alerting on every error instead of user impact
Missing rate limiting: No protection against traffic spikes
Stateless assumption violations: Session stickiness breaking horizontal scaling