Chaos Experiment Designer
Design rigorous chaos engineering experiments that build confidence in system resilience.
Triggers
- "chaos experiment"
- "test resilience"
- "failure injection"
- "resilience testing"
- "game day"
- "chaos engineering"
Quick Reference
| Phase | Purpose | Output |
|-------|---------|--------|
| 1. Scope | Define system boundaries and objectives | System under test, success criteria |
| 2. Baseline | Establish steady state metrics | Quantified normal behavior |
| 3. Hypothesis | Form falsifiable hypothesis | Clear prediction statement |
| 4. Injection | Design failure scenarios | Injection plan with blast radius |
| 5. Execute | Run controlled experiment | Observation log |
| 6. Analyze | Compare actual vs expected | Findings and action items |
Core Principles
Chaos engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent conditions in production.
The Five Principles
- Steady State Focus: Measure observable outputs (throughput, error rates, latency percentiles), not internal metrics
- Real-World Variables: Introduce disruptions that simulate actual failure modes
- Production Testing: Experiment on live systems with real traffic patterns
- Continuous Automation: Build experiments into CI/CD pipelines
- Blast Radius Containment: Minimize customer impact through careful scoping
Process
Phase 1: Scope Definition
Define the experiment boundaries.
Inputs: System architecture, historical incidents, monitoring data
Questions to Answer:
- What system or subsystem will we test?
- What is our business justification for this experiment?
- Who are the stakeholders and who must approve?
- What is the maximum acceptable customer impact?
- What time window is safest for execution?
Output: Scoped experiment definition with stakeholder sign-off
Phase 2: Establish Baseline
Quantify normal system behavior.
Collect Steady State Metrics:
| Metric Category | Examples | Collection Period |
|-----------------|----------|-------------------|
| Throughput | Requests/second, transactions/minute | 7-30 days |
| Error Rates | 4xx rate, 5xx rate, exception count | 7-30 days |
| Latency | P50, P95, P99 response times | 7-30 days |
| Resource | CPU%, Memory%, Disk I/O, Network I/O | 7-30 days |
| Business | Orders/hour, active sessions, conversion rate | 7-30 days |
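A baseline for one metric can be summarized with the standard library alone. This is a minimal sketch: the function name `summarize_baseline` and the returned keys are illustrative assumptions, not part of the bundled scripts, and the samples are assumed to be already exported from your monitoring system.

```python
import statistics


def summarize_baseline(samples: list[float]) -> dict[str, float]:
    """Summarize one metric's samples (for example, response times in ms).

    Hypothetical helper: the name and the output keys are assumptions.
    """
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples, n=100)
    return {
        "mean": statistics.fmean(samples),
        "stdev": statistics.stdev(samples),  # sample standard deviation
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

The mean and standard deviation feed the tolerance thresholds, while the percentiles give the latency figures a hypothesis can refer to.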
Define Tolerance Thresholds:
- Green: Within normal variance (within 1 standard deviation of baseline)
- Yellow: Elevated but acceptable (between 1 and 2 standard deviations from baseline)
- Red: Unacceptable degradation (more than 2 standard deviations from baseline)
Output: Baseline document with metric values and thresholds
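The Green/Yellow/Red bands can be expressed as a small classifier. This is a sketch under stated assumptions: the function name `classify` is hypothetical, deviations are treated symmetrically, and the baseline needs at least two samples so that a standard deviation exists.

```python
import statistics


def classify(value: float, baseline: list[float]) -> str:
    """Map a reading to green/yellow/red by its distance from the baseline mean.

    Hypothetical helper; treats deviations above and below the mean alike.
    """
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline: any change at all is outside tolerance.
        return "green" if value == mean else "red"
    deviations = abs(value - mean) / stdev
    if deviations <= 1:
        return "green"
    if deviations <= 2:
        return "yellow"
    return "red"
```

For one-sided metrics such as latency, where only increases matter, you may prefer to drop the `abs()` and classify improvements as green.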
Phase 3: Form Hypothesis
Create a falsifiable hypothesis.
Hypothesis Template:
Given [system in steady state],
When [specific failure is injected],
Then [system behavior remains within tolerance]
Because [specific resilience mechanism exists].
Example Hypotheses:
- "Given our API gateway in steady state, when we terminate 50% of backend instances, then P99 latency remains under 500ms because auto-scaling will provision replacements within 60 seconds."
- "Given our payment service in steady state, when we introduce 500ms network latency to the database, then order completion rate remains above 99% because connection pooling and retry logic handle transient delays."
Hypothesis Quality Checklist:
- [ ] Specific failure mode identified
- [ ] Quantifiable success criteria defined
- [ ] Underlying resilience mechanism named
- [ ] Timeframe for expected recovery stated
Output: Documented hypothesis with measurable predictions
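The Given/When/Then/Because template maps naturally onto a small data structure. The sketch below is illustrative only: the class `Hypothesis` and its field names are assumptions, and the check covers only whether each part is filled in, not whether it is quantifiable.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical container for the four template parts."""

    steady_state: str       # Given ...
    failure: str            # When ...
    expected_behavior: str  # Then ...
    mechanism: str          # Because ...

    def render(self) -> str:
        """Return the hypothesis as a single sentence."""
        return (
            f"Given {self.steady_state}, "
            f"when {self.failure}, "
            f"then {self.expected_behavior} "
            f"because {self.mechanism}."
        )

    def missing_fields(self) -> list[str]:
        """Names of template parts that are empty or whitespace only."""
        return [name for name, value in vars(self).items() if not value.strip()]
```

A non-empty result from `missing_fields()` signals that the hypothesis is not yet ready for the quality checklist.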
Phase 4: Design Injection Plan
Plan the controlled failure injection.
Common Failure Categories:
| Category | Examples | Tools |
|----------|----------|-------|
| Instance Failure | Kill process, terminate VM, evict pod | chaos-monkey, kill, kubectl delete |
| Network | Partition, latency, packet loss, DNS failure | tc, iptables, toxiproxy, chaos-mesh |
| Resource Exhaustion | CPU spike, memory pressure, disk fill | stress-ng, dd, memory hogs |
| Dependency | External service unavailable, slow response | fault injection proxy, mock services |
| Time | Clock skew, NTP failure | faketime, chrony manipulation |
| State | Data corruption, cache invalidation | Custom scripts |
Injection Plan Elements:
- Failure Type: Precise description of what will be broken
- Injection Method: Tool and exact commands to use
- Scope: Which instances/services/regions affected
- Duration: How long the failure persists
- Ramp-up: Gradual vs immediate injection
- Rollback: How to instantly restore normal operation
Blast Radius Containment:
- Start with smallest possible scope (single instance)
- Use canary deployment pattern for experiments
- Define automatic abort criteria
- Have rollback ready before starting
- Notify on-call before and after
Output: Detailed injection plan with rollback procedures
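Automatic abort criteria are easier to review when written down as data. The following is a minimal sketch, not a prescribed implementation: `AbortCriterion`, `should_abort`, the metric names, and the limits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbortCriterion:
    """Hypothetical abort rule: one metric and the limit it must not cross."""

    metric: str
    limit: float
    higher_is_worse: bool = True  # False for metrics like completion rate

    def breached(self, value: float) -> bool:
        return value > self.limit if self.higher_is_worse else value < self.limit


def should_abort(criteria: list[AbortCriterion], readings: dict[str, float]) -> list[str]:
    """Return the metrics whose current reading breaches its limit.

    Simplification: metrics missing from `readings` are skipped. A stricter
    policy would treat lost visibility as a reason to abort.
    """
    return [
        c.metric
        for c in criteria
        if c.metric in readings and c.breached(readings[c.metric])
    ]
```

Any non-empty result should trigger the rollback procedure immediately.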
Phase 5: Execute Experiment
Run the controlled experiment.
Pre-Execution Checklist:
- [ ] Stakeholders notified
- [ ] On-call team aware
- [ ] Monitoring dashboards ready
- [ ] Rollback procedure tested
- [ ] Customer support briefed (for production)
- [ ] Automatic abort criteria configured
During Execution:
- Record experiment start timestamp
- Monitor all baseline metrics in real-time
- Log observations with timestamps
- If abort criteria met, execute rollback immediately
- Record experiment end timestamp
Observation Log Format:
[HH:MM:SS] - [Metric/Event]: [Value/Description]
[00:00:00] - Experiment started: Injected 500ms latency to database connection
[00:00:15] - P99 latency: 450ms -> 650ms
[00:00:30] - Circuit breaker: OPEN on database connection pool
[00:01:00] - Retry queue depth: 0 -> 247
[00:01:30] - Auto-recovery: initiated
[00:02:00] - P99 latency: 650ms -> 480ms
[00:02:30] - Circuit breaker: CLOSED
[00:03:00] - Experiment ended: Removed latency injection
Output: Timestamped observation log
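Log entries in the format above can be produced consistently with a small formatter. This is a sketch that assumes timestamps are elapsed time since the experiment started, as in the sample log; the function name `format_entry` is hypothetical.

```python
def format_entry(elapsed_seconds: int, label: str, detail: str) -> str:
    """Format one observation as '[HH:MM:SS] - label: detail'.

    Hypothetical helper; `elapsed_seconds` counts from experiment start.
    """
    hours, remainder = divmod(elapsed_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"[{hours:02d}:{minutes:02d}:{seconds:02d}] - {label}: {detail}"
```

For example, `format_entry(15, "P99 latency", "450ms -> 650ms")` yields the second line of the sample log.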
Phase 6: Analyze Results
Compare actual behavior against hypothesis.
Analysis Questions:
- Did system behavior stay within tolerance thresholds?
- Did resilience mechanisms activate as expected?
- What was the actual recovery time?
- Were there any unexpected cascading effects?
- Did monitoring and alerting work correctly?
Verdict Options:
| Verdict | Meaning | Action |
|---------|---------|--------|
| VALIDATED | Hypothesis confirmed | Document and expand scope |
| INVALIDATED | Hypothesis falsified | File bugs, prioritize fixes |
| INCONCLUSIVE | Unable to determine | Refine experiment design |
Finding Categories:
- Resilience Strengths: Mechanisms that worked as designed
- Weaknesses Discovered: Gaps in resilience that need fixing
- Monitoring Gaps: Missing visibility during incident
- Documentation Gaps: Runbooks or procedures that need updating
- Unexpected Behaviors: System responses not predicted
Output: Analysis document with prioritized action items
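For a hypothesis with a single numeric threshold, the verdict can be derived mechanically from the observation log. The sketch below is one possible rule, not the only one: `decide_verdict` is a hypothetical name, and it judges the worst observed value, which is stricter than judging the final value.

```python
from enum import Enum


class Verdict(Enum):
    VALIDATED = "VALIDATED"
    INVALIDATED = "INVALIDATED"
    INCONCLUSIVE = "INCONCLUSIVE"


def decide_verdict(
    observed: list[float], threshold: float, higher_is_worse: bool = True
) -> Verdict:
    """Compare the worst observed value against the hypothesis threshold.

    Hypothetical rule: no observations means the experiment is inconclusive.
    """
    if not observed:
        return Verdict.INCONCLUSIVE
    if higher_is_worse:
        within_tolerance = max(observed) <= threshold
    else:
        within_tolerance = min(observed) >= threshold
    return Verdict.VALIDATED if within_tolerance else Verdict.INVALIDATED
```

Applied to the sample log's P99 readings (450, 650, 480) with a 500ms threshold, this rule returns `INVALIDATED` because latency peaked above the limit, even though it recovered afterwards.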
Scripts
| Script | Purpose | Usage |
|--------|---------|-------|
| generate_experiment.py | Create experiment document from inputs | python scripts/generate_experiment.py --name "API Gateway Resilience" |
| validate_experiment.py | Validate experiment document completeness | python scripts/validate_experiment.py path/to/experiment.md |
Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General failure |
| 2 | Invalid arguments |
| 10 | Validation failure (missing required sections) |
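A caller that wraps the scripts can mirror these codes in an enum. This is a sketch for calling code only: `ExitCode` and `describe` are hypothetical names, and nothing here is taken from the scripts' internals.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    """Exit codes from the table above (member names are assumptions)."""

    SUCCESS = 0
    GENERAL_FAILURE = 1
    INVALID_ARGUMENTS = 2
    VALIDATION_FAILURE = 10


def describe(code: int) -> str:
    """Turn a numeric exit code into a short human-readable label."""
    try:
        return ExitCode(code).name.replace("_", " ").capitalize()
    except ValueError:
        return f"Unknown exit code {code}"
```

The return code of a `subprocess.run(...)` call on either script can be passed to `describe()` when reporting failures.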
Output Directory
Experiments are saved to: .agents/chaos/
.agents/chaos/
  YYYY-MM-DD-experiment-name.md
  YYYY-MM-DD-experiment-name-results.md
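The naming pattern can be reproduced in code when other tooling needs to locate experiment files. This is a sketch under an assumption: the slug rule (lowercase, non-alphanumeric runs replaced by hyphens) is a guess and may differ from what `generate_experiment.py` actually does. The function name `experiment_paths` is hypothetical.

```python
import re
from datetime import date
from pathlib import Path


def experiment_paths(
    name: str, day: date, root: Path = Path(".agents/chaos")
) -> tuple[Path, Path]:
    """Return (experiment file, results file) for a given name and date.

    Assumed slug rule: lowercase, runs of other characters become '-'.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    stem = f"{day.isoformat()}-{slug}"
    return root / f"{stem}.md", root / f"{stem}-results.md"
```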
Anti-Patterns
| Avoid | Why | Instead |
|-------|-----|---------|
| Testing in staging only | Production has different traffic patterns | Start small in production |
| No rollback plan | Cannot recover if things go wrong | Define rollback before starting |
| Vague hypothesis | Cannot determine success | Use quantifiable predictions |
| Measuring internal metrics only | Do not reflect customer experience | Focus on observable outputs |
| Big bang experiments | Blast radius too large | Start with smallest scope |
| No baseline | Cannot compare results | Collect 7+ days of metrics first |
| Skipping stakeholder buy-in | Creates political problems | Get approval before execution |
Templates
Experiment Document Template
Use templates/experiment-template.md or generate with:
python scripts/generate_experiment.py \
  --name "Database Failover Resilience" \
  --system "Payment Service" \
  --owner "Jane Smith" \
  --output .agents/chaos/
Verification Checklist
Before executing any chaos experiment:
- [ ] Scope clearly defined with business justification
- [ ] Baseline metrics collected (minimum 7 days)
- [ ] Hypothesis is falsifiable with quantifiable criteria
- [ ] Injection plan includes specific tools and commands
- [ ] Blast radius is contained to acceptable scope
- [ ] Rollback procedure is documented and tested
- [ ] Stakeholders have approved the experiment
- [ ] On-call team is aware of timing
- [ ] Monitoring dashboards are ready
- [ ] Results template is prepared
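A checklist like the one above can be gated automatically by scanning the experiment document for unchecked boxes. This is a minimal sketch, assuming completed items are marked `- [x]` (the usual task-list convention, not something this document specifies); `unchecked_items` is a hypothetical name and is not how `validate_experiment.py` is known to work.

```python
import re

# Matches lines such as "- [ ] Rollback procedure is documented and tested".
UNCHECKED = re.compile(r"^\s*- \[ \] (.+)$", re.MULTILINE)


def unchecked_items(markdown: str) -> list[str]:
    """Return the text of every checklist item that is still unchecked."""
    return [item.strip() for item in UNCHECKED.findall(markdown)]
```

An empty list means every item has been ticked; anything else is a reason to postpone execution.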
Extension Points
- Failure Categories: Add new failure types to Phase 4 table
- Tools Integration: Extend scripts to integrate with chaos-mesh, Gremlin, LitmusChaos
- Automation: Integrate with CI/CD for continuous chaos testing
- Metrics Sources: Add integrations for Prometheus, Datadog, New Relic
- Scheduling: Add calendar integration for recurring game days
Related Resources
- Principles of Chaos Engineering
- Chaos Monkey (Netflix)
- Chaos Mesh (CNCF)
- LitmusChaos (CNCF)
- Gremlin (Commercial)
Related Skills
| Skill | Relationship |
|-------|--------------|
| security | Security review for production experiments |
| devops | CI/CD integration for automated chaos |
| qa | Test strategy alignment |
| analyst | Root cause analysis of findings |