Systematic Debugging

Random fixes waste time. Quick patches mask issues.

Core principle: ALWAYS find root cause before fixes. Symptom fixes are failure.

The Iron Law

NO FIXES WITHOUT ROOT CAUSE INVESTIGATION

When to Use

ANY technical issue: test failures, bugs, unexpected behavior, performance problems, build failures, integration issues.

ESPECIALLY when:

Under time pressure
"Just one quick fix" seems obvious
Already tried multiple fixes
Previous fix didn't work
Don't fully understand issue

Don't skip when:

Seems simple (simple bugs have root causes)
You're hurrying (systematic is faster than thrashing)
Manager wants NOW (systematic prevents rework)

Four Phases

Phase 1: Root Cause Investigation

For test failures, check flakiness FIRST:

Test fails → Run 5x
├─ Passes 5/5: Not flaky, investigate as bug
├─ Fails 5/5: Consistent, investigate as bug
└─ Mixed (e.g. 3/5 pass): FLAKY TEST - fix test first
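
The decision tree above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: `run_repeatedly` and `classify` are hypothetical helper names, and a "test" is assumed to be any callable that raises on failure.

```python
from typing import Callable, List


def run_repeatedly(test: Callable[[], None], runs: int = 5) -> List[bool]:
    """Run `test` `runs` times; True means the run passed (no exception raised)."""
    results = []
    for _ in range(runs):
        try:
            test()
            results.append(True)
        except Exception:
            results.append(False)
    return results


def classify(results: List[bool]) -> str:
    """Map pass/fail outcomes onto the three branches of the decision tree."""
    passes = sum(results)
    if passes == len(results):
        return "passes every run: not flaky, investigate as bug"
    if passes == 0:
        return "fails every run: consistent, investigate as bug"
    return f"mixed ({passes}/{len(results)} passed): flaky, fix test first"
```

With a real test runner the same idea usually means invoking the single failing test five times and feeding the outcomes to something like `classify`.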

Flaky test checklist:

| Check                 | How                      | Fix                     |
| --------------------- | ------------------------ | ----------------------- |
| Isolated/connected?   | Run single vs suite      | State pollution         |
| Timing-dependent?     | Look for timeouts/sleeps | Condition-based waiting |
| Environment-specific? | CI vs local              | Mock env vars           |
| Order-dependent?      | Different order          | Setup/teardown          |
| Race condition?       | Async without waits      | Proper async/await      |
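
As an illustration of the "Condition-based waiting" fix, the sketch below polls a condition instead of sleeping for a fixed time. `wait_until` is a hypothetical helper, and the default timeout and interval are arbitrary assumptions to tune per test suite.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 2.0,
               interval: float = 0.01) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()  # one final check at the deadline
```

A test that previously did `time.sleep(0.5)` followed by an assertion could instead assert `wait_until(lambda: job.done)`, where `job` stands in for whatever object the test is waiting on. The test then finishes as soon as the condition holds and fails only after the full timeout.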

Then continue:

Read Errors Carefully - Stack traces, line numbers, error codes
Reproduce Consistently - Exact steps, happens every time?
Check Recent Changes - Git diff, dependencies, config
Multi-Component Systems - Add diagnostic instrumentation at boundaries BEFORE proposing fixes
Trace Data Flow - Where does bad value originate? (See pop-root-cause-tracing)
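
For multi-component systems, boundary instrumentation might look like the following sketch. It uses only the standard `logging` and `functools` modules. The decorator name `trace_boundary` and the two pipeline functions are hypothetical examples, not part of any referenced skill.

```python
import functools
import logging

logger = logging.getLogger("boundary")


def trace_boundary(name):
    """Log arguments, result, and exceptions for calls crossing a component boundary."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            logger.debug("%s -> args=%r kwargs=%r", name, args, kwargs)
            try:
                result = func(*args, **kwargs)
            except Exception:
                logger.exception("%s raised", name)
                raise
            logger.debug("%s <- result=%r", name, result)
            return result
        return wrapper
    return decorator


# Hypothetical two-stage pipeline, instrumented at each boundary.
@trace_boundary("parser")
def parse_price(raw: str) -> float:
    return float(raw.replace(",", ""))


@trace_boundary("billing")
def apply_tax(price: float, rate: float = 0.2) -> float:
    return round(price * (1 + rate), 2)
```

Calling `logging.basicConfig(level=logging.DEBUG)` before a reproduction run makes every boundary crossing visible, so the log shows which component first receives or produces the bad value before any fix is proposed.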

Phase 2: Pattern Analysis

Find Working Examples - Similar code that works
Compare References - Read reference implementations COMPLETELY
Identify Differences - List ALL differences
Understand Dependencies - Config, environment, assumptions
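
"List ALL differences" is easier to do mechanically than by eye. The sketch below compares the settings of a working and a broken setup. `diff_settings` is a hypothetical helper, and it assumes both configurations are available as flat dictionaries with string keys.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a key present on only one side


def diff_settings(working: Dict[str, Any],
                  broken: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return every key whose value differs, as {key: (working_value, broken_value)}."""
    differences = {}
    for key in sorted(set(working) | set(broken)):
        left = working.get(key, MISSING)
        right = broken.get(key, MISSING)
        if left != right:
            differences[key] = (left, right)
    return differences
```

The function reports every difference, including keys that exist on only one side, because a difference that looks irrelevant is often the one that matters.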

Phase 3: Hypothesis & Testing

Form Single Hypothesis - "I think X causes Y because Z"
Test Minimally - Smallest change, one variable
Verify - Worked? → Phase 4. Didn't? → New hypothesis
When Unknown - Say "I don't understand X", ask for help
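
One way to keep a Phase 3 experiment honest is to change exactly one variable against a known-failing baseline. The sketch below assumes the system under test can be summarized as a settings dictionary and a `check` callable that returns True on success. Both are hypothetical stand-ins, as is the name `try_hypothesis`.

```python
from typing import Any, Callable, Dict


def try_hypothesis(baseline: Dict[str, Any],
                   variable: str,
                   new_value: Any,
                   check: Callable[[Dict[str, Any]], bool]) -> str:
    """Change exactly one variable relative to the baseline and report the outcome."""
    if check(baseline):
        return "baseline already passes: reproduce the failure first"
    trial = {**baseline, variable: new_value}  # one change, everything else held fixed
    if check(trial):
        return f"confirmed: changing {variable!r} fixes the failure, go to Phase 4"
    return f"rejected: {variable!r} is not the cause, form a new hypothesis"
```

Because the baseline is copied rather than mutated, each rejected hypothesis leaves the starting point untouched for the next one.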

Phase 4: Implementation

Create Failing Test - Use pop-test-driven-development skill
Implement Single Fix - Address root cause, ONE change
Verify Fix - Test passes, no other tests broken
If Fix Doesn't Work
- STOP. Count fixes tried.
- If < 3: Return to Phase 1 with new info
- If >= 3: STOP. Question architecture (see below)
If 3+ Fixes Failed: Question Architecture
- Does each fix reveal new problems elsewhere?
- Do fixes require "massive refactoring"?
- Is the pattern fundamentally unsound?
- Discuss with user before attempting more fixes
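
The escalation rule for failed fixes is small enough to write down as code. This is only a sketch: `after_failed_fix` and `MAX_FIX_ATTEMPTS` are hypothetical names, and the threshold of 3 is taken from the rule above.

```python
MAX_FIX_ATTEMPTS = 3  # threshold from the rule above


def after_failed_fix(fixes_tried: int) -> str:
    """Return the next action once a fix has failed, given the total fixes tried so far."""
    if fixes_tried < 1:
        raise ValueError("call this only after at least one fix has failed")
    if fixes_tried < MAX_FIX_ATTEMPTS:
        return "return to Phase 1 with new info"
    return "stop: question the architecture and discuss with user"
```

Keeping an explicit count matters because "one more fix" rarely feels like the third attempt when you are in the middle of it.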

Red Flags

STOP if thinking:

"Quick fix for now"
"Just try X and see"
"Add multiple changes"
"Skip test, manually verify"
"It's probably X"
"Don't fully understand but..."
"One more fix" (after 2+)

Any of these → Return to Phase 1

3+ failed fixes → Question architecture

Quick Reference

| Phase         | Key Activities                          | Success               |
| ------------- | --------------------------------------- | --------------------- |
| 1. Root Cause | Read errors, reproduce, gather evidence | Understand WHAT & WHY |
| 2. Pattern    | Find working examples, compare          | Identify differences  |
| 3. Hypothesis | Form theory, test minimally             | Confirmed or new      |
| 4. Implement  | Test, fix, verify                       | Resolved, tests pass  |

Real-World Impact

Systematic: 15-30min to fix, 95% first-time success, near-zero new bugs
Random: 2-3h thrashing, 40% success, common new bugs

Cross-References

Flaky tests: pop-test-driven-development (Condition-Based Waiting)
Root cause tracing: pop-root-cause-tracing (backward tracing)
Defense: pop-defense-in-depth (multi-layer validation)

Examples

See examples/ for:

flaky-test-patterns.md - Common flaky test causes & fixes
debugging-flowchart.pdf - Visual decision tree
multi-component-diagnostic.md - Instrumentation strategy