Skill: Debug Root Cause Analysis

Purpose

Prevent "shotgun debugging", where an agent changes code at random hoping the error disappears. Enforce a scientific-method approach that isolates the root cause before any fix is applied, so that fixing one bug does not introduce another.

1. Negative Knowledge (Anti-Patterns)

2. Verified Debugging Procedure

The Scientific Method for Debugging

1. OBSERVE → Gather evidence (logs, stack traces, user reports)
2. REPRODUCE → Create minimal reproduction case
3. HYPOTHESIZE → Form testable theories about the cause
4. TEST → Verify each hypothesis systematically
5. FIX → Apply minimal fix to root cause
6. VERIFY → Confirm fix resolves issue and doesn't break anything
7. PREVENT → Add regression test
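
The ordering matters as much as the individual steps: a fix applied before a hypothesis has been tested is the shotgun approach this skill is meant to prevent. As an illustration only, the sequence can be modelled as a small guard that rejects out-of-order phases. `DebugSession` and `PHASES` are hypothetical names, not part of this skill's scripts:

```typescript
const PHASES = ["OBSERVE", "REPRODUCE", "HYPOTHESIZE", "TEST", "FIX", "VERIFY", "PREVENT"] as const;
type Phase = (typeof PHASES)[number];

class DebugSession {
  private completed = 0; // number of phases finished so far

  // Marks a phase as done, rejecting any attempt to skip ahead (e.g. FIX before TEST).
  complete(phase: Phase): void {
    if (this.completed >= PHASES.length) {
      throw new Error("All phases are already complete");
    }
    const expected = PHASES[this.completed];
    if (phase !== expected) {
      throw new Error(`Cannot complete ${phase} yet: next phase is ${expected}`);
    }
    this.completed += 1;
  }
}
```

With this guard, calling `complete("FIX")` right after `complete("REPRODUCE")` throws, because HYPOTHESIZE and TEST have not been completed.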

Phase 1: OBSERVE - Gather Evidence

Collect all available information:

# Check application logs
tail -n 100 logs/error.log

# Check recent commits that may have introduced the bug
git log --since="2 days ago" --oneline

# Check for similar errors
grep -r "ErrorMessage" logs/

# Review stack trace carefully
# Note: Line numbers, function names, timestamps

Questions to answer:

When did the bug first appear?
What changed recently (code, dependencies, config)?
Is it consistent or intermittent?
What's the exact error message and stack trace?
What data was being processed when it failed?
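
The stack-trace details worth noting (function name, file, line) can be extracted mechanically. A minimal sketch, assuming V8-style frames of the form `at fn (file:line:col)` or `at file:line:col`; `parseStackFrame` and `StackFrame` are hypothetical names introduced for illustration:

```typescript
interface StackFrame {
  functionName: string | null; // null for anonymous frames such as "at file:line:col"
  file: string;
  line: number;
  column: number;
}

// Hypothetical helper: extracts location details from one V8-style stack frame line.
function parseStackFrame(frameLine: string): StackFrame | null {
  // Matches "at fn (file:line:col)" and "at file:line:col".
  const match = /^\s*at\s+(?:(.+?)\s+\()?(.+?):(\d+):(\d+)\)?\s*$/.exec(frameLine);
  if (!match) {
    return null; // Not a frame line (for example, the error message itself)
  }
  return {
    functionName: match[1] ?? null,
    file: match[2],
    line: Number(match[3]),
    column: Number(match[4]),
  };
}
```

Lines that are not frames, such as the error message on the first line of a trace, return `null` and can be skipped.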

Phase 2: REPRODUCE - Create Minimal Case

Goal: Reliably trigger the bug with minimal setup.

// Example: Creating a minimal reproduction test
describe('Bug Reproduction', () => {
  it('should reproduce the error when...', async () => {
    // Arrange: Set up minimal conditions
    const input = {
      userId: 'test-user',
      amount: -100  // Hypothesis: negative amounts cause crash
    };

    // Act & Assert: Verify the bug occurs
    await expect(
      processPayment(input)
    ).rejects.toThrow('Cannot process negative amount');
  });
});

If you cannot reproduce:

Verify you have the same environment (Node version, dependencies)
Check if it's environment-specific (production vs development)
Look for missing configuration or state
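
When a bug appears in only one environment, comparing the two environments field by field narrows the search. A minimal sketch, assuming each environment has been captured as a flat record of strings (Node version, dependency versions, config values); `diffEnvironments` and the sample keys are hypothetical:

```typescript
type EnvSnapshot = { [key: string]: string };

interface EnvDifference {
  key: string;
  working: string | undefined; // undefined means the key is absent in that environment
  failing: string | undefined;
}

// Returns every key whose value differs between the two snapshots, sorted by key.
function diffEnvironments(working: EnvSnapshot, failing: EnvSnapshot): EnvDifference[] {
  const keys = Object.keys(working)
    .concat(Object.keys(failing).filter((key) => !(key in working)))
    .sort();
  const differences: EnvDifference[] = [];
  for (const key of keys) {
    if (working[key] !== failing[key]) {
      differences.push({ key, working: working[key], failing: failing[key] });
    }
  }
  return differences;
}

const differences = diffEnvironments(
  { node: "20.11.0", LOG_LEVEL: "debug", DATABASE_URL: "postgres://localhost/dev" },
  { node: "18.19.0", LOG_LEVEL: "debug" },
);
// differences contains DATABASE_URL (absent in the failing environment) and node (version mismatch)
```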

Phase 3: HYPOTHESIZE - Form Theories

Generate testable hypotheses based on evidence:

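Analyze the stack trace: Start from the innermost frame (the top of the trace, where the error was thrown) and work outward to the first frame in your own code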
Check recent changes: git diff main...HEAD
Review data flow: Trace how data reaches the failing code
Consider edge cases: Null values, empty arrays, special characters

Example hypothesis formation:

Stack trace shows: TypeError: Cannot read property 'id' of undefined
Location: userService.ts:45

Hypothesis 1: User object is undefined when passed to service
Hypothesis 2: User object exists but missing 'id' property
Hypothesis 3: Async race condition, user not loaded yet

Next step: Add logging at userService.ts:44 to check user object
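
A hypothesis is only useful if an observation can confirm it or rule it out. The sketch below probes what happens when `id` is read from different values, assuming the failing line reads `user.id` directly; `probeIdAccess` is a hypothetical helper, not code from the service:

```typescript
// Hypothetical probe: reads `id` the way the failing line is assumed to, and reports the outcome.
function probeIdAccess(user: unknown): string {
  try {
    const id = (user as { id?: unknown }).id;
    return id === undefined ? "no error, id is undefined" : "no error, id is present";
  } catch (error) {
    return error instanceof TypeError ? "TypeError thrown" : "other error thrown";
  }
}

const results = [
  probeIdAccess(undefined),       // user itself is undefined (hypotheses 1 and 3)
  probeIdAccess({ name: "Ada" }), // user exists but has no id (hypothesis 2)
  probeIdAccess({ id: "u1" }),    // healthy user
];
// results: ["TypeError thrown", "no error, id is undefined", "no error, id is present"]
```

Under that assumption, only an undefined `user` produces the observed TypeError, so hypothesis 2 can likely be deprioritized unless `id` is read from a nested object.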

Phase 4: TEST - Verify Hypothesis

Use the zero-context script to analyze logs:

# Run log analysis to find patterns
python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
  --log-file logs/error.log \
  --error-pattern "TypeError"

Add strategic logging to test hypothesis:

// BEFORE (hypothesis testing)
async function getUserById(id: string) {
  console.log('[DEBUG] getUserById called with:', id);
  const user = await db.users.findOne({ id });
  console.log('[DEBUG] User found:', user);

  if (!user) {
    console.log('[DEBUG] User not found for ID:', id);
    throw new Error('User not found');
  }

  return user;
}

Verify with a test:

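it('should handle missing user gracefully', async () => {
  // A 'User not found' rejection means the missing-user path is handled.
  // If this rejects with a TypeError about 'id' instead, hypothesis 1 is confirmed.
  await expect(getUserById('nonexistent-id')).rejects.toThrow('User not found');
});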

Phase 5: FIX - Apply Minimal Fix

Once root cause is confirmed, apply the minimal fix:

// AFTER (minimal fix: debug logging removed, error message now includes the ID)
async function getUserById(id: string) {
  const user = await db.users.findOne({ id });

  if (!user) {
    throw new Error(`User not found: ${id}`);
  }

  return user;
}

Principles:

Fix ONLY the root cause
Don't refactor during bug fixing
Don't add features while fixing bugs
Keep the fix as simple as possible

Phase 6: VERIFY - Confirm Fix

Verification checklist:

# 1. Run the reproduction test
npm test -- bug-reproduction.test.ts
# Expected: PASS

# 2. Run full test suite
npm test
# Expected: All tests PASS (no regressions)

# 3. Manual verification (if applicable)
npm run dev
# Test the actual user flow that triggered the bug

# 4. Check logs are clean
tail -f logs/error.log
# Expected: No errors when performing the previously failing action

Phase 7: PREVENT - Add Regression Test

Convert your reproduction case into a permanent test:

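// tests/unit/services/UserService.test.ts
// Import UserService from wherever it lives in your project.
describe('UserService.getUserById', () => {
  // Assumed mock shape: only the db.users.findOne call that getUserById uses
  const mockDb = { users: { findOne: jest.fn() } };
  let service: UserService;

  // Shared setup so every test gets a fresh service and a reset mock
  beforeEach(() => {
    mockDb.users.findOne.mockReset();
    service = new UserService(mockDb);
  });

  it('should throw clear error when user not found', async () => {
    mockDb.users.findOne.mockResolvedValue(undefined);

    await expect(
      service.getUserById('nonexistent-id')
    ).rejects.toThrow('User not found: nonexistent-id');
  });

  it('should handle null user gracefully', async () => {
    mockDb.users.findOne.mockResolvedValue(null);

    await expect(
      service.getUserById('test-id')
    ).rejects.toThrow('User not found');
  });
});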

3. Zero-Context Scripts

analyze_logs.py

Located at: .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py

Purpose: Parse error logs for frequency, patterns, and correlations.

Usage:

python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
  --log-file logs/error.log \
  --error-pattern "TypeError|ReferenceError|null" \
  --time-range "last 24h"

Output:

Error Frequency Analysis:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TypeError: Cannot read property 'id': 47 occurrences
  First seen: 2026-01-01 10:23:45
  Last seen:  2026-01-01 14:32:10
  Peak time:  2026-01-01 12:00-13:00 (23 errors)

  Stack trace (most common):
    at userService.ts:45
    at authMiddleware.ts:89

ReferenceError: user is not defined: 3 occurrences
  ...
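
The script's implementation is not shown in this document. Purely to illustrate the kind of aggregation behind a report like this (occurrence counts plus first and last seen), here is a sketch in TypeScript rather than the script's Python. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp followed by the error message, which is an assumed format; `summarizeErrors` is a hypothetical name:

```typescript
interface ErrorSummary {
  message: string;
  count: number;
  firstSeen: string;
  lastSeen: string;
}

// Groups matching log lines by message. Pass a pattern without the `g` flag,
// because a global RegExp keeps state between test() calls.
function summarizeErrors(lines: string[], pattern: RegExp): ErrorSummary[] {
  const byMessage: { [message: string]: ErrorSummary } = Object.create(null);
  for (const line of lines) {
    // Assumed line format: "<date> <time> <message>"
    const match = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.+)$/.exec(line);
    if (!match || !pattern.test(match[2])) {
      continue; // Skip malformed lines and lines outside the requested pattern
    }
    const timestamp = match[1];
    const message = match[2];
    const existing = byMessage[message];
    if (existing) {
      existing.count += 1;
      // Fixed-width timestamps compare correctly as plain strings
      if (timestamp < existing.firstSeen) existing.firstSeen = timestamp;
      if (timestamp > existing.lastSeen) existing.lastSeen = timestamp;
    } else {
      byMessage[message] = { message, count: 1, firstSeen: timestamp, lastSeen: timestamp };
    }
  }
  // Most frequent errors first
  return Object.keys(byMessage)
    .map((key) => byMessage[key])
    .sort((a, b) => b.count - a.count);
}
```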

4. Debugging Workflow for Agents

When you encounter a bug or error:

Stop and Observe:
- Don't immediately change code
- Read the full error message and stack trace
- Note the file, line number, and error type

Gather Context:

# What files are involved?
grep -r "functionName" src/

# What changed recently?
git log --oneline -10

# Are there logs?
ls -la logs/ && tail logs/error.log

Create Reproduction Test:
- Write a failing test that triggers the bug
- Make it as minimal as possible
- Verify it actually fails

Analyze Systematically:
- Use the zero-context script to analyze logs
- Add strategic logging (NOT shotgun console.logs)
- Test one hypothesis at a time

Apply Minimal Fix:
- Fix only the root cause
- Don't refactor or add features
- Keep changes minimal and focused

Verify Thoroughly:
- Run the reproduction test
- Run the full test suite
- Manual verification if needed

Prevent Recurrence:
- Convert reproduction to permanent test
- Document the root cause in comments if complex
- Consider if similar bugs could exist elsewhere

Example Agent Session

User: "The app crashes when I try to delete a user"

Agent: "I'll debug this systematically using root cause analysis."

[OBSERVE]
Agent: Reading error logs...
Found: TypeError: Cannot read property 'posts' of undefined at deleteUser:67

[REPRODUCE]
Agent: Creating reproduction test...
Writing: tests/bugs/delete-user-crash.test.ts

[RUN TEST]
Agent: Running test to confirm reproduction...
Output: FAIL - Confirmed the crash occurs

[HYPOTHESIZE]
Agent: Stack trace shows crash at line 67 when accessing user.posts.
Hypothesis: user object is undefined before we try to access posts.
Let me check the code...

[READ CODE]
Agent: Reading userService.ts:60-70...
Found issue: deleteUser() doesn't check if user exists before accessing user.posts

[FIX]
Agent: Applying minimal fix - adding null check before accessing user.posts

[VERIFY]
Agent: Running reproduction test...
Output: PASS
Running full test suite...
Output: All tests PASS

[PREVENT]
Agent: Converting reproduction test to permanent regression test
Committing: "fix: add null check before accessing user.posts in deleteUser"

5. Failed Attempts (Negative Knowledge Evolution)

❌ Attempt: Debug by commenting out code

Context: Tried to isolate bug by commenting out sections
Failure: Changed program behavior, couldn't identify actual cause
Learning: Use logging and breakpoints, not code modification

❌ Attempt: Apply fix before confirming root cause

Context: Saw similar bug on Stack Overflow, applied their solution
Failure: Didn't fix our bug, introduced new edge case
Learning: Always verify the root cause matches before applying fixes

❌ Attempt: Add try-catch around everything

Context: Wrapped failing code in try-catch to "fix" errors
Failure: Silenced errors, made debugging harder, root cause unfixed
Learning: Fix the cause, don't suppress the symptom
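
In code, the gap between suppressing a symptom and handling the cause can be small. A sketch with hypothetical names (`User`, `countPostsSuppressed`, `countPostsChecked`), not taken from any real service:

```typescript
interface User {
  posts: string[];
}

// Anti-pattern: the blanket catch hides the TypeError, so a missing user looks like "0 posts".
function countPostsSuppressed(user: User | undefined): number {
  try {
    return (user as User).posts.length;
  } catch {
    return 0;
  }
}

// The missing user is detected explicitly and reported instead of being hidden.
function countPostsChecked(user: User | undefined): number {
  if (user === undefined) {
    throw new Error("countPosts called without a user");
  }
  return user.posts.length;
}
```

Both functions return the same result for a valid user. They differ only for a missing user, where the first silently returns 0 and the second fails with a message that points at the actual problem.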

❌ Attempt: Debug in production

Context: Couldn't reproduce locally, added logging to production
Failure: Exposed sensitive data in logs, caused performance issues
Learning: Reproduce locally or use proper observability tools

❌ Attempt: Multi-variable changes

Context: Changed error handling AND data validation AND logging
Failure: Bug disappeared but couldn't identify which change fixed it
Learning: Change one variable at a time, verify after each change
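
Changing one variable at a time can be made mechanical: apply a candidate change, check whether the bug is gone, then revert before trying the next one. A minimal sketch; `CandidateChange`, `findFixingChanges`, and the toggles in `state` are hypothetical stand-ins for real code changes and a real reproduction test:

```typescript
interface CandidateChange {
  name: string;
  apply: () => void;
  revert: () => void;
}

// Applies each candidate change in isolation and records which ones make the check pass.
function findFixingChanges(changes: CandidateChange[], bugIsGone: () => boolean): string[] {
  const fixing: string[] = [];
  for (const change of changes) {
    change.apply();
    try {
      if (bugIsGone()) {
        fixing.push(change.name);
      }
    } finally {
      change.revert(); // Always restore the baseline before trying the next change
    }
  }
  return fixing;
}

// Stand-in scenario: the bug disappears only when input validation is enabled.
const state = { validateInput: false, verboseLogging: false };
const fixing = findFixingChanges(
  [
    { name: "validate input", apply: () => { state.validateInput = true; }, revert: () => { state.validateInput = false; } },
    { name: "verbose logging", apply: () => { state.verboseLogging = true; }, revert: () => { state.verboseLogging = false; } },
  ],
  () => state.validateInput,
);
// fixing is ["validate input"]
```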

6. Common Bug Categories

Null/Undefined Errors

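Symptom: TypeError: Cannot read property 'X' of undefined (newer V8 versions, as used by recent Node.js and Chrome, word this as "Cannot read properties of undefined (reading 'X')", so search logs for both forms)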
Common causes: Missing null checks, async race conditions
Fix approach: Add null checks, ensure async operations complete

Type Errors

Symptom: Type mismatch, unexpected type coercion
Common causes: Weak typing, incorrect assumptions
Fix approach: Add type guards, validate inputs
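
A type guard validates an untrusted value at runtime and narrows its static type in the same step. A minimal sketch; the `PaymentInput` shape is a hypothetical example:

```typescript
interface PaymentInput {
  userId: string;
  amount: number;
}

// Type guard: narrows an unknown value to PaymentInput only if every field has the expected type.
function isPaymentInput(value: unknown): value is PaymentInput {
  if (typeof value !== "object" || value === null) {
    return false;
  }
  const candidate = value as { userId?: unknown; amount?: unknown };
  return (
    typeof candidate.userId === "string" &&
    typeof candidate.amount === "number" &&
    isFinite(candidate.amount) // rejects NaN and Infinity
  );
}
```

A value such as `{ userId: "u1", amount: "100" }` is rejected here, which catches string-typed amounts before any arithmetic can coerce them.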

Race Conditions

Symptom: Intermittent failures, works sometimes
Common causes: Async operations not awaited, shared state
Fix approach: Proper async/await, eliminate shared mutable state
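
A lost update is one common form of this bug: two async operations read shared state, then each writes back a value based on the stale read. The sketch below reproduces it deterministically and shows one possible remedy, serializing the operations through a promise chain. Eliminating the shared state altogether is the other option named above. `Account` and `demo` are hypothetical names:

```typescript
// Simulated account store; the await in readBalance stands in for a database round trip.
class Account {
  private balance = 0;
  private queue: Promise<void> = Promise.resolve();

  private async readBalance(): Promise<number> {
    return this.balance;
  }

  // Racy: two concurrent calls both read the old balance before either one writes.
  async depositRacy(amount: number): Promise<void> {
    const current = await this.readBalance();
    this.balance = current + amount;
  }

  // Serialized: each deposit starts only after the previous one has finished.
  depositSerialized(amount: number): Promise<void> {
    this.queue = this.queue.then(async () => {
      const current = await this.readBalance();
      this.balance = current + amount;
    });
    return this.queue;
  }

  getBalance(): number {
    return this.balance;
  }
}

async function demo(): Promise<[number, number]> {
  const racy = new Account();
  await Promise.all([racy.depositRacy(10), racy.depositRacy(10)]);

  const safe = new Account();
  await Promise.all([safe.depositSerialized(10), safe.depositSerialized(10)]);

  return [racy.getBalance(), safe.getBalance()]; // [10, 20]: the racy version loses one deposit
}
```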

Off-by-One Errors

Symptom: Array index out of bounds, fencepost errors
Common causes: Loop conditions, array slicing
Fix approach: Careful boundary analysis, add tests for edge cases
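
In JavaScript and TypeScript, reading past the end of an array does not throw. It returns `undefined`, so the symptom is often a `NaN` or `undefined` further downstream rather than an out-of-bounds error. A minimal sketch with hypothetical function names:

```typescript
// Buggy: `<=` runs one iteration too many, so the last step adds undefined and the total becomes NaN.
function sumBuggy(values: number[]): number {
  let total = 0;
  for (let i = 0; i <= values.length; i++) {
    total += values[i];
  }
  return total;
}

// Fixed: indices run from 0 to length - 1.
function sumFixed(values: number[]): number {
  let total = 0;
  for (let i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}
```

Edge-case tests for code like this should cover the boundaries directly: an empty array, a single element, and the last element.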

Configuration Errors

Symptom: Works locally, fails in production
Common causes: Missing env vars, different configurations
Fix approach: Validate configuration on startup, use same config locally
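
Validating configuration on startup turns a late, confusing failure into an immediate one with a clear message. A minimal sketch, assuming two hypothetical settings, `DATABASE_URL` and `PORT`; `loadConfig` and `AppConfig` are illustrative names:

```typescript
interface AppConfig {
  databaseUrl: string;
  port: number;
}

// Fails fast with a list of every problem instead of crashing later on first use.
function loadConfig(env: { [name: string]: string | undefined }): AppConfig {
  const problems: string[] = [];

  const databaseUrl = env["DATABASE_URL"];
  if (!databaseUrl) {
    problems.push("DATABASE_URL is missing");
  }

  const port = Number(env["PORT"]);
  if (!env["PORT"] || isNaN(port) || port % 1 !== 0 || port < 1 || port > 65535) {
    problems.push("PORT must be an integer between 1 and 65535");
  }

  // The databaseUrl check is repeated here only so the compiler narrows its type to string.
  if (problems.length > 0 || !databaseUrl) {
    throw new Error("Invalid configuration: " + problems.join("; "));
  }
  return { databaseUrl, port };
}
```

In a Node.js service this would typically be called once at startup with `process.env`, so a missing variable stops the process before it serves any traffic.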

7. Governance

Token Budget: ~495 lines (within 500 limit)
Dependencies: Python 3.8+ for log analysis script
Pattern Origin: Scientific Method, Systematic Debugging (Andreas Zeller)
Maintenance: Update anti-patterns as new failure modes discovered
Verification Date: 2026-01-01