Category: Development & Engineering
No API key required

debug-root-cause-analysis

Use when fixing a bug or analyzing a stack trace. Enforces systematic debugging methodology to prevent shotgun debugging and verify fixes.

Author: jakexiaohub (GitHub)

Skill: Debug Root Cause Analysis

Purpose

Prevent "shotgun debugging", where agents change code at random and hope the error vanishes. Enforce a scientific-method approach that isolates the root cause before any fix is applied, so that fixing one bug does not introduce another.

1. Negative Knowledge (Anti-Patterns)

| Failure Pattern | Context | Why It Fails |
| :--- | :--- | :--- |
| Shotgun Debugging | Changing multiple variables at once | Cannot isolate cause, introduces new bugs |
| Premature Fix | Applying a patch before reproducing failure | Fix is unverified, may not address root cause |
| Log Blindness | Ignoring severity levels or timestamps | Chasing symptoms, not causes |
| Assumption-Based Debugging | "I think it's X" without verification | Wastes time on wrong leads |
| Print Debugging Only | Adding console.logs without structured approach | Creates noise, hard to trace logic flow |
| Copy-Paste Solutions | Applying Stack Overflow answers without understanding | May not fit context, creates technical debt |
| Skipping Reproduction | Attempting fix without reproducing bug | Cannot verify fix works |

2. Verified Debugging Procedure

The Scientific Method for Debugging

1. OBSERVE   → Gather evidence (logs, stack traces, user reports)
2. REPRODUCE → Create minimal reproduction case
3. HYPOTHESIZE → Form testable theories about the cause
4. TEST → Verify each hypothesis systematically
5. FIX → Apply minimal fix to root cause
6. VERIFY → Confirm fix resolves issue and doesn't break anything
7. PREVENT → Add regression test

Phase 1: OBSERVE - Gather Evidence

Collect all available information:

# Check application logs
tail -n 100 logs/error.log

# Check recent commits that may have introduced the bug
git log --since="2 days ago" --oneline

# Check for similar errors
grep -r "ErrorMessage" logs/

# Review stack trace carefully
# Note: Line numbers, function names, timestamps

Questions to answer:

  • When did the bug first appear?
  • What changed recently (code, dependencies, config)?
  • Is it consistent or intermittent?
  • What's the exact error message and stack trace?
  • What data was being processed when it failed?
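The file, line number, and function name asked for above can be pulled out of a stack trace mechanically. The following is a minimal sketch, assuming V8-style frames of the form `at fn (file:line:col)` or `at file:line:col`; `parseStackFrames` and `StackFrame` are hypothetical names, not part of this skill.

```typescript
// Hypothetical helper: extract function, file, line, and column from V8-style frames.
interface StackFrame {
  fn: string | null; // function name, or null for anonymous frames
  file: string;
  line: number;
  column: number;
}

function parseStackFrames(stack: string): StackFrame[] {
  // Matches "at fn (file:line:col)" and "at file:line:col"
  const framePattern = /^\s*at (?:(.+?) \()?(.+?):(\d+):(\d+)\)?$/;
  const result: StackFrame[] = [];
  for (const rawLine of stack.split("\n")) {
    const match = framePattern.exec(rawLine);
    if (match) {
      result.push({
        fn: match[1] ?? null,
        file: match[2],
        line: Number(match[3]),
        column: Number(match[4]),
      });
    }
  }
  return result;
}

// Illustrative trace; the first line (the error message) is not a frame and is skipped.
const exampleStack = [
  "TypeError: Cannot read property 'id' of undefined",
  "    at getUserById (src/userService.ts:45:18)",
  "    at src/authMiddleware.ts:89:5",
].join("\n");

const parsedFrames = parseStackFrames(exampleStack);
```

The first parsed frame is the innermost call, which is usually the place to start reading.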

Phase 2: REPRODUCE - Create Minimal Case

Goal: Reliably trigger the bug with minimal setup.

// Example: Creating a minimal reproduction test
describe('Bug Reproduction', () => {
  it('should reproduce the error when...', async () => {
    // Arrange: Set up minimal conditions
    const input = {
      userId: 'test-user',
      amount: -100  // Hypothesis: negative amounts cause crash
    };

    // Act & Assert: Verify the bug occurs
    await expect(
      processPayment(input)
    ).rejects.toThrow('Cannot process negative amount');
  });
});

If you cannot reproduce:

  • Verify you have the same environment (Node version, dependencies)
  • Check if it's environment-specific (production vs development)
  • Look for missing configuration or state
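One way to act on the checks above is to snapshot each environment as flat key/value pairs and list what differs. This is a rough sketch under that assumption; `diffEnvironments`, `EnvSnapshot`, and the sample values are hypothetical.

```typescript
// Hypothetical helper: list settings that differ between two environments.
type EnvSnapshot = Record<string, string>;

interface EnvDifference {
  key: string;
  local: string | undefined;
  remote: string | undefined;
}

function diffEnvironments(local: EnvSnapshot, remote: EnvSnapshot): EnvDifference[] {
  // Union of keys from both sides, sorted for stable output
  const keys = Array.from(new Set([...Object.keys(local), ...Object.keys(remote)])).sort();
  const differences: EnvDifference[] = [];
  for (const key of keys) {
    if (local[key] !== remote[key]) {
      differences.push({ key, local: local[key], remote: remote[key] });
    }
  }
  return differences;
}

// Illustrative values only
const localEnv: EnvSnapshot = { nodeVersion: "20.11.0", NODE_ENV: "development", TZ: "UTC" };
const productionEnv: EnvSnapshot = {
  nodeVersion: "18.19.0",
  NODE_ENV: "production",
  TZ: "UTC",
  FEATURE_X: "on",
};

const envDifferences = diffEnvironments(localEnv, productionEnv);
```

Each reported key is a candidate explanation for a bug that appears in only one environment.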

Phase 3: HYPOTHESIZE - Form Theories

Generate testable hypotheses based on evidence:

  1. Analyze the stack trace: Start from the innermost function
  2. Check recent changes: git diff main...HEAD
  3. Review data flow: Trace how data reaches the failing code
  4. Consider edge cases: Null values, empty arrays, special characters

Example hypothesis formation:

Stack trace shows: TypeError: Cannot read property 'id' of undefined
Location: userService.ts:45

Hypothesis 1: User object is undefined when passed to service
Hypothesis 2: User object exists but missing 'id' property
Hypothesis 3: Async race condition, user not loaded yet

Next step: Add logging at userService.ts:44 to check user object
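The hypotheses above become easier to test when each one is paired with a check that evidence can refute. A minimal sketch follows; `Observation`, `Hypothesis`, `survivingHypotheses`, and the `loadCompleted` flag are assumptions made for illustration.

```typescript
// Hypothetical structure: each hypothesis carries a check that evidence can refute.
interface Observation {
  user: { id?: string } | undefined; // value seen at the failing line
  loadCompleted: boolean;            // whether the async load had finished
}

interface Hypothesis {
  description: string;
  isConsistentWith: (observation: Observation) => boolean;
}

const hypotheses: Hypothesis[] = [
  {
    description: "User object is undefined when passed to service",
    isConsistentWith: (o) => o.user === undefined && o.loadCompleted,
  },
  {
    description: "User object exists but is missing 'id'",
    isConsistentWith: (o) => o.user !== undefined && o.user.id === undefined,
  },
  {
    description: "Async race condition, user not loaded yet",
    isConsistentWith: (o) => o.user === undefined && !o.loadCompleted,
  },
];

// Keep only the hypotheses the evidence has not ruled out.
function survivingHypotheses(all: Hypothesis[], observation: Observation): string[] {
  return all.filter((h) => h.isConsistentWith(observation)).map((h) => h.description);
}

// Evidence gathered from logging at the failing line (illustrative)
const observed: Observation = { user: undefined, loadCompleted: true };
const remaining = survivingHypotheses(hypotheses, observed);
```

With this observation only the first hypothesis survives, so the next step is to find out why the caller passes an undefined user.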

Phase 4: TEST - Verify Hypothesis

Use the zero-context script to analyze logs:

# Run log analysis to find patterns
python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
  --log-file logs/error.log \
  --error-pattern "TypeError"

Add strategic logging to test hypothesis:

// BEFORE (hypothesis testing)
async function getUserById(id: string) {
  console.log('[DEBUG] getUserById called with:', id);
  const user = await db.users.findOne({ id });
  console.log('[DEBUG] User found:', user);

  if (!user) {
    console.log('[DEBUG] User not found for ID:', id);
    throw new Error('User not found');
  }

  return user;
}

Verify with a test:

it('should handle missing user gracefully', async () => {
  // Probe, not a permanent test: there is deliberately no assertion.
  // If the call rejects with a TypeError about 'id', the test fails
  // and hypothesis 1 is confirmed.
  await getUserById('nonexistent-id');
});

Phase 5: FIX - Apply Minimal Fix

Once root cause is confirmed, apply the minimal fix:

// AFTER (minimal fix)
async function getUserById(id: string) {
  const user = await db.users.findOne({ id });

  if (!user) {
    throw new Error(`User not found: ${id}`);
  }

  return user;
}

Principles:

  • Fix ONLY the root cause
  • Don't refactor during bug fixing
  • Don't add features while fixing bugs
  • Keep the fix as simple as possible

Phase 6: VERIFY - Confirm Fix

Verification checklist:

# 1. Run the reproduction test
npm test -- bug-reproduction.test.ts
# Expected: PASS

# 2. Run full test suite
npm test
# Expected: All tests PASS (no regressions)

# 3. Manual verification (if applicable)
npm run dev
# Test the actual user flow that triggered the bug

# 4. Check logs are clean
tail -f logs/error.log
# Expected: No errors when performing the previously failing action

Phase 7: PREVENT - Add Regression Test

Convert your reproduction case into a permanent test:

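// tests/unit/services/UserService.test.ts
// Assumes Jest; the mockDb shape is illustrative and must match what UserService expects.
describe('UserService.getUserById', () => {
  const mockDb = { users: { findOne: jest.fn() } };
  let service: UserService;

  // Fresh mock state and service for every test, so both tests can use `service`
  beforeEach(() => {
    mockDb.users.findOne.mockReset();
    service = new UserService(mockDb);
  });

  it('should throw clear error when user not found', async () => {
    mockDb.users.findOne.mockResolvedValue(undefined);

    await expect(
      service.getUserById('nonexistent-id')
    ).rejects.toThrow('User not found: nonexistent-id');
  });

  it('should handle null user gracefully', async () => {
    mockDb.users.findOne.mockResolvedValue(null);

    await expect(
      service.getUserById('test-id')
    ).rejects.toThrow('User not found');
  });
});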

3. Zero-Context Scripts

analyze_logs.py

Located at: .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py

Purpose: Parse error logs for frequency, patterns, and correlations.

Usage:

python .claude/skills/debug-root-cause-analysis/scripts/analyze_logs.py \
  --log-file logs/error.log \
  --error-pattern "TypeError|ReferenceError|null" \
  --time-range "last 24h"

Output:

Error Frequency Analysis:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
TypeError: Cannot read property 'id': 47 occurrences
  First seen: 2026-01-01 10:23:45
  Last seen:  2026-01-01 14:32:10
  Peak time:  2026-01-01 12:00-13:00 (23 errors)

  Stack trace (most common):
    at userService.ts:45
    at authMiddleware.ts:89

ReferenceError: user is not defined: 3 occurrences
  ...
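The script itself is not reproduced here. As a rough illustration of the kind of aggregation behind such a report, the sketch below counts matching messages and tracks first and last occurrence. It is written in TypeScript to match the other examples, assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, and is not the actual analyze_logs.py logic; `summarizeErrors` is a hypothetical name.

```typescript
// Hypothetical sketch, not the actual analyze_logs.py logic.
// Assumes each log line looks like "2026-01-01 10:23:45 <message>".
interface ErrorSummary {
  message: string;
  occurrences: number;
  firstSeen: string;
  lastSeen: string;
}

// Pass a non-global RegExp: test() on a /g pattern keeps state between calls.
function summarizeErrors(logLines: string[], errorPattern: RegExp): ErrorSummary[] {
  const linePattern = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$/;
  const summaries = new Map<string, ErrorSummary>();

  for (const logLine of logLines) {
    const match = linePattern.exec(logLine);
    if (!match || !errorPattern.test(match[2])) continue;

    const [, timestamp, message] = match;
    const existing = summaries.get(message);
    if (existing) {
      existing.occurrences += 1;
      // Fixed-width timestamps compare correctly as plain strings
      if (timestamp < existing.firstSeen) existing.firstSeen = timestamp;
      if (timestamp > existing.lastSeen) existing.lastSeen = timestamp;
    } else {
      summaries.set(message, { message, occurrences: 1, firstSeen: timestamp, lastSeen: timestamp });
    }
  }

  // Most frequent errors first
  return Array.from(summaries.values()).sort((a, b) => b.occurrences - a.occurrences);
}

// Illustrative log lines
const sampleLog = [
  "2026-01-01 10:23:45 TypeError: Cannot read property 'id' of undefined",
  "2026-01-01 11:02:10 ReferenceError: user is not defined",
  "2026-01-01 12:15:00 TypeError: Cannot read property 'id' of undefined",
  "2026-01-01 12:20:00 INFO request completed",
];
const errorSummaries = summarizeErrors(sampleLog, /TypeError|ReferenceError/);
```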

4. Debugging Workflow for Agents

When you encounter a bug or error:

  1. Stop and Observe:

    • Don't immediately change code
    • Read the full error message and stack trace
    • Note the file, line number, and error type
  2. Gather Context:

    # What files are involved?
    grep -r "functionName" src/
    
    # What changed recently?
    git log --oneline -10
    
    # Are there logs?
    ls -la logs/ && tail logs/error.log
    
  3. Create Reproduction Test:

    • Write a failing test that triggers the bug
    • Make it as minimal as possible
    • Verify it actually fails
  4. Analyze Systematically:

    • Use the zero-context script to analyze logs
    • Add strategic logging (NOT shotgun console.logs)
    • Test one hypothesis at a time
  5. Apply Minimal Fix:

    • Fix only the root cause
    • Don't refactor or add features
    • Keep changes minimal and focused
  6. Verify Thoroughly:

    • Run the reproduction test
    • Run the full test suite
    • Manual verification if needed
  7. Prevent Recurrence:

    • Convert reproduction to permanent test
    • Document the root cause in comments if complex
    • Consider if similar bugs could exist elsewhere

Example Agent Session

User: "The app crashes when I try to delete a user"

Agent: "I'll debug this systematically using root cause analysis."

[OBSERVE]
Agent: Reading error logs...
Found: TypeError: Cannot read property 'posts' of undefined at deleteUser:67

[REPRODUCE]
Agent: Creating reproduction test...
Writing: tests/bugs/delete-user-crash.test.ts

[RUN TEST]
Agent: Running test to confirm reproduction...
Output: FAIL - Confirmed the crash occurs

[HYPOTHESIZE]
Agent: Stack trace shows crash at line 67 when accessing user.posts.
Hypothesis: user object is undefined before we try to access posts.
Let me check the code...

[READ CODE]
Agent: Reading userService.ts:60-70...
Found issue: deleteUser() doesn't check if user exists before accessing user.posts

[FIX]
Agent: Applying minimal fix - adding null check before accessing user.posts

[VERIFY]
Agent: Running reproduction test...
Output: PASS
Running full test suite...
Output: All tests PASS

[PREVENT]
Agent: Converting reproduction test to permanent regression test
Committing: "fix: add null check before accessing user.posts in deleteUser"

5. Failed Attempts (Negative Knowledge Evolution)

❌ Attempt: Debug by commenting out code

Context: Tried to isolate bug by commenting out sections
Failure: Changed program behavior, couldn't identify actual cause
Learning: Use logging and breakpoints, not code modification

❌ Attempt: Apply fix before confirming root cause

Context: Saw similar bug on Stack Overflow, applied their solution
Failure: Didn't fix our bug, introduced new edge case
Learning: Always verify the root cause matches before applying fixes

❌ Attempt: Add try-catch around everything

Context: Wrapped failing code in try-catch to "fix" errors
Failure: Silenced errors, made debugging harder, root cause unfixed
Learning: Fix the cause, don't suppress the symptom
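The difference between suppressing a symptom and preserving the cause can be shown side by side. This is a small sketch; `parseAmount` and its two wrappers are hypothetical functions invented for the comparison.

```typescript
// Hypothetical example: a parser that fails on malformed input.
function parseAmount(raw: string): number {
  const value = Number(raw);
  if (Number.isNaN(value)) {
    throw new Error(`Invalid amount: "${raw}"`);
  }
  return value;
}

// Anti-pattern: the catch block hides the failure and returns a plausible value.
function parseAmountSuppressed(raw: string): number {
  try {
    return parseAmount(raw);
  } catch {
    return 0; // the bad input now looks like a legitimate zero
  }
}

// Preferable: add context and rethrow so the root cause stays visible.
function parseAmountWithContext(raw: string, field: string): number {
  try {
    return parseAmount(raw);
  } catch (error) {
    const reason = error instanceof Error ? error.message : String(error);
    throw new Error(`Failed to parse field "${field}": ${reason}`);
  }
}
```

The suppressed version turns `"abc"` into `0` and lets the bad value travel further into the system, while the contextual version fails at the point where the bad data first appears.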

❌ Attempt: Debug in production

Context: Couldn't reproduce locally, added logging to production
Failure: Exposed sensitive data in logs, caused performance issues
Learning: Reproduce locally or use proper observability tools

❌ Attempt: Multi-variable changes

Context: Changed error handling AND data validation AND logging
Failure: Bug disappeared but couldn't identify which change fixed it
Learning: Change one variable at a time, verify after each change
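Changing one variable at a time can itself be made systematic: apply each candidate change to an untouched baseline and rerun the reproduction. The sketch below assumes the changes can be modeled as settings; `Settings`, `bugReproduces`, and `findFixingChanges` are hypothetical stand-ins.

```typescript
// Hypothetical sketch: try each candidate change in isolation against the reproduction.
interface Settings {
  validateInput: boolean;
  retryOnError: boolean;
  verboseLogging: boolean;
}

type CandidateChange = { label: string; apply: (base: Settings) => Settings };

// Stand-in for the reproduction test: returns true when the bug still occurs.
function bugReproduces(settings: Settings): boolean {
  return !settings.validateInput; // illustrative: only input validation matters here
}

function findFixingChanges(base: Settings, candidates: CandidateChange[]): string[] {
  const fixing: string[] = [];
  for (const candidate of candidates) {
    // Always start from a copy of the baseline so changes never stack
    const trial = candidate.apply({ ...base });
    if (!bugReproduces(trial)) fixing.push(candidate.label);
  }
  return fixing;
}

const baseline: Settings = { validateInput: false, retryOnError: false, verboseLogging: false };
const candidateChanges: CandidateChange[] = [
  { label: "error handling", apply: (s) => ({ ...s, retryOnError: true }) },
  { label: "data validation", apply: (s) => ({ ...s, validateInput: true }) },
  { label: "logging", apply: (s) => ({ ...s, verboseLogging: true }) },
];
const fixingChanges = findFixingChanges(baseline, candidateChanges);
```

Because every trial starts from the same baseline, the result names the one change responsible instead of a bundle of three.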

6. Common Bug Categories

Null/Undefined Errors

  • Symptom: TypeError: Cannot read property 'X' of undefined
  • Common causes: Missing null checks, async race conditions
  • Fix approach: Add null checks, ensure async operations complete
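A short sketch of the null-check approach, using hypothetical `Account` and `Post` types: fail loudly when the object itself is missing, and default an optional collection to empty.

```typescript
// Hypothetical types for illustration.
interface Post { title: string }
interface Account { id: string; posts?: Post[] }

// Unsafe: throws a TypeError at runtime when account is undefined.
function countPostsUnsafe(account: Account | undefined): number {
  return (account as Account).posts!.length;
}

// Safer: fail loudly on a missing account, default missing posts to empty.
function countPosts(account: Account | undefined): number {
  if (account === undefined) {
    throw new Error("Account not found");
  }
  return account.posts?.length ?? 0;
}
```

The type assertion and the `!` operator in the unsafe version silence the compiler without changing runtime behavior, which is how this class of bug tends to slip through a typed codebase.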

Type Errors

  • Symptom: Type mismatch, unexpected type coercion
  • Common causes: Weak typing, incorrect assumptions
  • Fix approach: Add type guards, validate inputs
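A type guard validates untrusted data at the boundary before the rest of the code relies on its shape. The sketch below is one way to do that; `PaymentInput`, `isPaymentInput`, and `parsePaymentInput` are hypothetical names.

```typescript
// Hypothetical payload shape for illustration.
interface PaymentInput {
  userId: string;
  amount: number;
}

// Type guard: narrows unknown data to PaymentInput only after checking each field.
function isPaymentInput(value: unknown): value is PaymentInput {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate.userId === "string" &&
    typeof candidate.amount === "number" &&
    Number.isFinite(candidate.amount)
  );
}

// Parse JSON and reject anything that does not match the expected shape.
function parsePaymentInput(json: string): PaymentInput {
  const parsed: unknown = JSON.parse(json);
  if (!isPaymentInput(parsed)) {
    throw new Error("Invalid payment input");
  }
  return parsed;
}
```

A payload such as `{"userId":"u1","amount":"100"}` is rejected here instead of being coerced later, when `"100" + 1` would silently produce `"1001"`.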

Race Conditions

  • Symptom: Intermittent failures, works sometimes
  • Common causes: Async operations not awaited, shared state
  • Fix approach: Proper async/await, eliminate shared mutable state
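A lost update is a compact way to see how concurrent async operations on shared state go wrong. In this sketch, `Counter` and both increment functions are hypothetical; the write yields to the event loop to stand in for I/O latency.

```typescript
// Hypothetical in-memory store with an async write.
class Counter {
  private value = 0;

  async read(): Promise<number> {
    return this.value;
  }

  async write(next: number): Promise<void> {
    // Yield to the event loop to simulate I/O latency
    await Promise.resolve();
    this.value = next;
  }
}

// Racy: both increments read the same stale value because they run concurrently.
async function incrementTwiceRacy(counter: Counter): Promise<number> {
  const increment = async () => counter.write((await counter.read()) + 1);
  await Promise.all([increment(), increment()]);
  return counter.read();
}

// Sequential: each increment finishes before the next one starts.
async function incrementTwiceSequential(counter: Counter): Promise<number> {
  for (let i = 0; i < 2; i++) {
    await counter.write((await counter.read()) + 1);
  }
  return counter.read();
}
```

The racy version ends at 1 because both increments read 0 before either writes; the sequential version ends at 2. Real races are rarely this deterministic, which is why they show up as intermittent failures.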

Off-by-One Errors

  • Symptom: Array index out of bounds, fencepost errors
  • Common causes: Loop conditions, array slicing
  • Fix approach: Careful boundary analysis, add tests for edge cases
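Pagination is a common place for fencepost mistakes, so it makes a useful boundary-analysis example. The helpers below are a hypothetical sketch using 1-based page numbers.

```typescript
// Hypothetical pagination helper using 1-based page numbers.
function getPage<T>(items: T[], page: number, pageSize: number): T[] {
  if (page < 1 || pageSize < 1) {
    throw new RangeError("page and pageSize must be at least 1");
  }
  // slice() takes a half-open range [start, end): start is included, end is not
  const start = (page - 1) * pageSize;
  const end = start + pageSize;
  return items.slice(start, end);
}

// Number of pages needed, rounding up so a partial last page is counted.
function pageCount(totalItems: number, pageSize: number): number {
  return Math.ceil(totalItems / pageSize);
}

const letters = ["a", "b", "c", "d", "e"];
const firstPage = getPage(letters, 1, 2); // ["a", "b"]
const lastPage = getPage(letters, 3, 2);  // ["e"]
const beyondEnd = getPage(letters, 4, 2); // []
```

The edge cases worth testing are the first page, a partial last page, and a page past the end, since those are where `page * pageSize` or `<=` in place of `<` would go wrong.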

Configuration Errors

  • Symptom: Works locally, fails in production
  • Common causes: Missing env vars, different configurations
  • Fix approach: Validate configuration on startup, use same config locally
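Validating configuration at startup might look like the sketch below. The `AppConfig` shape, the variable names `DATABASE_URL` and `PORT`, and the default port are assumptions for illustration.

```typescript
// Hypothetical config shape and variable names for illustration.
interface AppConfig {
  databaseUrl: string;
  port: number;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const problems: string[] = [];

  const databaseUrl = env.DATABASE_URL;
  if (!databaseUrl) problems.push("DATABASE_URL is required");

  // Fall back to an assumed default port when PORT is not set
  const port = Number(env.PORT ?? "3000");
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    problems.push(`PORT must be an integer between 1 and 65535, got "${env.PORT}"`);
  }

  // Report every problem at once instead of failing on the first
  if (problems.length > 0) {
    throw new Error(`Invalid configuration: ${problems.join("; ")}`);
  }
  return { databaseUrl: databaseUrl as string, port };
}
```

In a Node application this would typically be called once with `process.env` before the server starts, so a missing variable stops the deploy instead of surfacing as a runtime error under load.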

7. Governance

  • Token Budget: ~495 lines (within 500 limit)
  • Dependencies: Python 3.8+ for log analysis script
  • Pattern Origin: Scientific Method, Systematic Debugging (Andreas Zeller)
  • Maintenance: Update anti-patterns as new failure modes discovered
  • Verification Date: 2026-01-01