Error Investigation Skill

Tech Stack: AWS CLI, CloudWatch Logs, Lambda, boto3, jq

Source: Extracted from CLAUDE.md error investigation principles and AWS diagnostic patterns.

When to Use This Skill

Use the error-investigation skill when:

✓ AWS service returning errors
✓ Lambda function failing in production
✓ CloudWatch logs showing errors
✓ Service completed but operation failed
✓ Silent failures (no exception but wrong result)
✓ Investigating production incidents

DO NOT use this skill for:

✗ Local Python debugging (use debugger instead)
✗ Code refactoring (use refactor skill)
✗ Performance optimization (use different skill)

Quick Investigation Decision Tree

What's failing?
├─ Lambda function?
│  ├─ Returns 200 but errors? → Check CloudWatch logs (Layer 3)
│  ├─ Timeout? → Check duration metrics + external dependencies
│  ├─ Permission denied? → Check IAM role policies
│  └─ Cold start slow? → Module-level initialization pattern
│
├─ AWS service operation?
│  ├─ DynamoDB write succeeded (200) but no data? → Check ConditionExpression, read the item back
│  ├─ S3 upload succeeded but file missing? → Check bucket policy
│  ├─ SQS message sent but not received? → Check DLQ
│  └─ Step Function succeeded but workflow incomplete? → Check state outputs
│
├─ External API call?
│  ├─ Timeout? → Check network path (security groups, VPC)
│  ├─ 403 Forbidden? → Check API key, rate limits
│  ├─ 500 Error? → Check API status page, retry logic
│  └─ Silent failure? → Inspect response payload
│
└─ Database query?
   ├─ INSERT affected 0 rows? → FK constraint, ENUM mismatch
   ├─ SELECT returns empty? → Check WHERE clause, data exists
   ├─ Connection timeout? → Security group, VPC routing
   └─ Query slow? → Missing index, full table scan

Loop Pattern: Retrying Loop → Synchronize Loop

Escalation Trigger:

/trace shows root cause
Fix applied, /validate shows success
But error recurs later (knowledge drift)

Tools Used:

/trace - Find root cause (backward trace from error)
/validate - Verify fix works (test the solution)
/consolidate - Update knowledge base (documentation, runbooks)
/observe - Monitor for recurring issues (drift detection)
/reflect - Assess if error represents pattern vs one-off

Why This Works: Error investigation fits the retrying loop (find the root cause, fix the execution), but recurring errors trigger the synchronize loop (update knowledge and documentation).

See Thinking Process Architecture - Feedback Loops for structural overview.

Core Investigation Principles

Principle 1: Execution Completion ≠ Operational Success

From CLAUDE.md:

"Execution completion ≠ Operational success. Verify actual outcomes across multiple layers, not just the absence of exceptions."

Why This Matters:

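import json
import time

import boto3

lambda_client = boto3.client('lambda')
logs_client = boto3.client('logs')  # filter_log_events lives on the Logs client

# ❌ WRONG: Assumes 200 = success
response = lambda_client.invoke(FunctionName='worker', Payload='{}')
assert response['StatusCode'] == 200  # ✗ Weak validation

# ✅ RIGHT: Multi-layer verification
response = lambda_client.invoke(FunctionName='worker', Payload='{}')

# Layer 1: Status code (still 200 when the function itself raised)
assert response['StatusCode'] == 200
assert 'FunctionError' not in response

# Layer 2: Response payload
payload = json.loads(response['Payload'].read())
assert 'errorMessage' not in payload

# Layer 3: CloudWatch logs (last 2 minutes only, so old errors don't trip the check)
logs = logs_client.filter_log_events(
    logGroupName='/aws/lambda/worker',
    startTime=int((time.time() - 120) * 1000),
    filterPattern='ERROR'
)
assert len(logs['events']) == 0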

Note: This is the AWS-specific application of Progressive Evidence Strengthening (CLAUDE.md Principle #2). The general pattern applies across all domains—here we show how it manifests in AWS Lambda/API debugging.

Principle 2: Multi-Layer Verification (AWS Application)

The Three Layers:

| Layer | Signal Strength | What It Tells You | What It DOESN'T Tell You |
|-------|----------------|-------------------|--------------------------|
| Status Code | Weakest | Service responded | Whether it succeeded |
| Response Payload | Stronger | Function returned data | Whether logs show errors |
| CloudWatch Logs | Strongest | What actually happened | Future issues |

Pattern:

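# Layer 1: Status code (weakest)
# Note: AWS CLI v2 expects a base64 payload by default; add
# --cli-binary-format raw-in-base64-out to pass raw JSON such as '{}'.
aws lambda invoke --function-name worker --payload '{}' /tmp/response.json
echo "Exit code: $?"  # 0 = AWS CLI succeeded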

# Layer 2: Response payload (stronger)
if grep -q "errorMessage" /tmp/response.json; then
  echo "❌ Lambda returned error"
  exit 1
fi

# Layer 3: CloudWatch logs (strongest)
ERROR_COUNT=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 120))000 \
  --filter-pattern "ERROR" \
  --query 'length(events)' --output text)

if [ "$ERROR_COUNT" -gt 0 ]; then
  echo "❌ Found errors in CloudWatch logs"
  exit 1
fi

echo "✅ All 3 layers verified"

See AWS-DIAGNOSTICS.md for AWS-specific diagnostic patterns.

Principle 3: Log Level Determines Discoverability

From CLAUDE.md:

"Log levels are not just severity indicators—they determine whether failures are discoverable by monitoring systems."

Log Level Impact:

| Log Level | Monitored? | Alerted? | Discoverable? |
|-----------|------------|----------|---------------|
| ERROR | ✅ Yes | ✅ Yes | ✅ Dashboards |
| WARNING | ✅ Yes | ❌ No | ⚠️ Manual review |
| INFO | ⚠️ Maybe | ❌ No | ❌ Active search |
| DEBUG | ❌ No | ❌ No | ❌ Hidden |

Investigation Pattern:

# Step 1: Check ERROR level first
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR"

# Step 2: If no ERRORs but operation failed → Check WARNING
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "WARNING"

# Step 3: Check both application AND service logs
# - Application logs: /aws/lambda/worker
# - Service logs: Lambda execution errors, timeouts

Why This Matters:

# ❌ BAD: Error logged at WARNING (invisible to monitoring)
try:
    result = db.execute(query, params)
    if result == 0:
        logger.warning("INSERT failed")  # ⚠️  Not monitored!
except Exception as e:
    logger.warning(f"DB error: {e}")  # ⚠️  Not alerted!

# ✅ GOOD: Error logged at ERROR (visible to monitoring)
try:
    result = db.execute(query, params)
    if result == 0:
        logger.error("INSERT failed - 0 rows affected")  # ✅ Monitored
        raise ValueError("Insert operation failed")
except Exception as e:
    logger.error(f"DB error: {e}")  # ✅ Alerted
    raise

Principle 4: Lambda Logging Configuration

From CLAUDE.md:

"AWS Lambda pre-configures logging before your code runs. Never use logging.basicConfig() in Lambda handlers—it's a no-op."

The Problem:

# ❌ This does NOTHING in Lambda
import logging

logging.basicConfig(level=logging.INFO)  # No-op!
logger = logging.getLogger(__name__)
logger.info("Invisible in CloudWatch")  # Filtered out

Why It Fails:

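The Lambda runtime adds handlers to the root logger BEFORE your code runs
basicConfig() only configures the root logger when it has NO handlers (unless force=True is passed, Python 3.8+)
Result: the root level stays at the default WARNING, so INFO-level logs are invisible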
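
The no-op behavior can be checked locally without deploying anything. The sketch below simulates a pre-configured root logger by attaching a placeholder handler; the `NullHandler` stand-in and the function name `basic_config_is_ignored` are assumptions for illustration, not what the Lambda runtime actually installs.

```python
import logging


def basic_config_is_ignored() -> bool:
    """Return True if basicConfig() leaves a pre-configured root logger untouched."""
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    try:
        # Simulate the Lambda runtime: a handler is already attached
        # and the level is the Python default (WARNING).
        root.handlers = [logging.NullHandler()]
        root.setLevel(logging.WARNING)

        logging.basicConfig(level=logging.INFO)  # skipped: handlers exist

        return root.level == logging.WARNING
    finally:
        # Restore whatever configuration the caller had.
        root.handlers = saved_handlers
        root.setLevel(saved_level)


if __name__ == "__main__":
    print(basic_config_is_ignored())  # prints True
```

Because the level argument is only applied when basicConfig() actually configures the root logger, setting the level directly on the root logger is the more predictable option in this situation.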

The Solution:

# ✅ Works in both Lambda and local dev
import logging

root_logger = logging.getLogger()

if root_logger.handlers:  # Lambda (already configured)
    root_logger.setLevel(logging.INFO)
else:  # Local dev (needs configuration)
    logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)
logger.info("Visible in CloudWatch")  # ✅ Works

See LAMBDA-LOGGING.md for comprehensive Lambda logging patterns.

Common Investigation Scenarios

Scenario 1: Lambda Returns 200 But Has Errors

Symptom: Function completes, returns 200, but errors in logs.

Investigation Steps:

# 1. Invoke function
aws lambda invoke \
  --function-name worker \
  --payload '{"ticker": "NVDA19"}' \
  /tmp/response.json

# 2. Check response (Layer 2)
cat /tmp/response.json
# Output: {"result": {...}}  # Looks fine

# 3. Check CloudWatch logs (Layer 3)
aws logs tail /aws/lambda/worker --since 1m --filter-pattern "ERROR"

# Output:
# [ERROR] 2024-01-15 10:23:45 INSERT affected 0 rows for NVDA19
# [ERROR] 2024-01-15 10:23:46 FK constraint violation: symbol not found

Root Cause: Silent database failure (0 rowcount). The error was logged at ERROR, but the exception was caught and the function still reported success.

Fix:

# Before:
def store_report(self, symbol, report):
    try:
        self.db.execute(query, params)
        return True  # ❌ Always returns True
    except Exception as e:
        logger.error(f"DB error: {e}")
        return True  # ❌ Still returns True!

# After (exceptions now propagate to the caller instead of being swallowed):
def store_report(self, symbol, report):
    rowcount = self.db.execute(query, params)
    if rowcount == 0:
        logger.error(f"INSERT affected 0 rows for {symbol}")
        return False  # ✅ Returns False on failure
    return True

Scenario 2: INFO Logs Not Showing in CloudWatch

Symptom: logger.info() calls not appearing in CloudWatch.

Investigation Steps:

# 1. Search for INFO-level messages from the last 5 minutes
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 300))000 \
  --filter-pattern "INFO"

# No results (but INFO logs exist in code)

# 2. Check root logger configuration
# Add to Lambda handler:
import logging
print(f"Root logger level: {logging.getLogger().level}")
print(f"Root logger handlers: {logging.getLogger().handlers}")

Root Cause: Root logger set to WARNING, filters out INFO.

Fix:

# handler.py (entry point)
import logging

# Configure logging at module level
root_logger = logging.getLogger()

if root_logger.handlers:  # Lambda environment
    root_logger.setLevel(logging.INFO)  # ✅ Set root logger level
else:  # Local development
    logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)

def lambda_handler(event, context):
    logger.info("Handler invoked")  # Now visible
    # ...

See LAMBDA-LOGGING.md#troubleshooting for complete debugging guide.

Scenario 3: Lambda Timeout with Network Operations

Symptom: Lambda times out after long execution (600s+), logs show "PDF generation..." but no completion message.

Investigation Steps:

# 1. Check execution duration pattern (REPORT lines carry durations in ms)
aws logs filter-log-events \
  --log-group-name /aws/lambda/pdf-worker \
  --filter-pattern "REPORT" \
  --query 'events[*].message' \
  | grep -o "Billed Duration: [0-9]*" \
  | awk '{print $3}' \
  | sort -n

# Look for pattern:
# - First 5 requests: 2000-3000 ms (2-3s)
# - Last 5 requests: 600000+ ms (600s+, timeout)

# 2. Check for connection timeout errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/pdf-worker \
  --filter-pattern "ConnectTimeoutError" \
  --query 'events[*].message'

# Output:
# botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:
# "https://bucket.s3.region.amazonaws.com/..."

# 3. Analyze timeline (deterministic vs random)
aws logs tail /aws/lambda/pdf-worker --since 30m | \
  grep -E "START RequestId|✅ PDF job completed|ConnectTimeoutError" | \
  awk '{print $1, $2, $NF}' | sort

# Deterministic pattern (first N succeed, last M fail) = infrastructure bottleneck
# Random pattern (scattered failures) = performance issue

Root Cause Analysis:

# 4. Check VPC configuration
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.region.s3"

# If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)

# 5. Verify NAT Gateway routing
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[*].Routes[?GatewayId!=`local`]'

# If route 0.0.0.0/0 → nat-xxx → S3 traffic traverses the NAT Gateway, which can saturate under concurrent connections

Root Cause: NAT Gateway connection saturation. When N concurrent Lambdas upload to S3:

NAT Gateway has limited connection establishment rate
First N connections succeed (2-3s upload time)
Remaining connections queue and timeout (600s = boto3 default timeout + retries)
Pattern is deterministic (always first N succeed, last M fail)

Fix:

# terraform/s3_vpc_endpoint.tf
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.default.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = data.aws_route_tables.vpc_route_tables.ids

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
    }]
  })
}

Why This Works:

S3 Gateway Endpoint adds routes to VPC route tables
S3 traffic bypasses NAT Gateway (direct AWS network path)
No connection establishment limits
FREE (Gateway endpoints have no hourly charge)
200x faster (2-3s vs 600s timeout)

Verification:

# 1. Deploy VPC endpoint
cd terraform && terraform apply

# 2. Verify endpoint created
terraform output s3_vpc_endpoint_state  # Should be "available"

# 3. Test full workflow
aws stepfunctions start-execution \
  --state-machine-arn <pdf-workflow-arn> \
  --input '{"report_date":"2026-01-05"}'

# 4. Monitor for 100% success rate
aws logs tail /aws/lambda/pdf-worker --follow

# Expected: All PDFs complete in 2-3s, no timeouts

Critical Insight: Execution Time ≠ Hang Location

600s execution time doesn't mean code hangs for 600s
It means ENTIRE execution (including network timeout) took 600s
Check stack traces (Layer 3) to find WHERE timeout occurs
Don't assume "logs stop at line X" = "code hangs at line X" (log lines may be lost when Lambda terminates the invocation)

Pattern Recognition:

Deterministic failure (first N succeed, last M fail) → Infrastructure bottleneck (NAT, VPC endpoint)
Random failure (scattered across all attempts) → Performance issue (slow API, memory pressure)
All fail → Configuration issue (missing permissions, wrong endpoint)
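
The three patterns above can be expressed as a small classifier. This is a minimal sketch, assuming outcomes have already been extracted from the logs and ordered by invocation start time; the function name and the returned labels are hypothetical.

```python
from typing import Sequence


def classify_failure_pattern(outcomes: Sequence[bool]) -> str:
    """Classify invocation outcomes ordered by start time (True = success)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    if all(outcomes):
        return "no failures"
    if not any(outcomes):
        return "all fail: configuration issue"
    first_failure = list(outcomes).index(False)
    # Deterministic: once failures begin, nothing succeeds afterwards.
    if not any(outcomes[first_failure:]):
        return "deterministic: infrastructure bottleneck"
    return "random: performance issue"


if __name__ == "__main__":
    print(classify_failure_pattern([True] * 5 + [False] * 5))
    print(classify_failure_pattern([True, False, True, True, False, True]))
```

A single run is weak evidence either way; the classification is more trustworthy when the same shape repeats across several executions.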

See Bug Hunt Report for complete investigation.

Scenario 4: DynamoDB PutItem Succeeds But No Data

Symptom: put_item() returns 200, but item not in table.

Investigation Steps:

# 1. Check response
response = table.put_item(Item={'ticker': 'NVDA19', 'data': {...}})
print(f"HTTP Status: {response['ResponseMetadata']['HTTPStatusCode']}")
# Output: 200

# 2. Verify item exists
response = table.get_item(Key={'ticker': 'NVDA19'})
print(response.get('Item'))
# Output: None (no item!)

# 3. Check for conditional write
response = table.put_item(
    Item={'ticker': 'NVDA19', 'data': {...}},
    ConditionExpression='attribute_not_exists(ticker)'  # ← Condition failed?
)

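Root Cause: A conditional write failed. A failed ConditionExpression raises ConditionalCheckFailedException rather than returning 200, so a "silent" failure here usually means the exception was caught and swallowed somewhere upstream.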

Fix:

import botocore.exceptions

# Before:
response = table.put_item(Item=item)  # ❌ No verification

# After:
try:
    response = table.put_item(Item=item)

    # Verify write (strongly consistent read; get_item is eventually consistent by default)
    verify = table.get_item(Key={'ticker': item['ticker']}, ConsistentRead=True)
    if 'Item' not in verify:
        logger.error(f"Item not found after put_item: {item['ticker']}")
        raise ValueError("DynamoDB write verification failed")

except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
        logger.warning(f"Conditional write failed: {item['ticker']}")
    else:
        logger.error(f"DynamoDB error: {e}")
        raise

AWS Boundary Verification

When to apply: Distributed system errors (Lambda, Aurora, S3, SQS, Step Functions)

Problem: Code looks correct locally but fails in AWS due to unverified execution boundaries

Common boundary-related error patterns:

Pattern 1: Missing Environment Variable

# Error: KeyError: 'AURORA_HOST'
# Symptom: Lambda invocation fails immediately

# Root cause: Boundary violation (code → runtime)
# Code expects: os.environ['AURORA_HOST']
# Runtime provides: No such variable

# Verification:
aws lambda get-function-configuration \
  --function-name [PROJECT_NAME]-worker-dev \
  --query 'Environment.Variables'

# Compare with: Code's os.environ accesses
grep "os.environ" src/lambda_handler.py
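
The comparison between configured variables and the code's accesses can be automated. The following is a rough sketch, assuming the handler source is available as a string and the configured names come from the `Environment.Variables` output above; the regex and the function name `missing_env_vars` are illustrative and only catch literal string keys.

```python
import re
from typing import Iterable, Set

# Matches os.environ['NAME'], os.environ.get('NAME', ...) and os.getenv('NAME', ...)
# with either quote style. Dynamically built names are not detected.
_ENV_ACCESS = re.compile(
    r"""os\.(?:environ\[|environ\.get\(|getenv\()\s*['"]([A-Za-z_][A-Za-z0-9_]*)['"]"""
)


def missing_env_vars(source: str, configured: Iterable[str]) -> Set[str]:
    """Return variable names the code reads that the runtime does not provide."""
    return set(_ENV_ACCESS.findall(source)) - set(configured)


if __name__ == "__main__":
    handler_source = (
        "host = os.environ['AURORA_HOST']\n"
        'stage = os.getenv("STAGE", "dev")\n'
        "port = os.environ.get('AURORA_PORT')\n"
    )
    print(missing_env_vars(handler_source, {"STAGE", "AURORA_PORT"}))
```

Variables read with a default (`os.getenv("STAGE", "dev")`) will not raise when missing, so a hit for those is a prompt to check intent rather than a confirmed bug.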

Pattern 2: Aurora Schema Mismatch

# Error: Unknown column 'pdf_s3_key' in 'field list'
# Symptom: INSERT query fails in production

# Root cause: Boundary violation (code → database)
# Code sends: INSERT INTO precomputed_reports (symbol, pdf_s3_key)
# Aurora has: No pdf_s3_key column

# Verification:
mysql> SHOW COLUMNS FROM precomputed_reports;

# Compare with: Code's INSERT statements
grep "INSERT INTO" src/data/aurora/precompute_service.py

Pattern 3: Lambda Timeout

# Error: Task timed out after 30.00 seconds
# Symptom: Lambda stops mid-execution

# Root cause: Configuration mismatch (code requirements vs entity config)
# Code requires: 60s API call + 45s processing = 105s total
# Lambda configured: 30s timeout

# Verification:
aws lambda get-function-configuration \
  --function-name [PROJECT_NAME]-worker-dev \
  --query '{Timeout:Timeout, Memory:MemorySize}'

# Analyze code execution time:
grep "requests.get.*timeout" src/ -r  # External API timeouts
# Sum: timeout values + processing overhead
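
The budget arithmetic from this pattern (60s + 45s against a 30s timeout) can be written as a quick check. This is a sketch only; the function name `timeout_shortfall` is hypothetical and the step durations are estimates you supply.

```python
from typing import Iterable


def timeout_shortfall(configured_timeout_s: float,
                      step_durations_s: Iterable[float]) -> float:
    """Return how many seconds the configured timeout falls short (0.0 if it fits)."""
    required = float(sum(step_durations_s))
    return max(0.0, required - configured_timeout_s)


if __name__ == "__main__":
    # 60s API call + 45s processing against a 30s Lambda timeout
    print(timeout_shortfall(30, [60, 45]))  # prints 75.0
```

A non-zero result means the function cannot finish in the worst case, regardless of how often it happens to succeed in practice.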

Pattern 4: Permission Denied

# Error: AccessDeniedException: User is not authorized to perform: s3:PutObject
# Symptom: S3 upload fails

# Root cause: Permission boundary violation (principal → resource)
# Code tries: s3.put_object(Bucket='reports', Key='file.pdf')
# IAM role allows: Only s3:GetObject (read-only)

# Verification (get-role-policy reads inline policies; use list-attached-role-policies for managed ones):
aws iam get-role-policy \
  --role-name [PROJECT_NAME]-worker-role-dev \
  --policy-name S3Access

# Compare with: Code's boto3 operations
grep "s3.*put_object\|s3.*upload" src/ -r
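
Once the policy document is retrieved, a first-pass check for a missing action can be scripted. The sketch below is deliberately simplified and is an assumption-laden illustration: it only looks at `Allow` statements and `Action` patterns, and ignores `Deny`, `NotAction`, `Resource`, `Condition`, and any other policy attached to the role. The function names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single item or a list for Statement and Action."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def policy_allows_action(policy: Dict[str, Any], action: str) -> bool:
    """Return True if any Allow statement's Action pattern matches the action."""
    wanted = action.lower()  # IAM action names are case-insensitive
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in _as_list(statement.get("Action")):
            if fnmatchcase(wanted, pattern.lower()):
                return True
    return False


if __name__ == "__main__":
    read_only = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": "s3:GetObject", "Resource": "*"}],
    }
    print(policy_allows_action(read_only, "s3:PutObject"))  # prints False
```

A `False` here is a strong hint; a `True` is not proof of access, since an explicit deny or a resource mismatch can still block the call.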

Pattern 5: Intention Violation

# Error: API Gateway timeout after 30 seconds
# Symptom: Client sees timeout, Lambda still processing

# Root cause: Usage doesn't match intention (sync Lambda used for async work)
# Entity designed for: Synchronous API (< 30s response)
# Code uses it for: Long-running report generation (60s)

# Verification:
# Check Terraform comments
cat terraform/lambdas.tf | grep -B 5 -A 10 "api-handler"

# Check Lambda timeout
aws lambda get-function-configuration \
  --function-name api-handler \
  --query 'Timeout'
# Compare: API Gateway 30s limit vs Lambda timeout

Boundary verification workflow for AWS errors:

1. Identify error type → Map to boundary category
   - Missing env var → Process boundary (code → runtime)
   - Schema mismatch → Data boundary (code → database)
   - Timeout → Configuration boundary (requirements → entity config)
   - Permission denied → Permission boundary (principal → resource)
   - API Gateway timeout → Intention boundary (usage → design)

2. Identify physical entities involved
   - WHICH Lambda (name, ARN)
   - WHICH Aurora cluster (endpoint, database)
   - WHICH S3 bucket (name, region)
   - WHICH IAM role (name, policies)

3. Verify contract at boundary
   - Code expectations → Infrastructure reality
   - Use aws cli to inspect actual configuration
   - Compare code requirements vs entity properties

4. Apply Progressive Evidence Strengthening
   - Layer 1 (Surface): Error message
   - Layer 2 (Content): CloudWatch logs
   - Layer 3 (Observability): AWS resource configuration
   - Layer 4 (Ground Truth): Test actual execution
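
Step 1 of the workflow above (mapping an error to a boundary category) can be sketched as a lookup over error-message patterns. The patterns are taken from the example errors in this section and are assumptions about message shape, not an exhaustive list; a `KeyError`, for instance, may also come from an ordinary dictionary rather than `os.environ`.

```python
import re
from typing import List, Pattern, Tuple

# (pattern, boundary category) pairs, checked in order.
_BOUNDARY_RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"KeyError: '\w+'"), "process boundary (code → runtime)"),
    (re.compile(r"Unknown column"), "data boundary (code → database)"),
    (re.compile(r"Task timed out after"),
     "configuration boundary (requirements → entity config)"),
    (re.compile(r"AccessDenied|not authorized to perform"),
     "permission boundary (principal → resource)"),
    (re.compile(r"API Gateway timeout"), "intention boundary (usage → design)"),
]


def boundary_category(error_message: str) -> str:
    """Return the first boundary category whose pattern matches the message."""
    for pattern, category in _BOUNDARY_RULES:
        if pattern.search(error_message):
            return category
    return "unclassified"


if __name__ == "__main__":
    print(boundary_category("Task timed out after 30.00 seconds"))
```

The result only narrows where to look; steps 2 to 4 still require inspecting the actual entities involved.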

Integration with investigation workflow:

Step 1 (Identify Error Layer): Check if error is boundary-related
Step 2 (Collect Context): Identify which boundary violated
Step 3 (Check Changes): Did code or infrastructure change?
Step 4 (Fix): Repair boundary contract (update code or infrastructure)

See: Execution Boundary Checklist for systematic AWS boundary verification

Related:

Principle #20 (Execution Boundary Discipline) - CLAUDE.md
Principle #2 (Progressive Evidence Strengthening) - Multi-layer verification
Principle #15 (Infrastructure-Application Contract) - Sync code and infra

Investigation Workflow

Step 1: Identify Error Layer (5 minutes)

# Check all three layers
aws lambda invoke --function-name worker --payload '{}' /tmp/response.json

# Layer 1: Exit code
echo "Exit code: $?"

# Layer 2: Response payload
cat /tmp/response.json | jq .

# Layer 3: CloudWatch logs
aws logs tail /aws/lambda/worker --since 5m --filter-pattern "ERROR"

Questions:

Which layer shows the error?
If Layer 1 OK but Layer 3 ERROR → Silent failure
If all layers OK but wrong result → Logic error
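
These questions amount to a small decision procedure. A minimal sketch, assuming the three layer results have already been collected (CLI exit code, parsed response payload, matching ERROR log events) plus a separate check of the real outcome; the function name and labels are hypothetical.

```python
from typing import Any, Mapping, Sequence


def triage_layers(exit_code: int,
                  payload: Mapping[str, Any],
                  error_events: Sequence[Any],
                  outcome_ok: bool) -> str:
    """Report the first layer that shows a problem, weakest signal first."""
    if exit_code != 0:
        return "layer 1: invocation failed"
    if "errorMessage" in payload:
        return "layer 2: function returned an error"
    if error_events:
        return "layer 3: silent failure (errors only in logs)"
    if not outcome_ok:
        return "logic error: all layers clean but wrong result"
    return "ok"


if __name__ == "__main__":
    print(triage_layers(0, {"result": {}}, ["[ERROR] INSERT affected 0 rows"], True))
```

`outcome_ok` stands for whatever ground-truth check applies, such as a row existing in the database or an object existing in S3.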

Step 2: Collect Error Context (10 minutes)

# Get full error details
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 3600))000 \
  --filter-pattern "ERROR" \
  --query 'events[*].[timestamp,message]' \
  --output table

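# Get surrounding context (all events within ±5s of the first error)
TS=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR" \
  --query 'events[0].timestamp' --output text)
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $((TS - 5000)) --end-time $((TS + 5000)) \
  --query 'events[*].message' --output text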

Step 3: Check Recent Changes (5 minutes)

# When did errors start?
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR" \
  --query 'events[0].timestamp' \
  --output text

# What deployed around that time?
gh run list --limit 10

# What changed in code?
git log --since="2 hours ago" --oneline
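
CloudWatch returns the first error timestamp in epoch milliseconds, which is awkward to compare against deploy times by eye. The sketch below converts it to UTC and picks the most recent deploy at or before the error; the function names and the `(name, datetime)` deploy tuples are assumptions, to be filled from whatever `gh run list` or your deploy log reports.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple


def epoch_ms_to_utc(ts_ms: int) -> datetime:
    """Convert a CloudWatch epoch-millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


def last_deploy_before(first_error_ms: int,
                       deploys: Iterable[Tuple[str, datetime]]) -> Optional[str]:
    """Return the name of the latest deploy at or before the first error, if any."""
    error_time = epoch_ms_to_utc(first_error_ms)
    earlier = [(when, name) for name, when in deploys if when <= error_time]
    return max(earlier)[1] if earlier else None


if __name__ == "__main__":
    first_error = int(datetime(2024, 1, 15, 10, 23, 45, tzinfo=timezone.utc).timestamp() * 1000)
    deploys = [
        ("deploy-a", datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)),
        ("deploy-b", datetime(2024, 1, 15, 10, 20, tzinfo=timezone.utc)),
        ("deploy-c", datetime(2024, 1, 15, 11, 0, tzinfo=timezone.utc)),
    ]
    print(last_deploy_before(first_error, deploys))  # prints deploy-b
```

The deploy times must be timezone-aware; mixing naive and aware datetimes raises a `TypeError` on comparison.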

Step 4: Reproduce and Fix (variable)

See AWS-DIAGNOSTICS.md for service-specific diagnostic patterns.

Quick Reference

Investigation Priority

1. Check CloudWatch logs (Layer 3 - strongest signal)
2. Check response payload (Layer 2 - structured errors)
3. Check status code (Layer 1 - weakest signal)
4. Verify actual outcome (database state, S3 files, etc.)

Common Failure Modes

| Symptom | Likely Cause | Investigation |
|---------|--------------|---------------|
| 200 OK but errors in logs | Silent failure | Check rowcount, verify writes |
| INFO logs not showing | Root logger level = WARNING | Set root logger to INFO |
| Timeout | Cold start, external API slow | Check duration metrics |
| Permission denied | IAM policy missing | Simulate permissions |
| 0 rows affected | FK constraint, ENUM mismatch | Check constraints |

File Organization

.claude/skills/error-investigation/
├── SKILL.md              # This file (entry point)
├── AWS-DIAGNOSTICS.md    # AWS-specific diagnostic patterns
└── LAMBDA-LOGGING.md     # Lambda logging configuration guide

Next Steps

For AWS diagnostics: See AWS-DIAGNOSTICS.md
For Lambda logging: See LAMBDA-LOGGING.md
For general debugging: See research skill
