Category: Development and Engineering (no API key required)

error-investigation

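AWS error investigation, including multi-layer verification, CloudWatch analysis, and Lambda log patterns. Use it when debugging AWS service failures, investigating production errors, or troubleshooting Lambda functions.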

Author: jakexiaohub (GitHub)

Error Investigation Skill

Tech Stack: AWS CLI, CloudWatch Logs, Lambda, boto3, jq

Source: Extracted from CLAUDE.md error investigation principles and AWS diagnostic patterns.


When to Use This Skill

Use the error-investigation skill when:

  • ✓ AWS service returning errors
  • ✓ Lambda function failing in production
  • ✓ CloudWatch logs showing errors
  • ✓ Service completed but operation failed
  • ✓ Silent failures (no exception but wrong result)
  • ✓ Investigating production incidents

DO NOT use this skill for:

  • ✗ Local Python debugging (use debugger instead)
  • ✗ Code refactoring (use refactor skill)
  • ✗ Performance optimization (use different skill)

Quick Investigation Decision Tree

What's failing?
├─ Lambda function?
│  ├─ Returns 200 but errors? → Check CloudWatch logs (Layer 3)
│  ├─ Timeout? → Check duration metrics + external dependencies
│  ├─ Permission denied? → Check IAM role policies
│  └─ Cold start slow? → Module-level initialization pattern
│
├─ AWS service operation?
│  ├─ DynamoDB write succeeded (200) but no data? → Read the item back (consistent read), check for swallowed exceptions
│  ├─ S3 upload succeeded but file missing? → Check bucket policy
│  ├─ SQS message sent but not received? → Check DLQ
│  └─ Step Function succeeded but workflow incomplete? → Check state outputs
│
├─ External API call?
│  ├─ Timeout? → Check network path (security groups, VPC)
│  ├─ 403 Forbidden? → Check API key, rate limits
│  ├─ 500 Error? → Check API status page, retry logic
│  └─ Silent failure? → Inspect response payload
│
└─ Database query?
   ├─ INSERT affected 0 rows? → FK constraint, ENUM mismatch
   ├─ SELECT returns empty? → Check WHERE clause, data exists
   ├─ Connection timeout? → Security group, VPC routing
   └─ Query slow? → Missing index, full table scan

Loop Pattern: Retrying Loop → Synchronize Loop

Escalation Trigger:

  • /trace shows root cause
  • Fix applied, /validate shows success
  • But error recurs later (knowledge drift)

Tools Used:

  • /trace - Find root cause (backward trace from error)
  • /validate - Verify fix works (test the solution)
  • /consolidate - Update knowledge base (documentation, runbooks)
  • /observe - Monitor for recurring issues (drift detection)
  • /reflect - Assess if error represents pattern vs one-off

Why This Works: Error investigation fits the retrying loop (find the root cause, fix the execution), but a recurring error triggers the synchronize loop (update knowledge and documentation).

See Thinking Process Architecture - Feedback Loops for structural overview.


Core Investigation Principles

Principle 1: Execution Completion ≠ Operational Success

From CLAUDE.md:

"Execution completion ≠ Operational success. Verify actual outcomes across multiple layers, not just the absence of exceptions."

Why This Matters:

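import json
import time

import boto3

lambda_client = boto3.client('lambda')
logs_client = boto3.client('logs')  # CloudWatch Logs API

# ❌ WRONG: Assumes 200 = success
response = lambda_client.invoke(FunctionName='worker', Payload='{}')
assert response['StatusCode'] == 200  # ✗ Weak validation

# ✅ RIGHT: Multi-layer verification
started_ms = int(time.time() * 1000)  # approximate start of this invocation
response = lambda_client.invoke(FunctionName='worker', Payload='{}')

# Layer 1: Status code
assert response['StatusCode'] == 200

# Layer 2: Response payload (FunctionError is set when the handler raised)
assert 'FunctionError' not in response
payload = json.loads(response['Payload'].read())
assert not (isinstance(payload, dict) and 'errorMessage' in payload)

# Layer 3: CloudWatch logs since this invocation (delivery can lag a few seconds)
logs = logs_client.filter_log_events(
    logGroupName='/aws/lambda/worker',
    startTime=started_ms,
    filterPattern='ERROR'
)
assert len(logs['events']) == 0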

Note: This is the AWS-specific application of Progressive Evidence Strengthening (CLAUDE.md Principle #2). The general pattern applies across all domains—here we show how it manifests in AWS Lambda/API debugging.

Principle 2: Multi-Layer Verification (AWS Application)

The Three Layers:

| Layer | Signal Strength | What It Tells You | What It DOESN'T Tell You |
|-------|----------------|-------------------|--------------------------|
| Status Code | Weakest | Service responded | Whether it succeeded |
| Response Payload | Stronger | Function returned data | Whether logs show errors |
| CloudWatch Logs | Strongest | What actually happened | Future issues |

Pattern:

# Layer 1: Status code (weakest)
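# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload
aws lambda invoke --function-name worker --payload '{}' \
  --cli-binary-format raw-in-base64-out /tmp/response.json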
echo "Exit code: $?"  # 0 = AWS CLI succeeded

# Layer 2: Response payload (stronger)
if grep -q "errorMessage" /tmp/response.json; then
  echo "❌ Lambda returned error"
  exit 1
fi

# Layer 3: CloudWatch logs (strongest)
ERROR_COUNT=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 120))000 \
  --filter-pattern "ERROR" \
  --query 'length(events)' --output text)

if [ "$ERROR_COUNT" -gt 0 ]; then
  echo "❌ Found errors in CloudWatch logs"
  exit 1
fi

echo "✅ All 3 layers verified"

See AWS-DIAGNOSTICS.md for AWS-specific diagnostic patterns.
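The three layers can also be expressed as one small decision function. The following is a minimal sketch in plain Python with no AWS calls: `first_failing_layer` and its parameter names are hypothetical, and the caller is assumed to have already collected the status code, the `FunctionError` field, the decoded payload, and the count of ERROR log events.

```python
from typing import Any, Optional


def first_failing_layer(
    status_code: int,
    function_error: Optional[str],
    payload: Any,
    error_log_count: int,
) -> Optional[str]:
    """Return the first layer that shows a failure, or None if all layers pass."""
    # Layer 1 (weakest): did the service respond successfully at all?
    if status_code != 200:
        return "status_code"
    # Layer 2: did the function itself report an error in its response?
    if function_error is not None:
        return "response_payload"
    if isinstance(payload, dict) and "errorMessage" in payload:
        return "response_payload"
    # Layer 3 (strongest): were errors logged although the call "succeeded"?
    if error_log_count > 0:
        return "cloudwatch_logs"
    return None
```

A result of `"cloudwatch_logs"` alongside a 200 status and a clean payload is the silent-failure case this skill is mainly concerned with.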

Principle 3: Log Level Determines Discoverability

From CLAUDE.md:

"Log levels are not just severity indicators—they determine whether failures are discoverable by monitoring systems."

Log Level Impact:

| Log Level | Monitored? | Alerted? | Discoverable? |
|-----------|------------|----------|---------------|
| ERROR | ✅ Yes | ✅ Yes | ✅ Dashboards |
| WARNING | ✅ Yes | ❌ No | ⚠️ Manual review |
| INFO | ⚠️ Maybe | ❌ No | ❌ Active search |
| DEBUG | ❌ No | ❌ No | ❌ Hidden |

Investigation Pattern:

# Step 1: Check ERROR level first
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR"

# Step 2: If no ERRORs but operation failed → Check WARNING
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "WARNING"

# Step 3: Check both application AND service logs
# - Application logs: /aws/lambda/worker
# - Service logs: Lambda execution errors, timeouts

Why This Matters:

# ❌ BAD: Error logged at WARNING (invisible to monitoring)
try:
    result = db.execute(query, params)
    if result == 0:
        logger.warning("INSERT failed")  # ⚠️  Not monitored!
except Exception as e:
    logger.warning(f"DB error: {e}")  # ⚠️  Not alerted!

# ✅ GOOD: Error logged at ERROR (visible to monitoring)
try:
    result = db.execute(query, params)  # assumed to return the affected row count
except Exception as e:
    logger.error(f"DB error: {e}")  # ✅ Alerted
    raise

# Checked outside the try block so the failure is logged once, not twice
if result == 0:
    logger.error("INSERT failed - 0 rows affected")  # ✅ Monitored
    raise ValueError("Insert operation failed")
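To make the table concrete, here is a small sketch of why a failure logged at WARNING never reaches an alerting pipeline. `AlertCollector` is a hypothetical stand-in for a monitoring system that only reacts to ERROR and above. It is not a CloudWatch API, and real alarm thresholds depend on how your metric filters are configured.

```python
import logging
from typing import List


class AlertCollector(logging.Handler):
    """Hypothetical stand-in for an alerting pipeline that only sees ERROR and above."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.alerts: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.alerts.append(f"{record.levelname}: {record.getMessage()}")


def demo_alerts() -> List[str]:
    logger = logging.getLogger("discoverability_demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep the demo isolated from the root logger
    collector = AlertCollector()
    logger.addHandler(collector)
    try:
        logger.warning("INSERT failed")                  # below the handler's level
        logger.error("INSERT failed - 0 rows affected")  # reaches the collector
    finally:
        logger.removeHandler(collector)
    return collector.alerts
```

Both messages describe the same failure, but only the one logged at ERROR ends up where an on-call engineer would see it.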

Principle 4: Lambda Logging Configuration

From CLAUDE.md:

"AWS Lambda pre-configures logging before your code runs. Never use logging.basicConfig() in Lambda handlers—it's a no-op."

The Problem:

# ❌ This does NOTHING in Lambda
import logging

logging.basicConfig(level=logging.INFO)  # No-op!
logger = logging.getLogger(__name__)
logger.info("Invisible in CloudWatch")  # Filtered out

Why It Fails:

  • Lambda runtime adds handlers to root logger BEFORE your code runs
  • basicConfig() only takes effect if the root logger has NO handlers (unless force=True is passed on Python 3.8+, which removes the existing handlers)
  • Result: INFO-level logs are invisible
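The behaviour described in these bullets can be reproduced locally without Lambda. The sketch below simulates a pre-configured runtime by attaching a handler to the root logger before calling `basicConfig()`. The function name `demo_basic_config_noop` and the logger name are illustrative, and the root logger's state is restored afterwards.

```python
import io
import logging


def demo_basic_config_noop():
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    stream = io.StringIO()
    try:
        # Simulate a runtime that attached its own handler before user code ran
        root.handlers = [logging.StreamHandler(stream)]
        root.setLevel(logging.WARNING)

        logging.basicConfig(level=logging.INFO)  # ignored: root already has a handler
        level_after_basic_config = root.level

        demo = logging.getLogger("basic_config_demo")
        demo.info("before setLevel")  # dropped: effective level is still WARNING
        root.setLevel(logging.INFO)
        demo.info("after setLevel")   # emitted
    finally:
        # Put the root logger back the way we found it
        root.handlers = saved_handlers
        root.setLevel(saved_level)
    return level_after_basic_config, stream.getvalue()
```

The returned level is still `logging.WARNING` after `basicConfig()`, and only the second message appears in the captured output.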

The Solution:

# ✅ Works in both Lambda and local dev
import logging

root_logger = logging.getLogger()

if root_logger.handlers:  # Lambda (already configured)
    root_logger.setLevel(logging.INFO)
else:  # Local dev (needs configuration)
    logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)
logger.info("Visible in CloudWatch")  # ✅ Works

See LAMBDA-LOGGING.md for comprehensive Lambda logging patterns.


Common Investigation Scenarios

Scenario 1: Lambda Returns 200 But Has Errors

Symptom: Function completes, returns 200, but errors in logs.

Investigation Steps:

# 1. Invoke function
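# (AWS CLI v2 needs --cli-binary-format raw-in-base64-out for a raw JSON payload)
aws lambda invoke \
  --function-name worker \
  --cli-binary-format raw-in-base64-out \
  --payload '{"ticker": "NVDA19"}' \
  /tmp/response.json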

# 2. Check response (Layer 2)
cat /tmp/response.json
# Output: {"result": {...}}  # Looks fine

# 3. Check CloudWatch logs (Layer 3)
aws logs tail /aws/lambda/worker --since 1m --filter-pattern "ERROR"

# Output:
# [ERROR] 2024-01-15 10:23:45 INSERT affected 0 rows for NVDA19
# [ERROR] 2024-01-15 10:23:46 FK constraint violation: symbol not found

Root Cause: Silent database failure. The INSERT affected 0 rows and the failure was logged at ERROR, but the exception was caught and swallowed, so the handler still returned success.

Fix:

# Before:
def store_report(self, symbol, report):
    try:
        self.db.execute(query, params)
        return True  # ❌ Always returns True
    except Exception as e:
        logger.error(f"DB error: {e}")
        return True  # ❌ Still returns True!

# After:
def store_report(self, symbol, report):
    # No blanket except: database exceptions now propagate to the caller
    rowcount = self.db.execute(query, params)  # assumed to return the affected row count
    if rowcount == 0:
        logger.error(f"INSERT affected 0 rows for {symbol}")
        return False  # ✅ Returns False on failure
    return True
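The rowcount check can be tried locally. The sketch below uses an in-memory SQLite database in place of Aurora, so it is an illustration of the pattern rather than the production code. The `symbols` and `reports` tables and the `make_demo_db` helper are hypothetical, and the `WHERE EXISTS` clause stands in for the foreign-key check that produced 0 rows in the scenario.

```python
import logging
import sqlite3

logger = logging.getLogger(__name__)


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with one known symbol."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE symbols (symbol TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE reports (symbol TEXT, report TEXT)")
    conn.execute("INSERT INTO symbols VALUES ('NVDA19')")
    return conn


def store_report(conn: sqlite3.Connection, symbol: str, report: str) -> bool:
    """Insert a report only if the symbol exists; return False when 0 rows are written."""
    cursor = conn.execute(
        "INSERT INTO reports (symbol, report) "
        "SELECT ?, ? WHERE EXISTS (SELECT 1 FROM symbols WHERE symbol = ?)",
        (symbol, report, symbol),
    )
    conn.commit()
    if cursor.rowcount == 0:
        # The statement ran without raising, yet nothing was stored
        logger.error("INSERT affected 0 rows for %s", symbol)
        return False
    return True
```

Calling `store_report` with an unknown symbol raises no exception and still writes nothing, which is exactly the case a bare `return True` hides.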

Scenario 2: INFO Logs Not Showing in CloudWatch

Symptom: logger.info() calls not appearing in CloudWatch.

Investigation Steps:

# 1. Check whether any INFO events reached CloudWatch
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 300))000 \
  --filter-pattern "INFO"

# No results (but INFO logs exist in code)

# 2. Check root logger configuration
# Add to Lambda handler:
import logging
print(f"Root logger level: {logging.getLogger().level}")
print(f"Root logger handlers: {logging.getLogger().handlers}")

Root Cause: Root logger set to WARNING, filters out INFO.

Fix:

# handler.py (entry point)
import logging

# Configure logging at module level
root_logger = logging.getLogger()

if root_logger.handlers:  # Lambda environment
    root_logger.setLevel(logging.INFO)  # ✅ Set root logger level
else:  # Local development
    logging.basicConfig(level=logging.INFO)

logger = logging.getLogger(__name__)

def lambda_handler(event, context):
    logger.info("Handler invoked")  # Now visible
    # ...

See LAMBDA-LOGGING.md#troubleshooting for complete debugging guide.

Scenario 3: Lambda Timeout with Network Operations

Symptom: Lambda times out after long execution (600s+), logs show "PDF generation..." but no completion message.

Investigation Steps:

# 1. Check execution duration pattern
aws logs filter-log-events \
  --log-group-name /aws/lambda/pdf-worker \
  --filter-pattern "Duration:" \
  --query 'events[*].message' \
  | grep -o "Duration: [0-9]*" \
  | sort -k2 -n  # sort numerically on the value; "Billed Duration" and "Init Duration" also match

# Look for pattern:
# - First 5 requests: Duration: 2-3s
# - Last 5 requests: Duration: 600s+ (timeout)

# 2. Check for connection timeout errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/pdf-worker \
  --filter-pattern "ConnectTimeoutError" \
  --query 'events[*].message'

# Output:
# botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL:
# "https://bucket.s3.region.amazonaws.com/..."

# 3. Analyze timeline (deterministic vs random)
aws logs tail /aws/lambda/pdf-worker --since 30m | \
  grep -E "START RequestId|✅ PDF job completed|ConnectTimeoutError" | \
  awk '{print $1, $2, $NF}' | sort

# Deterministic pattern (first N succeed, last M fail) = infrastructure bottleneck
# Random pattern (scattered failures) = performance issue

Root Cause Analysis:

# 4. Check VPC configuration
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.region.s3"

# If empty → No S3 VPC Endpoint (traffic goes through NAT Gateway)

# 5. Verify NAT Gateway routing
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[*].Routes[?GatewayId!=`local`]'

# If 0.0.0.0/0 routes to nat-xxx → S3 traffic shares the NAT Gateway, which can saturate under many concurrent connections

Root Cause: NAT Gateway connection saturation. When many Lambdas upload to S3 concurrently:

  • The NAT Gateway has a limited connection establishment rate
  • The first N connections succeed (2-3s upload time)
  • The remaining M connections queue and time out (the 600s total is boto3 connection timeouts plus retries)
  • The pattern is deterministic (the first N always succeed, the last M always fail)

Fix:

# terraform/s3_vpc_endpoint.tf
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.default.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"

  route_table_ids = data.aws_route_tables.vpc_route_tables.ids

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
    }]
  })
}

Why This Works:

  • S3 Gateway Endpoint adds routes to VPC route tables
  • S3 traffic bypasses NAT Gateway (direct AWS network path)
  • No NAT Gateway connection limits on the S3 path
  • FREE (Gateway endpoints have no hourly charge)
  • Roughly 200x faster for the affected requests (2-3s vs 600s timeout)

Verification:

# 1. Deploy VPC endpoint
cd terraform && terraform apply

# 2. Verify endpoint created
terraform output s3_vpc_endpoint_state  # Should be "available"

# 3. Test full workflow
aws stepfunctions start-execution \
  --state-machine-arn <pdf-workflow-arn> \
  --input '{"report_date":"2026-01-05"}'

# 4. Monitor for 100% success rate
aws logs tail /aws/lambda/pdf-worker --follow

# Expected: All PDFs complete in 2-3s, no timeouts

Critical Insight: Execution Time ≠ Hang Location

  • 600s execution time doesn't mean code hangs for 600s
  • It means ENTIRE execution (including network timeout) took 600s
  • Check stack traces (Layer 3) to find WHERE timeout occurs
  • Don't assume "logs stop at line X" = "code hangs at line X" (log lines still buffered can be lost when the Lambda is killed on timeout)

Pattern Recognition:

  • Deterministic failure (first N succeed, last M fail) → Infrastructure bottleneck (NAT, VPC endpoint)
  • Random failure (scattered across all attempts) → Performance issue (slow API, memory pressure)
  • All fail → Configuration issue (missing permissions, wrong endpoint)
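These three rules can be written down as a rough heuristic. The sketch below assumes you have already reduced the log timeline to an ordered list of booleans, one per invocation in start order. `classify_failure_pattern` is a hypothetical helper, and a single run is weak evidence, so treat the label as a hint about where to look first.

```python
from typing import Sequence


def classify_failure_pattern(outcomes: Sequence[bool]) -> str:
    """Classify an ordered run of invocation outcomes (True = success)."""
    results = list(outcomes)
    if not results or all(results):
        return "no failures"
    if not any(results):
        return "all fail: configuration issue"
    first_failure = results.index(False)
    # Deterministic: once failures start, nothing succeeds afterwards
    if not any(results[first_failure:]):
        return "deterministic: infrastructure bottleneck"
    return "random: performance issue"
```

For example, five successes followed by five timeouts is labelled deterministic, which points at a shared resource such as the NAT Gateway rather than at slow code.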

See Bug Hunt Report for complete investigation.

Scenario 4: DynamoDB PutItem Succeeds But No Data

Symptom: put_item() returns 200, but item not in table.

Investigation Steps:

# 1. Check response
response = table.put_item(Item={'ticker': 'NVDA19', 'data': {...}})
print(f"HTTP Status: {response['ResponseMetadata']['HTTPStatusCode']}")
# Output: 200

# 2. Verify item exists
response = table.get_item(Key={'ticker': 'NVDA19'}, ConsistentRead=True)  # rule out eventual consistency
print(response.get('Item'))
# Output: None (no item!)

# 3. Check for conditional write
response = table.put_item(
    Item={'ticker': 'NVDA19', 'data': {...}},
    ConditionExpression='attribute_not_exists(ticker)'  # ← Condition failed?
)

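Root Cause: The conditional write was rejected. A failed condition raises ConditionalCheckFailedException rather than returning 200, so the failure only looks silent when calling code catches and swallows that exception. If an unconditional put_item returned 200, re-check the key, table, and region, and repeat the read with ConsistentRead=True.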

Fix:

# Before:
response = table.put_item(Item=item)  # ❌ No verification

# After:
import botocore.exceptions

try:
    response = table.put_item(Item=item)

    # Verify write with a strongly consistent read (the default read is eventually consistent)
    verify = table.get_item(Key={'ticker': item['ticker']}, ConsistentRead=True)
    if 'Item' not in verify:
        logger.error(f"Item not found after put_item: {item['ticker']}")
        raise ValueError("DynamoDB write verification failed")

except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
        # Only raised when put_item is called with a ConditionExpression
        logger.warning(f"Conditional write failed: {item['ticker']}")
    else:
        logger.error(f"DynamoDB error: {e}")
        raise

AWS Boundary Verification

When to apply: Distributed system errors (Lambda, Aurora, S3, SQS, Step Functions)

Problem: Code looks correct locally but fails in AWS due to unverified execution boundaries

Common boundary-related error patterns:

Pattern 1: Missing Environment Variable

# Error: KeyError: 'AURORA_HOST'
# Symptom: Lambda invocation fails immediately

# Root cause: Boundary violation (code → runtime)
# Code expects: os.environ['AURORA_HOST']
# Runtime provides: No such variable

# Verification:
aws lambda get-function-configuration \
  --function-name [PROJECT_NAME]-worker-dev \
  --query 'Environment.Variables'

# Compare with: Code's os.environ accesses
grep "os.environ" src/lambda_handler.py
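The comparison between the two outputs above can be automated. This is a minimal sketch that assumes you have the handler source as a string and the `Environment.Variables` map as a dict. `missing_env_vars` is a hypothetical helper, and it treats every access as required, including `os.environ.get()` calls that supply a default.

```python
import re
from typing import Dict, Set

# Matches os.environ['X'], os.environ.get("X") and os.getenv('X')
ENV_ACCESS = re.compile(
    r"""os\.(?:environ\[|environ\.get\(|getenv\()\s*['"]([A-Za-z_][A-Za-z0-9_]*)['"]"""
)


def missing_env_vars(source: str, configured: Dict[str, str]) -> Set[str]:
    """Return variable names the code reads that the function config does not define."""
    required = set(ENV_ACCESS.findall(source))
    return required - set(configured)
```

An empty result means the code-to-runtime contract holds for the names the regex can see. Variable names built dynamically are not detected.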

Pattern 2: Aurora Schema Mismatch

# Error: Unknown column 'pdf_s3_key' in 'field list'
# Symptom: INSERT query fails in production

# Root cause: Boundary violation (code → database)
# Code sends: INSERT INTO precomputed_reports (symbol, pdf_s3_key)
# Aurora has: No pdf_s3_key column

# Verification:
mysql> SHOW COLUMNS FROM precomputed_reports;

# Compare with: Code's INSERT statements
grep "INSERT INTO" src/data/aurora/precompute_service.py
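The same code-versus-database comparison can be sketched in a few lines. The helper below parses the column list of a single simple `INSERT` statement and reports names the table does not have. `unknown_columns` and the example column set are hypothetical, and the regex is deliberately naive (it ignores multi-statement strings, quoted parentheses, and `INSERT ... SET` syntax).

```python
import re
from typing import List, Set

INSERT_COLUMNS = re.compile(r"INSERT\s+INTO\s+(\w+)\s*\(([^)]*)\)", re.IGNORECASE)


def unknown_columns(sql: str, table_columns: Set[str]) -> List[str]:
    """Return columns named in the INSERT that are absent from the table."""
    match = INSERT_COLUMNS.search(sql)
    if match is None:
        return []
    # Split the column list and drop MySQL-style backticks
    columns = [c.strip().strip("`") for c in match.group(2).split(",")]
    return [c for c in columns if c and c not in table_columns]
```

Feed `table_columns` from the `SHOW COLUMNS` output and `sql` from the statements that `grep` finds. A non-empty result is the schema mismatch.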

Pattern 3: Lambda Timeout

# Error: Task timed out after 30.00 seconds
# Symptom: Lambda stops mid-execution

# Root cause: Configuration mismatch (code requirements vs entity config)
# Code requires: 60s API call + 45s processing = 105s total
# Lambda configured: 30s timeout

# Verification:
aws lambda get-function-configuration \
  --function-name [PROJECT_NAME]-worker-dev \
  --query '{Timeout:Timeout, Memory:MemorySize}'

# Analyze code execution time:
grep "requests.get.*timeout" src/ -r  # External API timeouts
# Sum: timeout values + processing overhead
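The arithmetic in this pattern is simple enough to encode as a check. The sketch below sums worst-case step durations and compares them with the configured timeout. `timeout_shortfall` and the step names are hypothetical, and the durations are the illustrative figures from the comment above, not measurements.

```python
from typing import Dict


def timeout_shortfall(configured_timeout_s: float, step_budgets_s: Dict[str, float]) -> float:
    """Return how many seconds the worst case exceeds the timeout (0 if it fits)."""
    worst_case = sum(step_budgets_s.values())
    return max(0.0, worst_case - configured_timeout_s)


# 60s API call + 45s processing = 105s needed, against a 30s Lambda timeout
shortfall = timeout_shortfall(30, {"api_call": 60, "processing": 45})
print(f"Short by {shortfall:.0f}s")
```

A positive shortfall means the function cannot finish in the worst case, so either raise the Lambda timeout or tighten the external call timeouts.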

Pattern 4: Permission Denied

# Error: AccessDeniedException: User is not authorized to perform: s3:PutObject
# Symptom: S3 upload fails

# Root cause: Permission boundary violation (principal → resource)
# Code tries: s3.put_object(Bucket='reports', Key='file.pdf')
# IAM role allows: Only s3:GetObject (read-only)

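# Verification (get-role-policy reads inline policies only;
# use list-attached-role-policies for managed policies):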
aws iam get-role-policy \
  --role-name [PROJECT_NAME]-worker-role-dev \
  --policy-name S3Access

# Compare with: Code's boto3 operations
grep "s3.*put_object\|s3.*upload" src/ -r

Pattern 5: Intention Violation

# Error: API Gateway timeout after 30 seconds
# Symptom: Client sees timeout, Lambda still processing

# Root cause: Usage doesn't match intention (sync Lambda used for async work)
# Entity designed for: Synchronous API (< 30s response)
# Code uses it for: Long-running report generation (60s)

# Verification:
# Check Terraform comments
cat terraform/lambdas.tf | grep -B 5 -A 10 "api-handler"

# Check Lambda timeout
aws lambda get-function-configuration \
  --function-name api-handler \
  --query 'Timeout'
# Compare: API Gateway 30s limit vs Lambda timeout

Boundary verification workflow for AWS errors:

1. Identify error type → Map to boundary category
   - Missing env var → Process boundary (code → runtime)
   - Schema mismatch → Data boundary (code → database)
   - Timeout → Configuration boundary (requirements → entity config)
   - Permission denied → Permission boundary (principal → resource)
   - API Gateway timeout → Intention boundary (usage → design)

2. Identify physical entities involved
   - WHICH Lambda (name, ARN)
   - WHICH Aurora cluster (endpoint, database)
   - WHICH S3 bucket (name, region)
   - WHICH IAM role (name, policies)

3. Verify contract at boundary
   - Code expectations → Infrastructure reality
   - Use aws cli to inspect actual configuration
   - Compare code requirements vs entity properties

4. Apply Progressive Evidence Strengthening
   - Layer 1 (Surface): Error message
   - Layer 2 (Content): CloudWatch logs
   - Layer 3 (Observability): AWS resource configuration
   - Layer 4 (Ground Truth): Test actual execution
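Step 1 of this workflow, mapping an error message to a boundary category, can be sketched as a lookup over the five example messages from the patterns above. `classify_boundary` and its pattern list are hypothetical and intentionally coarse. For instance, not every `KeyError` is a missing environment variable, so treat the result as a starting hypothesis.

```python
import re
from typing import Optional

# Ordered so the more specific message wins (API Gateway before generic timeout)
BOUNDARY_PATTERNS = [
    (re.compile(r"API Gateway timeout", re.IGNORECASE), "intention"),
    (re.compile(r"Task timed out after"), "configuration"),
    (re.compile(r"not authorized to perform|AccessDenied"), "permission"),
    (re.compile(r"Unknown column"), "data"),
    (re.compile(r"KeyError"), "process"),
]


def classify_boundary(error_message: str) -> Optional[str]:
    """Map an error message to a boundary category, or None if nothing matches."""
    for pattern, boundary in BOUNDARY_PATTERNS:
        if pattern.search(error_message):
            return boundary
    return None
```

The returned category tells you which contract to inspect in step 3, for example IAM policies for `"permission"` or the function configuration for `"configuration"`.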

Integration with investigation workflow:

  • Step 1 (Identify Error Layer): Check if error is boundary-related
  • Step 2 (Collect Context): Identify which boundary violated
  • Step 3 (Check Changes): Did code or infrastructure change?
  • Step 4 (Fix): Repair boundary contract (update code or infrastructure)

See: Execution Boundary Checklist for systematic AWS boundary verification

Related:

  • Principle #20 (Execution Boundary Discipline) - CLAUDE.md
  • Principle #2 (Progressive Evidence Strengthening) - Multi-layer verification
  • Principle #15 (Infrastructure-Application Contract) - Sync code and infra

Investigation Workflow

Step 1: Identify Error Layer (5 minutes)

# Check all three layers
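# (AWS CLI v2 needs --cli-binary-format raw-in-base64-out for a raw JSON payload)
aws lambda invoke --function-name worker --payload '{}' \
  --cli-binary-format raw-in-base64-out /tmp/response.json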

# Layer 1: Exit code
echo "Exit code: $?"

# Layer 2: Response payload
cat /tmp/response.json | jq .

# Layer 3: CloudWatch logs
aws logs tail /aws/lambda/worker --since 5m --filter-pattern "ERROR"

Questions:

  • Which layer shows the error?
  • If Layer 1 OK but Layer 3 ERROR → Silent failure
  • If all layers OK but wrong result → Logic error

Step 2: Collect Error Context (10 minutes)

# Get full error details
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 3600))000 \
  --filter-pattern "ERROR" \
  --query 'events[*].[timestamp,message]' \
  --output table

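# Get surrounding context (±5 log lines around each ERROR)
# No filter pattern here: fetch every event in the window, then grep with context
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --start-time $(($(date +%s) - 3600))000 \
  | jq -r '.events[].message' \
  | grep -C 5 "ERROR"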

Step 3: Check Recent Changes (5 minutes)

# When did errors start? (prints the first matching event's timestamp in epoch milliseconds)
aws logs filter-log-events \
  --log-group-name /aws/lambda/worker \
  --filter-pattern "ERROR" \
  --query 'events[0].timestamp' \
  --output text

# What deployed around that time?
gh run list --limit 10

# What changed in code?
git log --since="2 hours ago" --oneline
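CloudWatch Logs reports event timestamps as epoch milliseconds, which are awkward to compare with deployment times by eye. A small conversion helper, sketched here with a hypothetical name `epoch_ms_to_iso`, turns the value from the first command into a UTC timestamp you can line up against `gh run list` and `git log` output.

```python
from datetime import datetime, timezone


def epoch_ms_to_iso(timestamp_ms: int) -> str:
    """Convert a CloudWatch Logs timestamp (epoch milliseconds) to ISO 8601 in UTC."""
    return datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc).isoformat()


print(epoch_ms_to_iso(1705314225000))  # 2024-01-15T10:23:45+00:00
```

Keeping everything in UTC avoids off-by-hours mistakes when the deploy tooling and your terminal use different time zones.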

Step 4: Reproduce and Fix (variable)

See AWS-DIAGNOSTICS.md for service-specific diagnostic patterns.


Quick Reference

Investigation Priority

  1. Check CloudWatch logs (Layer 3 - strongest signal)
  2. Check response payload (Layer 2 - structured errors)
  3. Check status code (Layer 1 - weakest signal)
  4. Verify actual outcome (database state, S3 files, etc.)

Common Failure Modes

| Symptom | Likely Cause | Investigation |
|---------|--------------|---------------|
| 200 OK but errors in logs | Silent failure | Check rowcount, verify writes |
| INFO logs not showing | Root logger level = WARNING | Set root logger to INFO |
| Timeout | Cold start, external API slow | Check duration metrics |
| Permission denied | IAM policy missing | Simulate permissions |
| 0 rows affected | FK constraint, ENUM mismatch | Check constraints |


File Organization

.claude/skills/error-investigation/
├── SKILL.md              # This file (entry point)
├── AWS-DIAGNOSTICS.md    # AWS-specific diagnostic patterns
└── LAMBDA-LOGGING.md     # Lambda logging configuration guide
