返回 Skill 列表
extension
分类: 数据与分析无需 API Key

Logging Observability

用于构建可观测系统的结构化日志、分布式追踪和指标采集模式。适用于实现日志基础设施、OpenTelemetry 分布式追踪、指标采集设计(RED/USE 方法)、告警仪表盘配置及可观测性审查。涵盖结构化 JSON 日志、上下文传播、链路采样、Prometheus/Grafana 技术栈、告警设计及敏感信息脱敏。

person作者: wpankhubclawhub

Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

| Pillar | Purpose | Question It Answers | Example | |--------|---------|---------------------|---------| | Logs | What happened | Why did this request fail? | {"level":"error","msg":"payment declined","user_id":"u_82"} | | Metrics | How much / how fast | Is latency increasing? | http_request_duration_seconds{route="/api/orders"} 0.342 | | Traces | Request flow | Where is the bottleneck? | Span: api-gateway → auth → order-service → db |

Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.


Structured Logging

Always emit logs as structured JSON — never free-text strings.

Required Fields

| Field | Purpose | Required | |-------|---------|----------| | timestamp | ISO-8601 with milliseconds | Yes | | level | Severity (DEBUG … FATAL) | Yes | | service | Originating service name | Yes | | message | Human-readable description | Yes | | trace_id | Distributed trace correlation | Yes | | span_id | Current span within trace | Yes | | correlation_id | Business-level correlation (order ID) | When applicable | | error | Structured error object | On errors | | context | Request-specific metadata | Recommended |

Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});

Library Recommendations

| Library | Language | Strengths | Perf | |---------|----------|-----------|------| | Pino | Node.js | Fastest Node logger, low overhead | Excellent | | structlog | Python | Composable processors, context binding | Good | | zerolog | Go | Zero-allocation JSON logging | Excellent | | zap | Go | High performance, typed fields | Excellent | | tracing | Rust | Spans + events, async-aware | Excellent |

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.


Log Levels

| Level | When to Use | Example | |-------|-------------|---------| | FATAL | App cannot continue, process will exit | Database connection pool exhausted | | ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED | | WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout | | INFO | Normal business events | Order ORD-1234 placed successfully | | DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences | | TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |

Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.


Distributed Tracing

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

Span Creation

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}

Context Propagation

  • Use W3C Trace Context (traceparent header) — default in OTel
  • Propagate across HTTP, gRPC, and message queues
  • For async workers: serialise traceparent into the job payload

Trace Sampling

| Strategy | Use When | |----------|----------| | Always On | Low-traffic services, debugging | | Probabilistic (N%) | General production use | | Rate-limited (N/sec) | High-throughput services | | Tail-based | When you need all error traces |

Always sample 100% of error traces regardless of strategy.


Metrics Collection

RED Method (Request-Driven)

Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example | |--------|-----------------|-------------------| | Rate | Requests/sec | rate(http_requests_total[5m]) | | Errors | Failed request ratio | rate(http_requests_total{status=~"5.."}[5m]) | | Duration | Response time | histogram_quantile(0.99, http_request_duration_seconds) |

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example | |--------|-----------------|---------| | Utilization | % resource busy | CPU usage at 78% | | Saturation | Work queued/waiting | 12 requests queued in thread pool | | Errors | Error events on resource | 3 disk I/O errors in last minute |


Monitoring Stack

| Tool | Category | Best For | |------|----------|----------| | Prometheus | Metrics | Pull-based metrics, alerting rules | | Grafana | Visualisation | Dashboards for metrics, logs, traces | | Jaeger | Tracing | Distributed trace visualisation | | Loki | Logs | Log aggregation (pairs with Grafana) | | OpenTelemetry | Collection | Vendor-neutral telemetry collection |

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.


Alert Design

Severity Levels

| Severity | Response Time | Example | |----------|---------------|---------| | P1 | Immediate | Service fully down, data loss | | P2 | < 30 min | Error rate > 5%, latency p99 > 5s | | P3 | Business hours | Disk > 80%, cert expiring in 7 days | | P4 | Best effort | Non-critical deprecation warning |

Alert Fatigue Prevention

  • Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
  • Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
  • Require runbook links — every alert must link to diagnosis and remediation
  • Review monthly — delete or tune alerts that never fire or always fire
  • Group related alerts — use inhibition rules to suppress child alerts
  • Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete

Dashboard Patterns

Overview Dashboard ("War Room")

  • Total requests/sec across all services
  • Global error rate (%) with trendline
  • p50 / p95 / p99 latency
  • Active alerts count by severity
  • Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

  • RED metrics for each endpoint
  • Dependency health (upstream/downstream success rates)
  • Resource utilisation (CPU, memory, connections)
  • Top errors table with count and last seen

Observability Checklist

Every service must have:

  • [ ] Structured JSON logging with consistent schema
  • [ ] Correlation / trace IDs propagated on all requests
  • [ ] RED metrics exposed for every external endpoint
  • [ ] Health check endpoints (/healthz and /readyz)
  • [ ] Distributed tracing with OpenTelemetry
  • [ ] Dashboards for RED metrics and resource utilisation
  • [ ] Alerts for error rate, latency, and saturation with runbook links
  • [ ] Log level configurable at runtime without redeployment
  • [ ] PII scrubbing verified and tested
  • [ ] Retention policies defined for logs, metrics, and traces

Anti-Patterns

| Anti-Pattern | Problem | Fix | |-------------|---------|-----| | Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references | | Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow | | Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema | | String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message | | Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere | | Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication | | Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |

NEVER Do

  1. NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
  2. NEVER use console.log / print in production — use a structured logger
  3. NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
  4. NEVER create alerts without a runbook link — unactionable alerts erode trust
  5. NEVER rely on logs alone — you need metrics and traces for full observability
  6. NEVER log request/response bodies by default — opt-in only, with PII redaction
  7. NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
  8. NEVER skip context propagation in async flows — broken traces are worse than no traces