Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

Three Pillars

| Pillar | Purpose | Question It Answers | Example | |--------|---------|---------------------|---------| | Logs | What happened | Why did this request fail? | {"level":"error","msg":"payment declined","user_id":"u_82"} | | Metrics | How much / how fast | Is latency increasing? | http_request_duration_seconds{route="/api/orders"} 0.342 | | Traces | Request flow | Where is the bottleneck? | Span: api-gateway → auth → order-service → db |

Each pillar is strongest when correlated. Embed trace_id in every log line to jump from a log entry to the full distributed trace.

Structured Logging

Always emit logs as structured JSON — never free-text strings.

Required Fields

| Field | Purpose | Required | |-------|---------|----------| | timestamp | ISO-8601 with milliseconds | Yes | | level | Severity (DEBUG … FATAL) | Yes | | service | Originating service name | Yes | | message | Human-readable description | Yes | | trace_id | Distributed trace correlation | Yes | | span_id | Current span within trace | Yes | | correlation_id | Business-level correlation (order ID) | When applicable | | error | Structured error object | On errors | | context | Request-specific metadata | Recommended |

Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

app.use((req, res, next) => {
  const ctx = {
    trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
    request_id: crypto.randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  asyncLocalStorage.run(ctx, () => next());
});

Library Recommendations

| Library | Language | Strengths | Perf | |---------|----------|-----------|------| | Pino | Node.js | Fastest Node logger, low overhead | Excellent | | structlog | Python | Composable processors, context binding | Good | | zerolog | Go | Zero-allocation JSON logging | Excellent | | zap | Go | High performance, typed fields | Excellent | | tracing | Rust | Spans + events, async-aware | Excellent |

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.

Log Levels

| Level | When to Use | Example | |-------|-------------|---------| | FATAL | App cannot continue, process will exit | Database connection pool exhausted | | ERROR | Operation failed, needs attention | Payment charge failed: CARD_DECLINED | | WARN | Unexpected but recoverable | Retry 2/3 for upstream timeout | | INFO | Normal business events | Order ORD-1234 placed successfully | | DEBUG | Developer troubleshooting | Cache miss for key user:82:preferences | | TRACE | Very fine-grained (rarely in prod) | Entering validateAddress with payload |

Rules: Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.

Distributed Tracing

OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

Span Creation

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      span.recordException(err);
      throw err;
    } finally {
      span.end();
    }
  });
}

Context Propagation

Use W3C Trace Context (traceparent header) — default in OTel
Propagate across HTTP, gRPC, and message queues
For async workers: serialise traceparent into the job payload

Trace Sampling

| Strategy | Use When | |----------|----------| | Always On | Low-traffic services, debugging | | Probabilistic (N%) | General production use | | Rate-limited (N/sec) | High-throughput services | | Tail-based | When you need all error traces |

Always sample 100% of error traces regardless of strategy.

Metrics Collection

RED Method (Request-Driven)

Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example | |--------|-----------------|-------------------| | Rate | Requests/sec | rate(http_requests_total[5m]) | | Errors | Failed request ratio | rate(http_requests_total{status=~"5.."}[5m]) | | Duration | Response time | histogram_quantile(0.99, http_request_duration_seconds) |

USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example | |--------|-----------------|---------| | Utilization | % resource busy | CPU usage at 78% | | Saturation | Work queued/waiting | 12 requests queued in thread pool | | Errors | Error events on resource | 3 disk I/O errors in last minute |

Monitoring Stack

| Tool | Category | Best For | |------|----------|----------| | Prometheus | Metrics | Pull-based metrics, alerting rules | | Grafana | Visualisation | Dashboards for metrics, logs, traces | | Jaeger | Tracing | Distributed trace visualisation | | Loki | Logs | Log aggregation (pairs with Grafana) | | OpenTelemetry | Collection | Vendor-neutral telemetry collection |

Recommendation: Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.

Alert Design

Severity Levels

| Severity | Response Time | Example | |----------|---------------|---------| | P1 | Immediate | Service fully down, data loss | | P2 | < 30 min | Error rate > 5%, latency p99 > 5s | | P3 | Business hours | Disk > 80%, cert expiring in 7 days | | P4 | Best effort | Non-critical deprecation warning |

Alert Fatigue Prevention

Alert on symptoms, not causes — "error rate > 5%" not "pod restarted"
Multi-window, multi-burn-rate — catch both sudden spikes and slow burns
Require runbook links — every alert must link to diagnosis and remediation
Review monthly — delete or tune alerts that never fire or always fire
Group related alerts — use inhibition rules to suppress child alerts
Set appropriate thresholds — if alert fires daily and is ignored, raise threshold or delete

Dashboard Patterns

Overview Dashboard ("War Room")

Total requests/sec across all services
Global error rate (%) with trendline
p50 / p95 / p99 latency
Active alerts count by severity
Deployment markers overlaid on graphs

Service Dashboard (Per-Service)

RED metrics for each endpoint
Dependency health (upstream/downstream success rates)
Resource utilisation (CPU, memory, connections)
Top errors table with count and last seen

Observability Checklist

Every service must have:

[ ] Structured JSON logging with consistent schema
[ ] Correlation / trace IDs propagated on all requests
[ ] RED metrics exposed for every external endpoint
[ ] Health check endpoints (/healthz and /readyz)
[ ] Distributed tracing with OpenTelemetry
[ ] Dashboards for RED metrics and resource utilisation
[ ] Alerts for error rate, latency, and saturation with runbook links
[ ] Log level configurable at runtime without redeployment
[ ] PII scrubbing verified and tested
[ ] Retention policies defined for logs, metrics, and traces

Anti-Patterns

| Anti-Pattern | Problem | Fix | |-------------|---------|-----| | Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references | | Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow | | Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema | | String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message | | Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere | | Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication | | Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |

NEVER Do

NEVER log passwords, tokens, API keys, or secrets — even at DEBUG level
NEVER use console.log / print in production — use a structured logger
NEVER use user IDs, emails, or request IDs as metric labels — cardinality will explode
NEVER create alerts without a runbook link — unactionable alerts erode trust
NEVER rely on logs alone — you need metrics and traces for full observability
NEVER log request/response bodies by default — opt-in only, with PII redaction
NEVER ignore log volume — set budgets and alert when a service exceeds daily quota
NEVER skip context propagation in async flows — broken traces are worse than no traces