ADR-031: Observability — OpenTelemetry, CloudWatch, and X-Ray

Status: Accepted
Date: 2026-04-10
Deciders: CTO, Head of Architecture
Affects repos: bank-core, bank-kyc, bank-aml, bank-payments, bank-credit, bank-risk-platform, bank-platform, bank-app

Status

Accepted — 2026-04-10

Context

The platform comprises 8 Lambda-heavy system domains across separate repos. Without a consistent observability strategy, correlating logs, traces, and metrics across a distributed request path (e.g. API Gateway → payments Lambda → EventBridge → AML Lambda → Snowflake write-back) is not feasible. No observability tooling was specified in prior ADRs.

Decision

Instrumentation: Powertools for AWS Lambda (TypeScript), called Lambda Powertools below, across all repos. It provides structured JSON logging, custom CloudWatch metrics, and AWS X-Ray distributed tracing with minimal boilerplate. Lambda Powertools is OpenTelemetry-compatible, so the instrumentation layer is backend-agnostic.
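To make "structured JSON logging" concrete, here is a minimal sketch of what one log line could look like. It is plain TypeScript for illustration only, not the Lambda Powertools API, and the field names (`correlation_id`, `service`, `payment_id`) are assumptions rather than a schema the ADR prescribes.

```typescript
// Illustrative only: hypothetical field names, not the Powertools log schema.
type LogLevel = "DEBUG" | "INFO" | "WARN" | "ERROR";

interface LogContext {
  service: string;       // owning repo or domain, e.g. "bank-payments"
  correlationId: string; // shared by every hop of one request
}

// Build a single structured log line as a JSON string.
function formatLogLine(
  ctx: LogContext,
  level: LogLevel,
  message: string,
  fields: Record<string, string | number | boolean> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    // Caller fields are spread first so they cannot overwrite the standard keys.
    ...fields,
    timestamp: now.toISOString(),
    level,
    service: ctx.service,
    correlation_id: ctx.correlationId,
    message,
  });
}

const line = formatLogLine(
  { service: "bank-payments", correlationId: "c-123" },
  "INFO",
  "payment accepted",
  { payment_id: "pay_001" }, // an identifier, not customer data
  new Date("2026-04-10T00:00:00.000Z"),
);
console.log(line);
```

Because every line is a single JSON object carrying the same keys, queries can filter on `correlation_id` across log groups from different repos.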

Backend at launch: AWS CloudWatch (logs and metrics) and AWS X-Ray (distributed traces). Covered by existing Lambda and API Gateway pricing — zero marginal cost at dev and early production volumes.
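Custom metrics can reach CloudWatch without a separate API call, because a log entry written in CloudWatch Embedded Metric Format (EMF) is turned into a metric by CloudWatch. The sketch below builds such an entry by hand to show its shape; in practice the Powertools metrics utility would produce it. The namespace `BankPlatform`, the `service` dimension, and the metric name `PaymentAccepted` are hypothetical.

```typescript
// Shape of a CloudWatch Embedded Metric Format (EMF) log entry.
interface EmfEntry {
  _aws: {
    Timestamp: number; // epoch milliseconds
    CloudWatchMetrics: Array<{
      Namespace: string;
      Dimensions: string[][]; // each inner array is one dimension set
      Metrics: Array<{ Name: string; Unit: string }>;
    }>;
  };
  [key: string]: unknown; // dimension values and metric values live at the root
}

// Build an EMF entry for one count metric with a single "service" dimension.
function buildCountMetric(
  namespace: string,
  service: string,
  metricName: string,
  value: number,
  timestampMs: number,
): EmfEntry {
  return {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [["service"]],
          Metrics: [{ Name: metricName, Unit: "Count" }],
        },
      ],
    },
    service,             // value for the "service" dimension
    [metricName]: value, // value for the metric declared above
  };
}

// Hypothetical names, for illustration only.
const entry = buildCountMetric(
  "BankPlatform",
  "bank-payments",
  "PaymentAccepted",
  1,
  Date.UTC(2026, 3, 10),
);
console.log(JSON.stringify(entry));
```

Printing the entry to standard output from a Lambda function is enough for CloudWatch to pick it up, since Lambda forwards standard output to the function's log group.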

Upgrade path: Grafana Cloud when CloudWatch becomes operationally limiting. Grafana Cloud is OTel-native (Loki for logs, Tempo for traces, Prometheus/Mimir for metrics), has a generous free tier (50 GB of logs per month), and accepts OTel exports without application changes. Swapping the backend is an exporter configuration change, with no re-instrumentation across repos.
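One way to picture "an exporter configuration change" is as a difference in environment variables only. The sketch below assumes each function exports telemetry over OTLP and reads its target from the standard OpenTelemetry environment variables; the ADR does not specify this wiring. The helper `exporterEnv`, both endpoints, and the header value are placeholders.

```typescript
// Assumption: telemetry is exported over OTLP and the target is read from
// the standard OpenTelemetry environment variables.
interface OtlpTarget {
  endpoint: string;                 // base OTLP endpoint URL
  headers?: Record<string, string>; // optional headers, e.g. for authentication
}

// Produce the environment variables for one function and one backend target.
function exporterEnv(serviceName: string, target: OtlpTarget): Record<string, string> {
  const env: Record<string, string> = {
    OTEL_SERVICE_NAME: serviceName,
    OTEL_EXPORTER_OTLP_ENDPOINT: target.endpoint,
    OTEL_EXPORTER_OTLP_PROTOCOL: "http/protobuf",
  };
  const headers = target.headers;
  if (headers) {
    // Comma-separated key=value pairs; some SDKs expect percent-encoded values.
    const pairs: string[] = [];
    for (const name in headers) pairs.push(name + "=" + headers[name]);
    if (pairs.length > 0) env.OTEL_EXPORTER_OTLP_HEADERS = pairs.join(",");
  }
  return env;
}

// Placeholder targets, not real endpoints or credentials.
const launch = exporterEnv("bank-payments", { endpoint: "http://localhost:4318" });
const later = exporterEnv("bank-payments", {
  endpoint: "https://otlp.example.invalid",
  headers: { authorization: "<token>" },
});
console.log(launch, later);
```

Under these assumptions the handler code is identical in both cases, and only the deployed environment differs.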

Standards all repos must follow

| Requirement | Detail |
| --- | --- |
| Structured logging | JSON only; no unstructured log lines |
| Correlation ID | Propagated through all Lambda hops and EventBridge events |
| No PII in logs or traces | Customer data referenced by ID only; no names, account numbers, or payment details in logs |
| Log retention | CloudWatch log group retention set explicitly: 90 days prod, 30 days non-prod (PRI-003) |
| Error alerting | CloudWatch Alarm on Lambda error rate > 1% per domain, routed to operations channel |
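The correlation ID and no-PII requirements can be sketched together. This example assumes the ID arrives in an `x-correlation-id` HTTP header and travels inside the EventBridge event detail under `metadata.correlationId`. Both names, the envelope shape, and the helper functions are hypothetical, not conventions defined by this ADR. The payload refers to the customer by identifier only.

```typescript
// Hypothetical header name and envelope shape.
const CORRELATION_HEADER = "x-correlation-id";

interface EventEnvelope<T> {
  metadata: { correlationId: string };
  data: T;
}

// First hop: reuse the caller's ID if present, otherwise mint a new one.
function resolveCorrelationId(
  headers: Record<string, string | undefined>,
  mint: () => string,
): string {
  // HTTP header names are case-insensitive, so compare in lower case.
  for (const name in headers) {
    const value = headers[name];
    if (name.toLowerCase() === CORRELATION_HEADER && value) return value;
  }
  return mint();
}

// Publisher side: wrap the payload so the ID travels inside the event detail.
function toEventDetail<T>(correlationId: string, data: T): string {
  const envelope: EventEnvelope<T> = { metadata: { correlationId }, data };
  return JSON.stringify(envelope);
}

// Consumer side: recover the ID from the already-parsed event detail.
function correlationIdFromDetail(detail: unknown): string {
  const metadata = (detail as { metadata?: { correlationId?: unknown } } | null)?.metadata;
  const id = metadata?.correlationId;
  if (typeof id !== "string" || id.length === 0) {
    throw new Error("event detail is missing metadata.correlationId");
  }
  return id;
}

const id = resolveCorrelationId({ "X-Correlation-Id": "c-123" }, () => "generated-id");
// Identifiers only: no names, account numbers, or payment details.
const detail = toEventDetail(id, { paymentId: "pay_001", customerId: "cus_42" });
// JSON.parse stands in for EventBridge delivering the detail to the consumer.
const received = correlationIdFromDetail(JSON.parse(detail));
console.log(received); // c-123
```

Failing loudly when the ID is missing is a design choice in this sketch; a consumer could instead mint a fresh ID and log a warning, at the cost of a broken trace.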

Rejected alternatives

| Option | Reason rejected |
| --- | --- |
| Datadog | Unpredictable cost model at scale; vendor lock-in on instrumentation if OTel not enforced from the start |
| Honeycomb | Strong distributed tracing UX; cost not justified at launch volume |
| CloudWatch only (no OTel layer) | Locks instrumentation to AWS; migrating any backend requires re-instrumentation across all repos |

Consequences

Zero marginal cost at launch. OTel instrumentation protects against backend lock-in. Lambda Powertools enforces consistent log structure and trace propagation across all repos. Migration to Grafana Cloud is a configuration change, not a code change.



Signoff record

| Date | Name | Role | Status |
| --- | --- | --- | --- |
| 2026-04-10 | Ross Millen | CTO | Approved |
| 2026-04-10 | Ross Millen | Head of Architecture | Approved |
| 2026-04-10 | Ross Millen | Head of Data | Approved |

Capabilities

| Capability | Description | Relationship |
| --- | --- | --- |
| CAP-123 | Distributed tracing & APM | enabled: AWS X-Ray distributed tracing with OpenTelemetry instrumentation |
| CAP-124 | Metrics, alerting & log aggregation | enabled: CloudWatch metrics and alerting; structured JSON logs |

| ADR | Title | Relationship |
| --- | --- | --- |
| ADR-025 | API layer: HTTP API Gateway and SST | Lambda execution model that is instrumented |
| ADR-029 (superseded by ADR-051; see ADR-051 for current EventBridge bus naming convention) | Domain event routing via Amazon EventBridge | correlation ID must propagate across EventBridge events |

Compiled 2026-05-22 from source/entities/adrs/ADR-031.yaml