ADR-031: Observability — OpenTelemetry, CloudWatch, and X-Ray
| Status | Accepted |
| Date | 2026-04-10 |
| Deciders | CTO, Head of Architecture |
| Affects repos | bank-core, bank-kyc, bank-aml, bank-payments, bank-credit, bank-risk-platform, bank-platform, bank-app |
Status
Accepted — 2026-04-10
Context
The platform comprises 8 Lambda-heavy system domains across separate repos. Without a consistent observability strategy, correlating logs, traces, and metrics across a distributed request path (e.g. API Gateway → payments Lambda → EventBridge → AML Lambda → Snowflake write-back) is not feasible. No observability tooling was specified in prior ADRs.
Decision
Instrumentation: AWS Lambda Powertools for TypeScript across all repos. Provides structured JSON logging, custom CloudWatch metrics, and AWS X-Ray distributed tracing with minimal boilerplate. Lambda Powertools is OpenTelemetry-compatible — the instrumentation layer is backend-agnostic.
Backend at launch: AWS CloudWatch (logs and metrics) and AWS X-Ray (distributed traces). Both are billed separately from Lambda and API Gateway, but their free-tier allowances are expected to cover dev and early production volumes, so marginal cost at launch is effectively zero.
Upgrade path: Grafana Cloud when CloudWatch becomes operationally limiting. Grafana Cloud is OTel-native (Loki for logs, Tempo for traces, Prometheus/Mimir for metrics), has a generous free tier (50GB logs/month), and accepts OTel exports without application changes. Swapping the backend is an exporter configuration change — no re-instrumentation across repos.
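The "exporter configuration change" can be pictured as a small resolver that picks the telemetry backend from environment variables. This is a minimal sketch, not the platform's actual wiring. `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` are the standard OpenTelemetry exporter variables, but `resolveExporterConfig`, the `ExporterConfig` shape, and the `"aws"` / `"otlp"` labels are hypothetical names introduced for illustration.

```typescript
// Hypothetical config shape: not defined by the ADR.
type TelemetryBackend = "aws" | "otlp";

interface ExporterConfig {
  backend: TelemetryBackend;
  // Only set when telemetry is exported over OTLP (e.g. to Grafana Cloud).
  otlpEndpoint?: string;
  headers: Record<string, string>;
}

// Decide the backend purely from environment variables, so that switching
// backends needs a deployment config change rather than an application change.
function resolveExporterConfig(
  env: Record<string, string | undefined>
): ExporterConfig {
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  if (!endpoint) {
    // No OTLP endpoint configured: stay on the launch backend (CloudWatch and X-Ray).
    return { backend: "aws", headers: {} };
  }

  // OTEL_EXPORTER_OTLP_HEADERS is a comma-separated list of key=value pairs.
  const headers: Record<string, string> = {};
  for (const pair of (env["OTEL_EXPORTER_OTLP_HEADERS"] ?? "").split(",")) {
    const separator = pair.indexOf("=");
    if (separator > 0) {
      headers[pair.slice(0, separator).trim()] = pair.slice(separator + 1).trim();
    }
  }
  return { backend: "otlp", otlpEndpoint: endpoint, headers };
}
```

In a Lambda this would typically be called once at cold start with `process.env`. With no OTLP variables set it returns the AWS backend, which matches the launch configuration.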
Standards all repos must follow
| Requirement | Detail |
|---|---|
| Structured logging | JSON only — no unstructured log lines |
| Correlation ID | Propagated through all Lambda hops and EventBridge events |
| No PII in logs or traces | Customer data referenced by ID only: no names, account numbers, or payment details in logs or traces |
| Log retention | CloudWatch log group retention set explicitly — 90 days prod, 30 days non-prod (PRI-003) |
| Error alerting | CloudWatch Alarm on Lambda error rate > 1% per domain, routed to operations channel |
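The first three standards (JSON only, a correlation ID on every line, no PII) can be shown together in one small formatter. This is a sketch of the intended log shape and does not call Lambda Powertools. The field names in `PII_FIELD_NAMES`, the `formatLogLine` and `redactPii` helpers, and the `correlationId` key are all assumptions; the ADR names categories of PII, not a concrete deny-list.

```typescript
// Hypothetical deny-list: the ADR lists PII categories, not field names.
const PII_FIELD_NAMES: readonly string[] = ["name", "accountNumber", "cardNumber", "iban"];

type LogLevel = "INFO" | "WARN" | "ERROR";

// Recursively replace the values of deny-listed keys so they never reach the log sink.
function redactPii(value: unknown): unknown {
  if (Array.isArray(value)) {
    return value.map(redactPii);
  }
  if (value !== null && typeof value === "object") {
    const source = value as Record<string, unknown>;
    const cleaned: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      cleaned[key] =
        PII_FIELD_NAMES.indexOf(key) >= 0 ? "[REDACTED]" : redactPii(source[key]);
    }
    return cleaned;
  }
  return value;
}

// Build a single JSON log line that always carries the correlation ID.
function formatLogLine(
  level: LogLevel,
  message: string,
  correlationId: string,
  data: Record<string, unknown> = {}
): string {
  return JSON.stringify({
    level,
    message,
    correlationId,
    timestamp: new Date().toISOString(),
    data: redactPii(data),
  });
}
```

Calling `formatLogLine("INFO", "payment accepted", "corr-123", { customerId: "cus_42", name: "Jane Doe" })` keeps `customerId` and replaces `name` with `"[REDACTED]"`. A key-based deny-list is only a safety net; the primary control remains not passing customer data to the logger in the first place.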
Rejected alternatives
| Option | Reason rejected |
|---|---|
| Datadog | Unpredictable cost model at scale; vendor lock-in on instrumentation if OTel not enforced from the start |
| Honeycomb | Strong distributed tracing UX; cost not justified at launch volume |
| CloudWatch only (no OTel layer) | Locks instrumentation to AWS — migrating any backend requires re-instrumentation across all repos |
Consequences
Zero marginal cost at launch. OTel instrumentation protects against backend lock-in. Lambda Powertools enforces consistent log structure and trace propagation across all repos. Migration to Grafana Cloud is a configuration change, not a code change.
Signoff record
| Date | Name | Role | Status |
|---|---|---|---|
| 2026-04-10 | Ross Millen | CTO | Approved |
| 2026-04-10 | Ross Millen | Head of Architecture | Approved |
| 2026-04-10 | Ross Millen | Head of Data | Approved |
Capabilities
| Capability | Description | Relationship |
|---|---|---|
| CAP-123 | Distributed tracing & APM | enabled — AWS X-Ray distributed tracing with OpenTelemetry instrumentation |
| CAP-124 | Metrics, alerting & log aggregation | enabled — CloudWatch metrics and alerting; structured JSON logs |
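The alerting part of CAP-124, combined with the standard of alarming when the Lambda error rate exceeds 1%, could be expressed as a CloudWatch metric-math alarm. The sketch below only builds the alarm definition as a plain object and makes no AWS call. Its field names follow the CloudWatch `PutMetricAlarm` input, while `buildErrorRateAlarm`, the 5-minute period, the single evaluation period, and the missing-data treatment are assumptions. It is also scoped to one function, whereas the ADR asks for one alarm per domain.

```typescript
// Subset of the CloudWatch PutMetricAlarm input used by this sketch.
interface MetricDataQuery {
  Id: string;
  ReturnData: boolean;
  Expression?: string;
  Label?: string;
  MetricStat?: {
    Metric: {
      Namespace: string;
      MetricName: string;
      Dimensions: { Name: string; Value: string }[];
    };
    Period: number;
    Stat: string;
  };
}

interface ErrorRateAlarm {
  AlarmName: string;
  ComparisonOperator: string;
  Threshold: number;
  EvaluationPeriods: number;
  TreatMissingData: string;
  AlarmActions: string[];
  Metrics: MetricDataQuery[];
}

// Hypothetical helper: alarm when errors exceed 1% of invocations.
function buildErrorRateAlarm(functionName: string, alarmTopicArn: string): ErrorRateAlarm {
  // Sum of a built-in AWS/Lambda metric for one function.
  const lambdaSum = (id: string, metricName: string): MetricDataQuery => ({
    Id: id,
    ReturnData: false,
    MetricStat: {
      Metric: {
        Namespace: "AWS/Lambda",
        MetricName: metricName,
        Dimensions: [{ Name: "FunctionName", Value: functionName }],
      },
      Period: 300, // assumed 5-minute window
      Stat: "Sum",
    },
  });

  return {
    AlarmName: `${functionName}-error-rate`,
    ComparisonOperator: "GreaterThanThreshold",
    Threshold: 1, // percent, per the ADR
    EvaluationPeriods: 1, // assumed
    TreatMissingData: "notBreaching", // assumed: idle periods count as healthy
    AlarmActions: [alarmTopicArn],
    Metrics: [
      lambdaSum("errors", "Errors"),
      lambdaSum("invocations", "Invocations"),
      {
        Id: "errorRate",
        Expression: "100 * errors / invocations",
        Label: "Error rate (%)",
        ReturnData: true,
      },
    ],
  };
}
```

Using a ratio rather than a raw error count keeps the threshold meaningful across functions with very different traffic. A per-domain alarm would need to sum `Errors` and `Invocations` across that domain's functions before dividing.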
Related decisions
| ADR | Title | Relationship |
|---|---|---|
| ADR-025 | API layer — HTTP API Gateway and SST | Lambda execution model that is instrumented |
| ADR-029 (superseded by ADR-051; see ADR-051 for current EventBridge bus naming convention) | Domain event routing via Amazon EventBridge | correlation ID must propagate across EventBridge events |
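Propagating the correlation ID across EventBridge means the producer has to place it inside the event and the consumer has to read it back, since EventBridge does not carry request headers between Lambdas. A minimal sketch follows, assuming a `metadata.correlationId` field inside the event detail. That envelope layout, the helper names, and the example source and detail-type values are hypothetical. `Source`, `DetailType`, and `Detail` are the real `PutEvents` entry fields; the bus name is omitted because naming is governed by ADR-051.

```typescript
// Hypothetical envelope: the ADR requires propagation but does not fix a layout.
interface DomainEventEnvelope<T> {
  metadata: { correlationId: string };
  data: T;
}

// Subset of an EventBridge PutEvents entry.
interface PutEventsEntry {
  Source: string;
  DetailType: string;
  Detail: string; // JSON-encoded envelope
}

// Producer side: wrap the payload so the correlation ID travels with the event.
function toPutEventsEntry<T>(
  source: string,
  detailType: string,
  correlationId: string,
  data: T
): PutEventsEntry {
  const envelope: DomainEventEnvelope<T> = { metadata: { correlationId }, data };
  return { Source: source, DetailType: detailType, Detail: JSON.stringify(envelope) };
}

// Consumer side: EventBridge delivers `detail` as an already-parsed object.
// Fall back to a fresh ID only when the producer did not supply one.
function correlationIdFrom(
  delivered: { detail?: unknown },
  fallback: () => string
): string {
  const detail = delivered.detail as
    | { metadata?: { correlationId?: unknown } }
    | undefined;
  const id = detail?.metadata?.correlationId;
  return typeof id === "string" && id.length > 0 ? id : fallback();
}
```

A payments Lambda would pass the entry from `toPutEventsEntry("bank.payments", "PaymentAccepted", correlationId, payload)` to its EventBridge client, and the AML Lambda would call `correlationIdFrom(event, generateId)` before logging anything, where `generateId` is whatever ID generator the repo already uses.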
Compiled 2026-05-22 from source/entities/adrs/ADR-031.yaml