Observability platform

ID: MOD-076
System: SD07
Repo: bank-platform
Build status: Deployed
Deployed: Yes
Last commit: bbdfbac46a1b5cf6dc25b4c7cd428a8daa669d03

The observability platform provides the platform engineering team with full visibility into the runtime behaviour of every service: distributed traces for request-level debugging, time-series metrics for system health and SLO monitoring, structured logs for error investigation, and alerting rules that page the on-call engineer when something needs immediate attention.

All services emit traces using an OpenTelemetry SDK. The observability platform collects, correlates, and stores these traces, allowing any individual API call to be followed end-to-end across all services it touched — including the time spent, any errors encountered, and the database queries executed at each step. This is the primary tool for diagnosing latency regressions and cascading failures in production.
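As an illustration of how spans are correlated by trace ID, the sketch below groups spans from several services and renders one request as an indented call tree with per-step duration and error status. It uses only the Python standard library and is not the platform's implementation: the `Span` fields, the helper names, and the sample services are assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not the OpenTelemetry data model)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of a trace
    service: str
    name: str
    start_ms: int
    end_ms: int
    error: bool = False


def group_by_trace(spans):
    """Correlate spans emitted by different services via their shared trace ID."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.trace_id].append(span)
    return dict(traces)


def render_trace(spans):
    """Return one indented line per span, parents before their children."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)

    lines = []

    def walk(parent_id, depth):
        # Visit siblings in start-time order so the output reads chronologically.
        for span in sorted(children[parent_id], key=lambda s: s.start_ms):
            status = "ERROR" if span.error else "ok"
            duration = span.end_ms - span.start_ms
            lines.append(
                f"{'  ' * depth}{span.service}:{span.name} {duration}ms {status}"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Invented sample data: two traces, one of which crosses three services.
SAMPLE_SPANS = [
    Span("t1", "a", None, "api-gateway", "POST /payments", 0, 120),
    Span("t1", "b", "a", "payments", "authorise", 10, 110),
    Span("t1", "c", "b", "payments", "SELECT accounts", 20, 45),
    Span("t1", "d", "b", "fraud", "score", 50, 100, error=True),
    Span("t2", "e", None, "api-gateway", "GET /health", 0, 3),
]
```

Calling `render_trace(group_by_trace(SAMPLE_SPANS)["t1"])` yields four lines, with the database query and the failing fraud call nested under the `authorise` step, which is the kind of view used when diagnosing a latency regression.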

Metrics cover both system-level indicators (CPU, memory, network, queue depth) and business-level indicators (payment processing rate, KYC decision latency, fraud score distribution). SLO dashboards track the bank's reliability commitments over rolling windows. Alerting routes pages to the on-call engineer via the configured channel (PagerDuty, Slack) based on severity rules. Logs are centralised with full-text search and retained for 90 days in hot storage and 7 years in cold storage for regulatory purposes.
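To make the idea of tracking a reliability commitment over a rolling window concrete, the following sketch computes SLO compliance and remaining error budget from per-interval request counts. It is a minimal example under stated assumptions: the `RollingSlo` class, the bucket-based window, and the 99% target are hypothetical and do not describe the platform's actual SLO definitions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RollingSlo:
    """Hypothetical rolling-window SLO tracker over fixed-size time buckets."""
    target: float      # e.g. 0.99 means 99% of requests must succeed
    window_size: int   # number of most recent buckets kept in the window

    def __post_init__(self):
        # deque(maxlen=...) drops the oldest bucket automatically,
        # which is what makes the window "rolling".
        self._buckets = deque(maxlen=self.window_size)

    def record(self, good: int, total: int) -> None:
        """Add one bucket of (successful requests, all requests)."""
        self._buckets.append((good, total))

    def _totals(self):
        good = sum(g for g, _ in self._buckets)
        total = sum(t for _, t in self._buckets)
        return good, total

    def compliance(self) -> float:
        """Fraction of successful requests in the current window."""
        good, total = self._totals()
        return good / total if total else 1.0

    def error_budget_remaining(self) -> float:
        """Share of the allowed failures not yet consumed (negative if overspent)."""
        good, total = self._totals()
        allowed_failures = (1.0 - self.target) * total
        if allowed_failures == 0:
            return 1.0
        return 1.0 - (total - good) / allowed_failures
```

For example, with a 99% target and a window holding 1,000 requests of which 5 failed, compliance is 99.5% and half of the error budget remains. A dashboard would typically plot both values, since the budget shows how close the service is to breaching its commitment.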


Module dependencies

Depends on

Module | Title | Required? | Contract | Reason
MOD-104 | AWS shared infrastructure bootstrap | Required | - | AWS shared infrastructure provisioned by MOD-104 (EventBridge buses, S3, KMS, Kinesis, Cognito) is required before this module can be deployed.

Required by

Module | Title | As | Contract
MOD-032 | LCR / NSFR calculator | Hard dependency | -
MOD-033 | RWA & capital ratio engine | Hard dependency | -
MOD-058 | Regulatory incident & breach notification engine | Hard dependency | -
MOD-087 | Transaction enrichment engine | Hard dependency | -
MOD-102 | Snowflake account configuration & governance | Optional enhancement | -
MOD-118 | Member equity and share registry | Hard dependency | -
MOD-150 | Risk management platform | Hard dependency | -
MOD-156 | CI/CD pipeline platform | Hard dependency | -
MOD-157 | External provider stub service | Hard dependency | -

Policies satisfied

Policy | Title | Mode | How
GOV-006 | Internal Audit Policy | LOG | Platform-level system events, errors, and performance anomalies are captured in the observability store, available for internal audit review.
DT-004 | Data Governance Policy | ALERT | Data quality anomalies detected by pipeline monitors are surfaced as observability alerts; the DT-004 obligation to detect and respond is operationalised here.

Capabilities satisfied

Capability | Title | Mode | How
CAP-123 | Distributed tracing & APM | AUTO | Collects distributed traces across all services using OpenTelemetry, correlating spans by trace ID so any request can be followed end-to-end across the platform.
CAP-124 | Metrics, alerting & log aggregation | ALERT | Collects metrics from all services, evaluates alerting rules, routes notifications to the on-call team, and aggregates logs into a searchable store with a 90-day hot retention window.
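
The rule evaluation and severity-based routing described for CAP-124 can be sketched as follows. This is an illustrative example only: the severity levels, the routing table, the rule name, and the threshold comparison are assumptions, and the channel names are plain strings rather than calls to the PagerDuty or Slack APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    """Hypothetical evaluated alert rule: a metric value against a threshold."""
    rule: str
    severity: str     # assumed levels: "critical", "warning", "info"
    value: float
    threshold: float

    @property
    def firing(self) -> bool:
        # Assumed rule shape: the alert fires when the value exceeds the threshold.
        return self.value > self.threshold


# Assumed routing table: critical alerts page the on-call engineer and are
# mirrored to chat; warnings go to chat only; anything else is not routed.
ROUTES = {
    "critical": ["pagerduty", "slack"],
    "warning": ["slack"],
}


def route(alert: Alert) -> list:
    """Return the channels that should be notified for this alert."""
    if not alert.firing:
        return []
    return list(ROUTES.get(alert.severity, []))
```

With this table, `route(Alert("kyc_decision_latency_p99_ms", "critical", 2500, 2000))` returns both channels, while the same rule below its threshold returns an empty list. Keeping the routing table as data rather than code means severity rules can be changed without touching the evaluation logic.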

Part of SD07: Data Platform & Governance Infrastructure
Compiled 2026-05-22 from source/entities/modules/MOD-076.yaml