MOD-076: Observability platform
System: SD07 | Repo: bank-platform | Phase: 1
Purpose
Provides centralised observability for every bank service:
- structured-log ingestion via existing /aws/bank-platform/* log groups
- custom metrics via CloudWatch EMF in the bank/modules namespace
- distributed tracing via the ADOT Lambda Layer -> X-Ray
- service-health dashboards (one per system domain + one platform overview)
- SLO alarms with 2-minute page latency
- 90-day hot retention + cold archival to the MOD-104 iceberg bucket under the observability-archive/ prefix
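
The custom-metrics item relies on CloudWatch Embedded Metric Format (EMF): a JSON line written to stdout that carries an `_aws` metadata block. The following is a minimal sketch of such a line, not the platform's actual helper. Only the `bank/modules` namespace comes from this page; the `service` dimension, the `latency_ms` metric name, and the `buildEmfRecord` function are assumptions made for illustration.

```typescript
// Sketch: build one EMF log line for the bank/modules namespace.
// The "service" dimension and the metric name are illustrative assumptions.
type EmfUnit = "Milliseconds" | "Count" | "Percent";

interface EmfRecord {
  _aws: {
    Timestamp: number; // epoch milliseconds
    CloudWatchMetrics: Array<{
      Namespace: string;
      Dimensions: string[][];
      Metrics: Array<{ Name: string; Unit: EmfUnit }>;
    }>;
  };
  [key: string]: unknown; // dimension and metric values sit at the top level
}

function buildEmfRecord(
  service: string,
  metricName: string,
  value: number,
  unit: EmfUnit,
  nowMs: number = Date.now(),
): EmfRecord {
  return {
    _aws: {
      Timestamp: nowMs,
      CloudWatchMetrics: [
        {
          Namespace: "bank/modules",
          Dimensions: [["service"]],
          Metrics: [{ Name: metricName, Unit: unit }],
        },
      ],
    },
    service,
    [metricName]: value,
  };
}

// In Lambda, printing the JSON to stdout is enough for CloudWatch to extract the metric.
console.log(JSON.stringify(buildEmfRecord("payments-api", "latency_ms", 123, "Milliseconds")));
```

Because the metric travels inside the ordinary log stream, it is also picked up by the same subscription filters and archival path as any other structured log line.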
Architecture
Lambda (MOD-xxx)
  stdout JSON -> CloudWatch Logs (/aws/bank-platform/{env})
    | subscription filter: level="ERROR" (mod076-platform-error-sub)
    | subscription filter: event_type="data_quality_anomaly_detected"
    |     -> alarm-router Lambda -> SNS (MOD-104 alerts topic)
    |
    +> subscription filter (empty pattern) -> Firehose
         -> s3://iceberg/observability-archive/... (KMS: financial)

CloudWatch Alarms (error rate, p99 latency, DQ anomaly)
  -> mod076-alarm-intake (SNS) -> alarm-router Lambda
       -> classify severity -> MOD-104 alerts SNS topic
       // TODO PagerDuty integration (dev stops at SNS)
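
The "classify severity" hop in the flow above could look roughly like the sketch below. CloudWatch alarm notifications arrive as a JSON string in the SNS `Message` field with `AlarmName` and `NewStateValue` among their keys; everything else here is an assumption. The severity names, the name-based rules, and the example alarm names are hypothetical and do not describe the real `alarm-router/index.ts`.

```typescript
// Sketch of the "classify severity" step. Severity labels and the
// name-matching rules are assumptions, not the actual alarm-router logic.
type Severity = "critical" | "warning" | "info";

interface AlarmMessage {
  AlarmName: string;
  NewStateValue: "ALARM" | "OK" | "INSUFFICIENT_DATA";
}

function classifySeverity(alarm: AlarmMessage): Severity {
  // Recoveries and missing-data transitions are informational only.
  if (alarm.NewStateValue !== "ALARM") return "info";
  // Hypothetical rule: error-rate and data-quality alarms are critical.
  if (/error-rate|data-quality/.test(alarm.AlarmName.toLowerCase())) return "critical";
  return "warning"; // everything else, e.g. p99 latency
}

// Parse the JSON string that SNS delivers in the Message field.
function parseAlarm(snsMessage: string): AlarmMessage {
  const parsed = JSON.parse(snsMessage) as Partial<AlarmMessage>;
  if (typeof parsed.AlarmName !== "string" || typeof parsed.NewStateValue !== "string") {
    throw new Error("not a CloudWatch alarm notification");
  }
  return { AlarmName: parsed.AlarmName, NewStateValue: parsed.NewStateValue };
}
```

Keeping classification as a pure function separate from parsing makes it straightforward to add a PagerDuty branch later without touching the SNS handling.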
Stack overview
| Stack | Resources | Policy |
|---|---|---|
| otel-collector | SSM params exposing the managed ADOT Lambda Layer ARN and collector YAML | — |
| subscription-filters | alarm-router Lambda + IAM role, intake SNS topic, structured-log pattern filters, data-quality pattern filter | — |
| slo-alarms | Error-rate metric-math alarm, p99 latency alarm, SSM threshold params | FR-306 |
| dashboards | Platform overview + per-domain CloudWatch dashboards | FR-308 |
| log-archival | Subscription filter -> Firehose -> iceberg bucket, X-Ray group, platform log-group retention 90d | FR-307 |
| log-immutability | IAM deny policy attached to every bank-platform runtime role | GOV-006 |
| data-quality-alerts | CloudWatch alarm on bank/data-quality#anomaly_total | DT-004 |
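
To make the log-immutability row more concrete, here is a hedged sketch of what an IAM deny policy document of this kind might contain. The action list, the `Sid`, and the resource pattern are assumptions; the deployed policy may deny a different set of actions. The four `logs:` actions named are real CloudWatch Logs IAM actions.

```typescript
// Sketch of an IAM deny policy document for log immutability.
// The chosen actions and the resource ARN pattern are assumptions.
interface PolicyStatement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

function buildLogImmutabilityPolicy(region: string, accountId: string): PolicyDocument {
  // Matches every platform log group under /aws/bank-platform/.
  const logGroups = `arn:aws:logs:${region}:${accountId}:log-group:/aws/bank-platform/*`;
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyLogMutation",
        Effect: "Deny",
        Action: [
          "logs:DeleteLogGroup",
          "logs:DeleteLogStream",
          "logs:DeleteRetentionPolicy",
          "logs:PutRetentionPolicy",
        ],
        // The second entry spells out the stream-level ARN form explicitly.
        Resource: [logGroups, `${logGroups}:*`],
      },
    ],
  };
}
```

An explicit deny attached to each runtime role overrides any allow those roles carry, which is why the policy can be attached broadly without auditing each role's grants.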
SSM outputs
| SSM path | Value | Consumed by |
|---|---|---|
| /bank/{env}/observability/adot-collector-amd64-arn | ADOT collector-only Lambda Layer (AMD64) | AMD64 Lambdas with own SDK |
| /bank/{env}/observability/adot-collector-arm64-arn | ADOT collector-only Lambda Layer (ARM64 / Graviton) | ARM64 Lambdas with own SDK |
| /bank/{env}/observability/adot-nodejs-amd64-arn | ADOT Node.js auto-instrumentation Lambda Layer (AMD64); set AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-handler | AMD64 Node.js Lambdas |
| /bank/{env}/observability/adot-nodejs-arm64-arn | ADOT Node.js auto-instrumentation Lambda Layer (ARM64) | ARM64 Node.js Lambdas |
| /bank/{env}/observability/adot-layer-arn | DEPRECATED alias of adot-collector-amd64-arn | legacy consumers (migrate) |
| /bank/{env}/observability/adot-nodejs-layer-arn | DEPRECATED alias of adot-nodejs-amd64-arn | legacy consumers (migrate) |
| /bank/{env}/observability/collector-config | ADOT collector YAML | Lambda extensions |
| /bank/{env}/observability/archive-firehose-arn | Firehose delivery stream ARN | FR-307 tests, audit |
| /bank/{env}/observability/xray-group-arn | X-Ray group ARN | services emitting traces |
| /bank/{env}/observability/alarm-router-arn | Lambda ARN for the alarm-router | MOD-043, MOD-063 |
| /bank/{env}/observability/alarm-intake-topic-arn | SNS intake topic ARN | other modules routing custom alarms |
| /bank/{env}/sns/alarm-intake/arn | Alias of observability/alarm-intake-topic-arn under the cross-cutting /sns/ namespace | Snowflake-side alerts (MOD-041 model-drift, MOD-030/065 DLQ depth) via SNOWFLAKE.NOTIFICATION.GET_PARAMETER |
| /bank/{env}/observability/platform-dashboard-arn | Dashboard ARN | engineering / on-call |
| /bank/{env}/observability/dashboards/{domain}/arn | Dashboard ARN per domain | each system domain |
| /bank/{env}/observability/alarms/error-rate-arn | Alarm ARN | FR-306 tests |
| /bank/{env}/observability/alarms/p99-latency-arn | Alarm ARN | FR-306 tests |
| /bank/{env}/observability/alarms/data-quality-arn | Alarm ARN | DT-004 tests |
| /bank/{env}/observability/slo/default-p99-latency-ms | Default p99 ms threshold | tech leads (override per service) |
| /bank/{env}/observability/slo/default-error-rate-pct | Default error-rate % threshold | tech leads |
| /bank/{env}/observability/log-immutability-policy-arn | IAM deny policy ARN | GOV-006 test + SD07 IAM reviews |
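
A consuming module has to pick the layer parameter that matches its Lambda architecture and avoid the deprecated aliases. The sketch below shows one way to derive the path; the function and type names are hypothetical, and only the path shapes and the `AWS_LAMBDA_EXEC_WRAPPER` value come from the table above (the table states the wrapper setting on the AMD64 Node.js row).

```typescript
// Sketch: choose the SSM parameter path for an ADOT layer by architecture.
// Function and type names are illustrative; the paths follow the table above.
type Arch = "amd64" | "arm64";
type LayerKind = "collector" | "nodejs";

function adotLayerParamPath(env: string, kind: LayerKind, arch: Arch): string {
  return `/bank/${env}/observability/adot-${kind}-${arch}-arn`;
}

// Environment a Node.js Lambda sets when it attaches the auto-instrumentation layer.
function adotNodejsEnv(): Record<string, string> {
  return { AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler" };
}

// Example: an ARM64 Node.js Lambda in a hypothetical "dev" environment.
console.log(adotLayerParamPath("dev", "nodejs", "arm64"));
```

Deriving the path from the architecture in one place keeps consumers off the deprecated adot-layer-arn and adot-nodejs-layer-arn aliases.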
Requirement coverage
| Req | Gate test |
|---|---|
| FR-305 | fr-305-ingest-latency.test.ts — PUT -> Insights within 60s |
| FR-306 | fr-306-error-alarm.test.ts — alarm -> alarm-router log within 2 min |
| FR-307 | fr-307-retention.test.ts — platform log groups >=90d, X-Ray group, Firehose -> iceberg |
| FR-308 | fr-308-dashboard-exists.test.ts — every domain dashboard + widgets |
| GOV-006 LOG | pol-gov-006-log-record.test.ts + pol-gov-006-immutability.test.ts |
| DT-004 ALERT | pol-dt-004-latency.test.ts — DQ anomaly -> SNS hop < 2 min |
Notes / deviations
- X-Ray vs. observability-standard: MOD-104 provisions an X-Ray sampling rule but the observability-standard page explicitly says "AWS X-Ray is not used". I kept the X-Ray group here because (a) FR-307 requires trace retention, (b) ADOT collector exports traces to X-Ray by default, and (c) MOD-104 already has an X-Ray sampling rule. Flagged as a wiki correction candidate.
- CloudWatch Logs immutability: there is no event-level UPDATE/DELETE API, so GOV-006 at the event layer is enforced by the SDK surface. Log groups and streams are protected by the mod076-log-immutability-deny IAM policy. Administrators still retain break-glass capability (by design), which is consistent with MOD-104's CloudTrail immutability approach.
- Cold archival bucket: per task decision, reuses MOD-104's iceberg bucket with a dedicated observability-archive/ prefix. Destination encryption uses the financial CMK (that is what iceberg already enforces). Firehose stream-level SSE uses the operational CMK.
- PagerDuty: TODO in alarm-router/index.ts. For dev we route all severities to MOD-104's alerts SNS topic only.
- Per-service SLO thresholds: default p99 ms + error-rate % are in SSM so tech leads can override without a new deploy. Per-service overrides belong at /bank/{env}/observability/slo/{service}/p99-latency-ms (not provisioned by MOD-076; each module seeds its own override).
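
The override convention in the last note implies a two-step lookup: try the per-service parameter first, then fall back to the platform default. A minimal sketch follows, assuming a hypothetical `lookup` callback that stands in for an SSM read and returns `undefined` when the parameter does not exist. The function name and error message are illustrative; the two paths are the ones listed on this page.

```typescript
// Sketch: resolve a p99 threshold, preferring a per-service override.
// `lookup` is a hypothetical stand-in for an SSM read; it returns undefined
// when the parameter is missing.
type Lookup = (path: string) => string | undefined;

function resolveP99ThresholdMs(env: string, service: string, lookup: Lookup): number {
  const override = lookup(`/bank/${env}/observability/slo/${service}/p99-latency-ms`);
  const raw = override ?? lookup(`/bank/${env}/observability/slo/default-p99-latency-ms`);
  const ms = Number(raw);
  // Reject missing, non-numeric, or non-positive values instead of alarming on garbage.
  if (raw === undefined || !isFinite(ms) || ms <= 0) {
    throw new Error(`no valid p99 threshold for ${service} in ${env}`);
  }
  return ms;
}
```

Injecting the lookup keeps the resolution logic testable without AWS credentials; a real caller would back it with an SSM client and treat a missing-parameter response as `undefined`.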