MOD-076 — Observability platform

System: SD07 | Repo: bank-platform | Phase: 1

Purpose

Provides centralised observability for every bank service:

  • structured-log ingestion via the existing /aws/bank-platform/* log groups
  • custom metrics via CloudWatch EMF in the bank/modules namespace
  • distributed tracing via the ADOT Lambda Layer -> X-Ray
  • service-health dashboards (one per system domain + one platform overview)
  • SLO alarms with 2-minute page latency
  • 90-day hot retention + cold archival to the MOD-104 iceberg bucket under observability-archive/
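
To illustrate the custom-metrics path, here is a minimal sketch of a helper that builds one CloudWatch EMF log line in the bank/modules namespace. The helper name buildEmfRecord, the "service" dimension, the metric name latency_ms, and the sample service id are assumptions made for illustration; the module's real metric names and dimensions are not specified on this page.

```typescript
// Hypothetical helper: builds one CloudWatch Embedded Metric Format (EMF) record.
// Writing the JSON to stdout from a Lambda is enough for CloudWatch to extract the metrics.
type EmfUnit = "Milliseconds" | "Count" | "Percent";

interface EmfMetric {
  name: string;
  value: number;
  unit: EmfUnit;
}

function buildEmfRecord(
  service: string,
  metrics: EmfMetric[],
  timestampMs: number,
): Record<string, unknown> {
  const record: Record<string, unknown> = {
    _aws: {
      Timestamp: timestampMs, // epoch milliseconds
      CloudWatchMetrics: [
        {
          Namespace: "bank/modules", // namespace named in the Purpose section
          Dimensions: [["service"]], // assumed dimension set
          Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
    service, // value for the "service" dimension
  };
  // Metric values are top-level members keyed by metric name.
  for (const m of metrics) {
    record[m.name] = m.value;
  }
  return record;
}

// Usage: one JSON line per invocation (service id and value are made up).
const emfLine = JSON.stringify(
  buildEmfRecord(
    "mod-043",
    [{ name: "latency_ms", value: 182, unit: "Milliseconds" }],
    1700000000000,
  ),
);
console.log(emfLine);
```

Keeping the builder pure (the timestamp is passed in) makes it easy to unit-test without touching CloudWatch.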

Architecture

Lambda (MOD-xxx)
   stdout JSON -> CloudWatch Logs (/aws/bank-platform/{env})
      |  subscription filter: level="ERROR" (mod076-platform-error-sub)
      |  subscription filter: event_type="data_quality_anomaly_detected"
      |  -> alarm-router Lambda -> SNS (MOD-104 alerts topic)
      |
      +> subscription filter (empty pattern) -> Firehose
            -> s3://iceberg/observability-archive/... (KMS: financial)

CloudWatch Alarms (error rate, p99 latency, DQ anomaly)
   -> mod076-alarm-intake (SNS) -> alarm-router Lambda
      -> classify severity -> MOD-104 alerts SNS topic
      // TODO PagerDuty integration (dev stops at SNS)
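
The "classify severity" hop in the diagram could look roughly like the sketch below. It is not the contents of alarm-router/index.ts: the severity levels, the name-based rules, and the sample alarm name are all assumptions. Only the AlarmName and NewStateValue fields come from the standard CloudWatch alarm notification that SNS delivers.

```typescript
// Hypothetical severity model; the real alarm-router rules are not documented here.
type Severity = "critical" | "warning" | "info";

// Subset of the CloudWatch alarm notification JSON carried in the SNS message body.
interface AlarmMessage {
  AlarmName: string;
  NewStateValue: "ALARM" | "OK" | "INSUFFICIENT_DATA";
  NewStateReason?: string;
}

// Assumed rules: recoveries are informational, error-rate breaches page,
// everything else (p99 latency, DQ anomaly) is a warning.
function classifySeverity(alarm: AlarmMessage): Severity {
  if (alarm.NewStateValue !== "ALARM") return "info";
  if (alarm.AlarmName.includes("error-rate")) return "critical";
  return "warning";
}

// Parses the SNS message body and returns what would be forwarded to the alerts topic.
function routeAlarm(snsMessageBody: string): { severity: Severity; subject: string } {
  const alarm = JSON.parse(snsMessageBody) as AlarmMessage;
  const severity = classifySeverity(alarm);
  return { severity, subject: `[${severity.toUpperCase()}] ${alarm.AlarmName}` };
}

// Usage with a made-up alarm name.
const routed = routeAlarm(
  JSON.stringify({ AlarmName: "mod076-error-rate", NewStateValue: "ALARM" }),
);
console.log(routed.subject);
```

In dev every severity would still end up on the MOD-104 alerts SNS topic; the classification only becomes a routing decision once the PagerDuty integration exists.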

Stack overview

| Stack | Resources | Policy |
|---|---|---|
| otel-collector | SSM params exposing the managed ADOT Lambda Layer ARN and collector YAML | |
| subscription-filters | alarm-router Lambda + IAM role, intake SNS topic, structured-log pattern filters, data-quality pattern filter | |
| slo-alarms | Error-rate metric-math alarm, p99 latency alarm, SSM threshold params | FR-306 |
| dashboards | Platform overview + per-domain CloudWatch dashboards | FR-308 |
| log-archival | Subscription filter -> Firehose -> iceberg bucket, X-Ray group, platform log-group retention 90d | FR-307 |
| log-immutability | IAM deny policy attached to every bank-platform runtime role | GOV-006 |
| data-quality-alerts | CloudWatch alarm on bank/data-quality#anomaly_total | DT-004 |
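
As a rough picture of what the log-immutability stack attaches to runtime roles, the sketch below builds an IAM deny policy document. The statement id, the exact action list, and the resource pattern are assumptions. The real mod076-log-immutability-deny policy may deny a different set of actions.

```typescript
// Hypothetical shape of a log-immutability deny policy (not the deployed policy).
interface PolicyStatement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

function buildLogImmutabilityDeny(logGroupPrefix: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyLogTampering", // assumed statement id
        Effect: "Deny",
        // Assumed action list: deletion of groups/streams and retention changes.
        Action: [
          "logs:DeleteLogGroup",
          "logs:DeleteLogStream",
          "logs:PutRetentionPolicy",
          "logs:DeleteRetentionPolicy",
        ],
        // The trailing wildcard covers the log groups under the prefix and their streams.
        Resource: [`arn:aws:logs:*:*:log-group:${logGroupPrefix}*`],
      },
    ],
  };
}

// Usage: the prefix matches the platform log groups named in the Purpose section.
const denyPolicy = buildLogImmutabilityDeny("/aws/bank-platform/");
console.log(JSON.stringify(denyPolicy, null, 2));
```

An explicit Deny wins over any Allow on the same role, which is why a single attached policy can cover every runtime role.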

SSM outputs

| SSM path | Value | Consumed by |
|---|---|---|
| /bank/{env}/observability/adot-collector-amd64-arn | ADOT collector-only Lambda Layer (AMD64) | AMD64 Lambdas with own SDK |
| /bank/{env}/observability/adot-collector-arm64-arn | ADOT collector-only Lambda Layer (ARM64 / Graviton) | ARM64 Lambdas with own SDK |
| /bank/{env}/observability/adot-nodejs-amd64-arn | ADOT Node.js auto-instrumentation Lambda Layer (AMD64); set AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-handler | AMD64 Node.js Lambdas |
| /bank/{env}/observability/adot-nodejs-arm64-arn | ADOT Node.js auto-instrumentation Lambda Layer (ARM64) | ARM64 Node.js Lambdas |
| /bank/{env}/observability/adot-layer-arn | DEPRECATED alias of adot-collector-amd64-arn | legacy consumers (migrate) |
| /bank/{env}/observability/adot-nodejs-layer-arn | DEPRECATED alias of adot-nodejs-amd64-arn | legacy consumers (migrate) |
| /bank/{env}/observability/collector-config | ADOT collector YAML | Lambda extensions |
| /bank/{env}/observability/archive-firehose-arn | Firehose delivery stream ARN | FR-307 tests, audit |
| /bank/{env}/observability/xray-group-arn | X-Ray group ARN | services emitting traces |
| /bank/{env}/observability/alarm-router-arn | Lambda ARN for the alarm-router | MOD-043, MOD-063 |
| /bank/{env}/observability/alarm-intake-topic-arn | SNS intake topic ARN | other modules routing custom alarms |
| /bank/{env}/sns/alarm-intake/arn | Alias of observability/alarm-intake-topic-arn under the cross-cutting /sns/ namespace | Snowflake-side alerts (MOD-041 model-drift, MOD-030/065 DLQ depth) via SNOWFLAKE.NOTIFICATION.GET_PARAMETER |
| /bank/{env}/observability/platform-dashboard-arn | Dashboard ARN | engineering / on-call |
| /bank/{env}/observability/dashboards/{domain}/arn | Dashboard ARN per domain | each system domain |
| /bank/{env}/observability/alarms/error-rate-arn | Alarm ARN | FR-306 tests |
| /bank/{env}/observability/alarms/p99-latency-arn | Alarm ARN | FR-306 tests |
| /bank/{env}/observability/alarms/data-quality-arn | Alarm ARN | DT-004 tests |
| /bank/{env}/observability/slo/default-p99-latency-ms | Default p99 ms threshold | tech leads (override per service) |
| /bank/{env}/observability/slo/default-error-rate-pct | Default error-rate % threshold | tech leads |
| /bank/{env}/observability/log-immutability-policy-arn | IAM deny policy ARN | GOV-006 test + SD07 IAM reviews |
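
A consuming module has to pick the layer parameter that matches its Lambda architecture and runtime. The sketch below shows one way to derive the SSM path and the extra environment variable. The function names are hypothetical, and applying AWS_LAMBDA_EXEC_WRAPPER to the ARM64 Node.js layer as well is an assumption, since the table states it only on the AMD64 row.

```typescript
// Lambda reports architectures as "x86_64" / "arm64"; the SSM paths use "amd64" / "arm64".
type LambdaArch = "x86_64" | "arm64";
type AdotLayerKind = "collector" | "nodejs";

// Returns the non-deprecated SSM path for the requested layer flavour.
function adotLayerParam(env: string, kind: AdotLayerKind, arch: LambdaArch): string {
  const archSuffix = arch === "x86_64" ? "amd64" : "arm64";
  return `/bank/${env}/observability/adot-${kind}-${archSuffix}-arn`;
}

// Environment variables the function needs alongside the layer.
// Assumption: the wrapper applies to the Node.js layer on both architectures.
function adotEnvVars(kind: AdotLayerKind): Record<string, string> {
  return kind === "nodejs" ? { AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler" } : {};
}

// Usage: a Graviton Node.js Lambda in the dev environment.
const layerParam = adotLayerParam("dev", "nodejs", "arm64");
const layerEnv = adotEnvVars("nodejs");
console.log(layerParam, layerEnv);
```

Deriving the path from architecture and kind keeps new consumers off the deprecated adot-layer-arn and adot-nodejs-layer-arn aliases.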

Requirement coverage

| Req | Gate test |
|---|---|
| FR-305 | fr-305-ingest-latency.test.ts: PUT -> Insights within 60s |
| FR-306 | fr-306-error-alarm.test.ts: alarm -> alarm-router log within 2 min |
| FR-307 | fr-307-retention.test.ts: platform log groups >=90d, X-Ray group, Firehose -> iceberg |
| FR-308 | fr-308-dashboard-exists.test.ts: every domain dashboard + widgets |
| GOV-006 LOG | pol-gov-006-log-record.test.ts + pol-gov-006-immutability.test.ts |
| DT-004 ALERT | pol-dt-004-latency.test.ts: DQ anomaly -> SNS hop < 2 min |
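
The retention part of the FR-307 gate boils down to a check like the one sketched below. This is not the contents of fr-307-retention.test.ts: the function name and sample log groups are made up. The input shape mirrors the logGroupName and retentionInDays fields returned by DescribeLogGroups, where a missing retentionInDays means the group never expires.

```typescript
// Subset of a DescribeLogGroups entry.
interface LogGroupSummary {
  logGroupName: string;
  retentionInDays?: number; // absent => events never expire
}

// Returns the names of platform log groups whose retention is shorter than minDays.
function retentionViolations(
  groups: LogGroupSummary[],
  prefix: string,
  minDays: number,
): string[] {
  return groups
    .filter((g) => g.logGroupName.startsWith(prefix))
    .filter((g) => g.retentionInDays !== undefined && g.retentionInDays < minDays)
    .map((g) => g.logGroupName);
}

// Usage with made-up log groups.
const sampleGroups: LogGroupSummary[] = [
  { logGroupName: "/aws/bank-platform/dev", retentionInDays: 90 },
  { logGroupName: "/aws/bank-platform/dev-scratch", retentionInDays: 30 },
  { logGroupName: "/aws/bank-platform/dev-audit" }, // never expires
  { logGroupName: "/aws/lambda/unrelated", retentionInDays: 7 }, // outside the prefix
];
const violations = retentionViolations(sampleGroups, "/aws/bank-platform/", 90);
console.log(violations);
```

A gate test would assert that the returned list is empty for the live account.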

Notes / deviations

  • X-Ray vs. observability-standard: MOD-104 provisions an X-Ray sampling rule but the observability-standard page explicitly says "AWS X-Ray is not used". I kept the X-Ray group here because (a) FR-307 requires trace retention, (b) ADOT collector exports traces to X-Ray by default, and (c) MOD-104 already has an X-Ray sampling rule. Flagged as a wiki correction candidate.
  • CloudWatch Logs immutability: CloudWatch Logs exposes no event-level UPDATE/DELETE API, so at the event layer GOV-006 is enforced by the SDK/API surface itself (a written log event cannot be edited or removed individually). Log groups and streams are protected by the mod076-log-immutability-deny IAM policy. Administrators retain break-glass capability by design, consistent with MOD-104's CloudTrail immutability approach.
  • Cold archival bucket: per task decision, reuses MOD-104's iceberg bucket with a dedicated observability-archive/ prefix. Destination encryption uses the financial CMK (that is what iceberg already enforces). Firehose stream-level SSE uses the operational CMK.
  • PagerDuty: TODO in alarm-router/index.ts. For dev we route all severities to MOD-104's alerts SNS topic only.
  • Per-service SLO thresholds: default p99 ms + error-rate % are in SSM so tech leads can override without a new deploy. Per-service overrides belong at /bank/{env}/observability/slo/{service}/p99-latency-ms (not provisioned by MOD-076 — each module seeds its own override).
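
The per-service override convention from the last note can be sketched as a lookup that prefers the service path and falls back to the platform default. The function name, the lookup callback (standing in for an SSM client), and the sample values are assumptions; only the two parameter paths come from this page.

```typescript
// Stand-in for an SSM read: returns the parameter value, or undefined when absent.
type ParamLookup = (path: string) => string | undefined;

function resolveP99LatencyMs(env: string, service: string, lookup: ParamLookup): number {
  const overridePath = `/bank/${env}/observability/slo/${service}/p99-latency-ms`;
  const defaultPath = `/bank/${env}/observability/slo/default-p99-latency-ms`;

  // Prefer the per-service override seeded by the owning module.
  const raw = lookup(overridePath) ?? lookup(defaultPath);
  if (raw === undefined) {
    throw new Error(`no p99 threshold found at ${overridePath} or ${defaultPath}`);
  }
  const value = Number(raw);
  if (!Number.isFinite(value) || value <= 0) {
    throw new Error(`invalid p99 threshold "${raw}"`);
  }
  return value;
}

// Usage with made-up parameter values.
const sloParams: Record<string, string> = {
  "/bank/dev/observability/slo/default-p99-latency-ms": "1500",
  "/bank/dev/observability/slo/mod-043/p99-latency-ms": "400",
};
const sloLookup: ParamLookup = (path) => sloParams[path];

const overridden = resolveP99LatencyMs("dev", "mod-043", sloLookup);
const defaulted = resolveP99LatencyMs("dev", "mod-063", sloLookup);
console.log(overridden, defaulted);
```

Because both values live in SSM, changing a threshold is a parameter update rather than a redeploy.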