
Observability standard

Related: Delivery methodology · MOD-076 · Error handling standard · Interface contracts


Every Lambda in the platform emits structured logs to stdout. CloudWatch Logs captures those entries. MOD-076 ingests them via CloudWatch subscription filters and routes metrics, alerts, and dashboards from there. Module code has no direct dependency on observability tooling — the contract is stdout JSON.

AWS X-Ray is used for distributed trace visualisation — service maps, trace waterfalls, and latency analytics. All Lambdas are instrumented via the ADOT (AWS Distro for OpenTelemetry) layer provisioned by MOD-076. Sampling is controlled centrally by the rule provisioned in MOD-104; module code does not configure sampling directly.

Structured log trace propagation (the trace_id and correlation_id fields) complements X-Ray; it is not replaced by it. Log fields are the audit trail: they are retained for 90 days hot and 7 years cold, and can be queried across all modules by trace_id. X-Ray traces are the investigation layer: they are retained for 30 days and provide the visual service map and waterfall timeline for incident analysis. The two mechanisms travel together: a trace_id in a log entry corresponds to an X-Ray trace with the same ID.


Structured log format

Each Lambda handler emits one JSON object per logical event to stdout. CloudWatch Logs treats each stdout line as a log record. MOD-076 ingests via subscription filter — module code does not call any observability SDK.

Example log entry

The following is a terminal (completion) entry from MOD-001, a payment posting in jurisdiction NZ:

{
  "trace_id": "a3f8c2d1-7e45-4b09-b6f2-91d3e0c5a847",
  "correlation_id": "e1b2a3c4-5d6e-7f80-91a2-b3c4d5e6f701",
  "module_id": "MOD-001",
  "jurisdiction": "NZ",
  "event_type": "posting_committed",
  "party_id": "c7f1d3a2-8b56-4e90-a1f2-3d4e5b6c7d8e",
  "account_id": "f2e4d6c8-9a10-4b3c-8d7e-6f5a4b3c2d1e",
  "duration_ms": 4,
  "level": "INFO",
  "error_code": null,
  "retryable": null,
  "db_query_ms": 2,
  "timestamp": "2026-04-15T03:17:42.381Z"
}

Mandatory fields

| Field | Type | Required on | Notes |
| --- | --- | --- | --- |
| trace_id | uuid | every entry | Propagated from X-Trace-Id header or event.detail.trace_id; generated fresh if absent (log WARN trace_id_missing_from_upstream) |
| correlation_id | uuid | every entry | Scoped to a single Lambda invocation; new UUID per invocation |
| module_id | string | every entry | e.g. MOD-001; written as a constant in module code |
| jurisdiction | string | every entry | NZ or AU; sourced from JWT claim, account record, or event payload; UNKNOWN if unavailable |
| event_type | string | every entry | Past-tense noun_verb: posting_committed, validation_failed, session_started, etc. |
| party_id | uuid or null | every entry | null for system/platform events with no customer context |
| account_id | uuid or null | where applicable | null for non-account events |
| duration_ms | int | terminal entries | Wall-clock time from invocation start to function completion; omit on intermediate entries |
| level | string | every entry | INFO / WARN / ERROR / DEBUG (DEBUG suppressed in prod) |
| error_code | string or null | error entries | Matches the standard error envelope error_code defined in the error handling standard |
| retryable | bool or null | error entries | true for transient infrastructure failures; false for validation and business rule failures |
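The table above can be read as the shape of a small logging helper. The following is a minimal sketch, not the platform's actual helper: the name log_event, its keyword arguments, and the timestamp formatting are assumptions chosen for illustration; only the field names come from the table.

```python
import json
import sys
from datetime import datetime, timezone
from typing import Any, Optional

MODULE_ID = "MOD-001"  # written as a constant in module code


def log_event(
    event_type: str,
    *,
    trace_id: str,
    correlation_id: str,
    jurisdiction: str = "UNKNOWN",
    level: str = "INFO",
    party_id: Optional[str] = None,
    account_id: Optional[str] = None,
    error_code: Optional[str] = None,
    retryable: Optional[bool] = None,
    **extra: Any,
) -> dict:
    """Build one log entry, write it as a single JSON line, and return it."""
    now = datetime.now(timezone.utc)
    entry = {
        "trace_id": trace_id,
        "correlation_id": correlation_id,
        "module_id": MODULE_ID,
        "jurisdiction": jurisdiction,
        "event_type": event_type,
        "party_id": party_id,
        "account_id": account_id,
        "level": level,
        "error_code": error_code,
        "retryable": retryable,
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        **extra,  # e.g. duration_ms on terminal entries, db_query_ms step timings
    }
    # One JSON object per line: CloudWatch Logs treats each stdout line as a record.
    sys.stdout.write(json.dumps(entry) + "\n")
    return entry
```

A terminal entry would then be written as log_event("posting_committed", trace_id=..., correlation_id=..., jurisdiction="NZ", duration_ms=4).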

Field guidance

event_type must come from the module's own declared event type registry — not free-text strings. Each module's design document defines the enumeration of valid event types for that module. Using undeclared strings breaks MOD-076 filtering rules.
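One way to keep undeclared strings out of log entries is to hold the registry as an enumeration in module code. This is a sketch under assumptions: the class name EventType, the helper checked_event_type, and the three members shown are illustrative; the real members come from each module's design document.

```python
from enum import Enum


class EventType(str, Enum):
    """Hypothetical registry; real members come from the module's design document."""

    POSTING_COMMITTED = "posting_committed"
    VALIDATION_FAILED = "validation_failed"
    TRACE_ID_MISSING_FROM_UPSTREAM = "trace_id_missing_from_upstream"


def checked_event_type(value: str) -> str:
    """Return value unchanged if it is declared; raise ValueError otherwise."""
    return EventType(value).value
```

Looking a string up through the enumeration fails fast in tests, before an undeclared value reaches MOD-076 filtering.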

PII must not appear in log field values. Log the party_id reference only — never name, date of birth, address, or national identifier. Log document type, not document number. Log amount range buckets (<$100, $100–$1,000, >$1,000) in INFO-level entries, not exact amounts. Exact amounts are written only to the audit trail (MOD-002), not to operational logs. This boundary is enforced by code review and automated secret scanning in CI.
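The amount bucketing rule can be sketched as a single function. The function name amount_bucket is an assumption, and so is the boundary handling: exactly $100 and exactly $1,000 are placed in the middle bucket, which is how the three labels read but is not spelled out in this standard.

```python
from decimal import Decimal


def amount_bucket(amount: Decimal) -> str:
    """Map an exact amount to the coarse bucket label used in INFO-level entries."""
    if amount < Decimal("100"):
        return "<$100"
    if amount <= Decimal("1000"):  # assumption: both boundaries fall in the middle bucket
        return "$100–$1,000"
    return ">$1,000"
```

The bucket label goes into the operational log entry; the exact amount is written only to the audit trail.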

duration_ms is the wall-clock time from Lambda invocation start to function completion for the outermost handler. Internal step timings are written as additional fields in the same log entry (e.g. db_query_ms, external_api_ms). This avoids nested log structures while preserving per-step timing.
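A possible way to collect flat per-step timings alongside duration_ms is a small timer object created at the top of the handler. The class name StepTimer and its methods are illustrative assumptions, not a platform API.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


class StepTimer:
    """Collects flat <step>_ms fields plus the handler's overall duration_ms."""

    def __init__(self) -> None:
        # Create this first thing in the handler so duration_ms covers the invocation.
        self._start = time.perf_counter()
        self.fields: Dict[str, int] = {}

    @contextmanager
    def step(self, name: str) -> Iterator[None]:
        started = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the step raised, so error entries still carry timings.
            self.fields[f"{name}_ms"] = int((time.perf_counter() - started) * 1000)

    def finish(self) -> Dict[str, int]:
        """Return step timings plus duration_ms for the terminal log entry."""
        total = int((time.perf_counter() - self._start) * 1000)
        return {**self.fields, "duration_ms": total}
```

Wrapping a database call in `with timer.step("db_query"):` yields a db_query_ms field, and timer.finish() supplies the fields to merge into the terminal entry.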


Trace propagation

A trace_id represents a single end-to-end flow, potentially spanning multiple Lambda invocations, EventBridge events, and external calls. A correlation_id is scoped to one Lambda invocation. The trace_id travels unchanged through the entire flow; each Lambda generates a fresh correlation_id.

Inbound: HTTP (API Gateway)

Read the X-Trace-Id request header. If absent, generate a UUID4, log a WARN entry with event_type: trace_id_missing_from_upstream, and continue. Always echo X-Trace-Id back in the HTTP response, regardless of whether it was received or generated.
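The HTTP inbound rule, including the echo in the response, might look like the sketch below. The names resolve_http_trace_id and handler are illustrative, the WARN entry is abbreviated (a real one carries every mandatory field), and the response shape assumes an API Gateway proxy integration.

```python
import json
from typing import Optional, Tuple
from uuid import uuid4


def resolve_http_trace_id(headers: Optional[dict]) -> Tuple[str, bool]:
    """Return (trace_id, was_generated) for an API Gateway request."""
    # Header-name casing varies by integration type, and headers may be null.
    normalised = {k.lower(): v for k, v in (headers or {}).items()}
    received = normalised.get("x-trace-id")
    if received:
        return received, False
    return str(uuid4()), True


def handler(event: dict, context: object) -> dict:
    trace_id, generated = resolve_http_trace_id(event.get("headers"))
    if generated:
        # Abbreviated WARN entry for illustration only.
        print(json.dumps({
            "level": "WARN",
            "event_type": "trace_id_missing_from_upstream",
            "trace_id": trace_id,
        }))
    # ... business logic would run here ...
    return {
        "statusCode": 200,
        "headers": {"X-Trace-Id": trace_id},  # echoed whether received or generated
        "body": json.dumps({"ok": True}),
    }
```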

Inbound: EventBridge

Read event["detail"]["trace_id"]. If absent, generate a UUID4 and log WARN trace_id_missing_from_upstream.

Outbound: Lambda invocation (intra-domain)

Pass both trace_id and correlation_id in the invocation payload under a _meta envelope:

{
  "_meta": {
    "trace_id": "a3f8c2d1-7e45-4b09-b6f2-91d3e0c5a847",
    "correlation_id": "e1b2a3c4-5d6e-7f80-91a2-b3c4d5e6f701"
  },
  ...
}
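Building that payload could be sketched as follows. The function name build_invocation_payload is an assumption, as is the decision to reject a business payload that already contains a _meta key.

```python
import json


def build_invocation_payload(body: dict, trace_id: str, correlation_id: str) -> bytes:
    """Wrap a business payload with the _meta envelope for a Lambda-to-Lambda call."""
    if "_meta" in body:
        # Assumption: _meta is reserved, so a clash is treated as a caller error.
        raise ValueError("_meta is reserved for trace context")
    payload = {"_meta": {"trace_id": trace_id, "correlation_id": correlation_id}, **body}
    return json.dumps(payload).encode("utf-8")
```

The returned bytes are suitable as the Payload argument of a Lambda invoke call.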

Outbound: EventBridge event

Write trace_id into detail.trace_id. The correlation_id of the emitting Lambda is not forwarded — the receiving Lambda generates its own.
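A sketch of assembling one PutEvents entry under this rule is shown below. The function name build_event_entry and every argument value used with it (source, detail type, bus name) are hypothetical; only the placement of trace_id in detail and the omission of correlation_id come from this standard.

```python
import json


def build_event_entry(
    detail: dict, trace_id: str, *, source: str, detail_type: str, bus_name: str
) -> dict:
    """Build one PutEvents entry with trace_id written into detail.trace_id."""
    enriched = {**detail, "trace_id": trace_id}
    # correlation_id is deliberately left out: the receiver generates its own.
    enriched.pop("correlation_id", None)
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(enriched),  # EventBridge expects Detail as a JSON string
        "EventBusName": bus_name,
    }
```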

Outbound: HTTP (external provider)

Pass trace_id as the X-Trace-Id request header on all outbound HTTP calls to external providers (eIDV, sanctions screening, payment rails, etc.).
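Using only the standard library, attaching the header to an outbound request could look like this. The function name build_provider_request and the JSON content type are assumptions; modules that use another HTTP client would set the same header through that client.

```python
import urllib.request


def build_provider_request(url: str, body: bytes, trace_id: str) -> urllib.request.Request:
    """Prepare an outbound POST carrying the flow's trace_id."""
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Trace-Id": trace_id},
    )
```

urllib stores the header name as X-trace-id internally; HTTP header names are case-insensitive, so the provider receives the same header.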

Immutability rule

A trace_id must never be dropped or reset mid-flow. Generating a new trace_id part-way through a flow breaks traceability and is treated as a defect. A new correlation_id is generated per Lambda invocation. The trace_id is preserved unchanged from the entry point to the terminal step.

Reference implementation

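from uuid import uuid4


def extract_trace_context(event: dict, context) -> tuple[str, str]:
    """Extract or generate trace_id; always generate a fresh correlation_id."""
    # "headers" may be present but null, and header-name casing varies by
    # integration type, so normalise before the lookup.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    trace_id = (
        headers.get("x-trace-id")  # inbound HTTP
        or (event.get("detail") or {}).get("trace_id")  # inbound EventBridge
        or (event.get("_meta") or {}).get("trace_id")  # intra-domain Lambda invocation
        or str(uuid4())  # absent upstream: log WARN trace_id_missing_from_upstream
    )
    correlation_id = str(uuid4())  # new per invocation, never taken from upstream
    return trace_id, correlation_id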

Call this at the top of every Lambda handler, before any business logic. Pass the returned values into every log entry and every outbound call made during the invocation.


Distributed tracing (X-Ray)

X-Ray is provisioned at three levels:

  • Sampling rule: bank-platform-default-{env} at 5% fixed rate, reservoir 1. Provisioned by MOD-104 and exported to SSM at /bank/{env}/xray/sampling/arn. All Lambdas inherit this rule unless they declare a service-specific override.
  • ADOT layer: AWS Distro for OpenTelemetry Lambda layer. Provisioned by MOD-076 and exported to SSM at /bank/{env}/observability/adot-layer-arn. Attach this layer ARN to every Lambda function in the module's IaC.
  • X-Ray groups: One group per system domain (SD01–SD08), provisioned by MOD-076. Groups filter traces by module_id tag so the service map shows clean domain boundaries.
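To get a feel for what a 5% fixed rate with reservoir 1 means in practice, the arithmetic can be sketched as below. This is an approximation under stated assumptions: the reservoir is treated as one guaranteed trace per second for the rule as a whole, the fixed rate applies to the remaining requests, and traffic is assumed steady. The function name expected_traces_per_second is illustrative.

```python
def expected_traces_per_second(
    requests_per_second: float, reservoir: int = 1, fixed_rate: float = 0.05
) -> float:
    """Approximate traces recorded per second under a reservoir + fixed-rate rule."""
    guaranteed = min(requests_per_second, reservoir)
    remainder = max(requests_per_second - reservoir, 0.0)
    return guaranteed + remainder * fixed_rate
```

At roughly 101 matching requests per second this gives about 6 traces per second; at very low traffic nearly every request is traced because the reservoir dominates.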

Instrumenting a Lambda

In module IaC, attach the ADOT layer to every Lambda function. No code changes are required — the layer intercepts the Node/Python runtime and reports spans automatically:

// SST / Pulumi — resolve the layer ARN from SSM, then attach
const adotLayerArn = aws.ssm.getParameterOutput({
    name: `/bank/${stage}/observability/adot-layer-arn`,
}).value;

const fn = new aws.lambda.Function("my-handler", {
    layers: [adotLayerArn],
    environment: {
        variables: {
            // Node.js wrapper path; the ADOT Python layer uses "/opt/otel-instrument" instead
            AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler",
            OPENTELEMETRY_COLLECTOR_CONFIG_FILE: "/var/task/collector.yaml",
        },
    },
    // ... rest of function config
});

Trace ID correlation

The trace_id field in structured logs is set to the X-Ray trace ID for the invocation. This means a CloudWatch Logs Insights query on trace_id returns the same flow that X-Ray shows as a waterfall. Both views are correlated on the same identifier — use whichever is more appropriate for the task (X-Ray for visual timeline during incidents; CWL Insights for audit queries across the full retention window).
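For the CWL Insights side of that correlation, a query for one flow could be built as sketched here. The function name trace_query and the selected display fields are assumptions; the character check only guards against values that could break out of the quoted string and does not enforce any particular ID format.

```python
import re

_SAFE_ID = re.compile(r"[0-9A-Za-z-]+")


def trace_query(trace_id: str) -> str:
    """Build a Logs Insights query returning one flow's entries in time order."""
    if not _SAFE_ID.fullmatch(trace_id):
        raise ValueError("trace_id contains unexpected characters")
    return (
        "fields @timestamp, module_id, event_type, level, correlation_id\n"
        f'| filter trace_id = "{trace_id}"\n'
        "| sort @timestamp asc"
    )
```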

Trace retention

X-Ray retains traces for 30 days (FR-307). Beyond 30 days, the structured logs in CloudWatch (90 days hot, 7 years cold) are the only record. Do not rely on X-Ray for regulatory audit purposes — use the Postgres audit trail (MOD-002) or log archive.


Metrics

Standard Lambda metrics (invocations, errors, duration, concurrency, throttles) are captured automatically by CloudWatch and require no module-level instrumentation. The following custom metrics are required in addition, emitted via CloudWatch Embedded Metrics Format (EMF).

EMF works by emitting a specially structured JSON object to stdout alongside regular log entries. CloudWatch Logs parses it on ingestion and publishes named metrics to CloudWatch Metrics without a separate SDK call. Dimensions and values are embedded in the JSON. This keeps the observability contract consistent: stdout is the only channel.

Required custom metrics

| Metric | Emitted by | Method | Notes |
| --- | --- | --- | --- |
| posting_committed_total | SD01 modules | EMF | Counter; dimensions: module_id, jurisdiction, currency |
| posting_rejected_total | MOD-001 | EMF | Counter; dimensions: module_id, jurisdiction, error_code |
| balance_hold_active_gauge | MOD-003 | EMF | Gauge; dimensions: jurisdiction |
| kyc_verification_duration_ms | MOD-009 | EMF | Histogram; dimensions: jurisdiction, provider, outcome |
| sanctions_screen_duration_ms | MOD-013 | EMF | Histogram; dimensions: jurisdiction, list_name |
| event_replication_lag_seconds | MOD-042 | EMF | Gauge; dimensions: source_database |
| jwt_validation_failure_total | MOD-044 | EMF | Counter; dimensions: pool, failure_reason |
| session_step_up_total | MOD-068 | EMF | Counter; dimensions: jurisdiction, outcome |

EMF example

The following emits posting_committed_total from MOD-001. The _aws key signals EMF to CloudWatch; the rest of the object is a normal log entry and will be indexed by MOD-076 as a log record as well.

import json
from datetime import datetime, timezone

def emit_posting_committed_metric(
    trace_id: str,
    correlation_id: str,
    jurisdiction: str,
    currency: str,
) -> None:
    record = {
        "_aws": {
            "Timestamp": int(datetime.now(timezone.utc).timestamp() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "bank/modules",
                    "Dimensions": [["module_id", "jurisdiction", "currency"]],
                    "Metrics": [{"Name": "posting_committed_total", "Unit": "Count"}],
                }
            ],
        },
        "trace_id": trace_id,
        "correlation_id": correlation_id,
        "module_id": "MOD-001",
        "jurisdiction": jurisdiction,
        "currency": currency,
        "posting_committed_total": 1,
        "level": "INFO",
        "event_type": "posting_committed",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(record))
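The duration metrics in the table follow the same pattern with a Milliseconds unit. The sketch below is a generic variant, not an existing platform helper: the function name emf_duration_record and the provider value used with it are hypothetical, and the caller is expected to print the returned line to stdout.

```python
import json
import time


def emf_duration_record(
    metric: str, value_ms: float, namespace: str, dimensions: dict, **log_fields
) -> str:
    """Return one EMF JSON line recording a millisecond duration."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set; every name must also be a top-level key.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": metric, "Unit": "Milliseconds"}],
                }
            ],
        },
        **log_fields,
        **dimensions,
        metric: value_ms,
    }
    return json.dumps(record)
```

For MOD-009 this might be called as emf_duration_record("kyc_verification_duration_ms", 812.5, "bank/modules", {"jurisdiction": "NZ", "provider": "example-provider", "outcome": "pass"}, module_id="MOD-009").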

Service-level objectives

The following SLOs are monitored by MOD-076. Breach of the alert threshold triggers a PagerDuty notification to the on-call engineer. The alert payload includes: SLO name, current value, threshold, module_id, jurisdiction, and a link to the pre-built MOD-076 dashboard for that module.

| SLO | Target | Alert threshold | Owner |
| --- | --- | --- | --- |
| API Gateway → posting success rate | ≥ 99.9% over 5 min | < 99.5% | SD01 |
| Posting p99 latency | ≤ 10 ms | > 20 ms | SD01 |
| Balance query p99 latency | ≤ 5 ms | > 10 ms | SD01 / MOD-003 |
| eIDV completion p99 latency | ≤ 8 s | > 12 s | SD02 / MOD-009 |
| Sanctions screening p99 latency | ≤ 500 ms | > 750 ms | SD02 / MOD-013 |
| JWT validation p99 latency | ≤ 10 ms | > 25 ms | SD07 / MOD-044 |
| CDC replication lag | ≤ 5 min | > 5 min | SD07 / MOD-042 |
| Payment API success rate | ≥ 99.9% over 5 min | < 99.5% | SD04 |
| MTTD for anomalous access | ≤ 5 min | | MOD-076 / NFR-023 |

SLO definitions are owned by the system domain listed. Changes to target or alert threshold require approval from the owning system's tech lead and a corresponding update to this page.
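The threshold comparisons behind these SLOs can be illustrated with a short sketch. MOD-076 performs the real evaluation; the function names, the nearest-rank percentile method, and the choice not to alert on an empty window are all assumptions made for this example.

```python
from typing import Sequence


def p99(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sample."""
    if not samples_ms:
        raise ValueError("no samples in window")
    ordered = sorted(samples_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer arithmetic
    return ordered[rank - 1]


def latency_alert(samples_ms: Sequence[float], alert_threshold_ms: float) -> bool:
    """True when the window's p99 exceeds the alert threshold."""
    return p99(samples_ms) > alert_threshold_ms


def success_rate_alert(successes: int, total: int, alert_threshold: float) -> bool:
    """True when the success rate over the window falls below the alert threshold."""
    if total == 0:
        return False  # assumption: an empty window does not page
    return successes / total < alert_threshold
```

With the posting latency row, latency_alert(samples, 20) pages only once more than 1% of the window's samples exceed 20 ms; a single slow request among a hundred does not.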


Dashboard standard

Every module must have a MOD-076 dashboard provisioned as part of its build. The dashboard is defined as infrastructure code in the module's repo — a CloudWatch dashboard JSON in the IaC directory. It is deployed alongside the module and is a required build artefact; a module is not considered Built without it.

Dashboard name convention: bank-{env}-{module_id} — for example, bank-prod-MOD-001, bank-staging-MOD-009.

Each dashboard must contain:

  1. Error rate (5-minute rolling window)
  2. p50 / p95 / p99 latency
  3. Invocation count
  4. Throttle count
  5. Any module-specific custom metrics listed in the required custom metrics table above
  6. DLQ depth (where the module writes to or reads from a queue)

The standard panel layout and widget configuration are defined in MOD-076's IaC templates. Module IaC imports the template and overrides with module-specific metrics. This ensures consistent layout across dashboards and reduces per-module dashboard work.
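For orientation only, the naming convention and the general shape of a CloudWatch dashboard body can be sketched as below. This is not the MOD-076 template: the function names, the widget layout, and the use of a plain error count instead of an error rate are simplifications, and custom metric and DLQ panels are omitted.

```python
import json
from typing import List


def dashboard_name(env: str, module_id: str) -> str:
    """Apply the bank-{env}-{module_id} naming convention."""
    return f"bank-{env}-{module_id}"


def lambda_widget(title: str, metrics: List[list], stat: str, y: int, region: str) -> dict:
    """One metric widget on a 5-minute period."""
    return {
        "type": "metric",
        "x": 0, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title, "metrics": metrics, "stat": stat,
            "period": 300, "region": region,
        },
    }


def dashboard_body(function_name: str, region: str) -> str:
    """Return a dashboard body JSON string covering the standard Lambda panels."""
    dim = ["FunctionName", function_name]
    widgets = [
        lambda_widget("Errors", [["AWS/Lambda", "Errors", *dim]], "Sum", 0, region),
        lambda_widget(
            "Latency p50 / p95 / p99",
            [["AWS/Lambda", "Duration", *dim, {"stat": s}] for s in ("p50", "p95", "p99")],
            "p99", 6, region,
        ),
        lambda_widget("Invocations", [["AWS/Lambda", "Invocations", *dim]], "Sum", 12, region),
        lambda_widget("Throttles", [["AWS/Lambda", "Throttles", *dim]], "Sum", 18, region),
    ]
    return json.dumps({"widgets": widgets})
```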


Log retention

| Log type | Retention | Storage |
| --- | --- | --- |
| Operational Lambda logs (hot) | 90 days | CloudWatch Logs |
| Operational Lambda logs (cold archive) | 7 years | S3 (KMS-encrypted, bank/operational key) |
| Security and auth events | 7 years | CloudWatch Logs + S3 archive |
| Audit trail events | 7 years | Postgres (append-only) + Snowflake replica |

Hot retention of 90 days is the binding requirement (FR-307). CloudWatch Logs accepts only a fixed set of retention values, and 90 days is one of them.

Log archival from CloudWatch Logs to S3 is handled by MOD-076 via CloudWatch Logs subscription → Kinesis Firehose → S3. Module code does not manage archival directly. Module code does not configure retention policies on log groups — that is owned by the MOD-076 IaC.

CloudWatch Logs subscription filter limit: AWS enforces a maximum of 2 subscription filters per log group. MOD-076 occupies one filter slot for archival (Firehose), which leaves one. If a log group needs further filtering beyond archival (e.g. error pattern routing), every additional purpose must share the remaining slot via a combined filter pattern. Module IaC must not add a subscription filter without first checking the existing filter count on the target log group, because a third filter exceeds the limit.
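The pre-check on filter count could be sketched as follows. The function name free_filter_slots is an assumption; logs_client is expected to be a boto3 CloudWatch Logs client (or anything exposing the same describe_subscription_filters call), passed in so the check can be exercised without AWS access.

```python
MAX_SUBSCRIPTION_FILTERS = 2  # limit stated in this standard


def free_filter_slots(logs_client, log_group_name: str) -> int:
    """Return how many subscription filter slots remain on a log group."""
    response = logs_client.describe_subscription_filters(logGroupName=log_group_name)
    used = len(response.get("subscriptionFilters", []))
    return max(MAX_SUBSCRIPTION_FILTERS - used, 0)
```

A deployment step would refuse to add a filter when this returns 0 and fall back to extending the existing combined pattern.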

The Postgres audit trail (MOD-002) is append-only and not subject to CloudWatch retention rules. It is replicated to Snowflake for analytical access. Operational Lambda logs are not a substitute for the audit trail and must not be used as one.