Observability standard¶

Every Lambda in the platform emits structured logs to stdout. CloudWatch Logs captures those entries. MOD-076 ingests them via CloudWatch subscription filters and routes metrics, alerts, and dashboards from there. Module code has no direct dependency on observability tooling — the contract is stdout JSON.

AWS X-Ray is used for distributed trace visualisation — service maps, trace waterfalls, and latency analytics. All Lambdas are instrumented via the ADOT (AWS Distro for OpenTelemetry) layer provisioned by MOD-076. Sampling is controlled centrally by the rule provisioned in MOD-104; module code does not configure sampling directly.

Structured log trace propagation (the trace_id and correlation_id fields) complements X-Ray and is not replaced by it. Log fields are the audit trail: they are retained for 90 days hot and 7 years cold, and can be queried across all modules by trace_id. X-Ray traces are the investigation layer: they are retained for 30 days and provide the visual service map and waterfall timeline for incident analysis. The two mechanisms work together: a trace_id in a log entry corresponds to an X-Ray trace with the same ID.

Structured log format¶

Each Lambda handler emits one JSON object per logical event to stdout. CloudWatch Logs treats each stdout line as a log record. MOD-076 ingests via subscription filter — module code does not call any observability SDK.

Example log entry¶

The following is a terminal (completion) entry from MOD-001, a payment posting in jurisdiction NZ:

{
  "trace_id": "a3f8c2d1-7e45-4b09-b6f2-91d3e0c5a847",
  "correlation_id": "e1b2a3c4-5d6e-7f80-91a2-b3c4d5e6f701",
  "module_id": "MOD-001",
  "jurisdiction": "NZ",
  "event_type": "posting_committed",
  "party_id": "c7f1d3a2-8b56-4e90-a1f2-3d4e5b6c7d8e",
  "account_id": "f2e4d6c8-9a10-4b3c-8d7e-6f5a4b3c2d1e",
  "duration_ms": 4,
  "level": "INFO",
  "error_code": null,
  "retryable": null,
  "db_query_ms": 2,
  "timestamp": "2026-04-15T03:17:42.381Z"
}

Mandatory fields¶

Field	Type	Required on	Notes
`trace_id`	uuid	every entry	Propagated from `X-Trace-Id` header or `event.detail.trace_id`; generated fresh if absent (log WARN `trace_id_missing_from_upstream`)
`correlation_id`	uuid	every entry	Scoped to a single Lambda invocation; new UUID per invocation
`module_id`	string	every entry	e.g. `MOD-001`; written as a constant in module code
`jurisdiction`	string	every entry	`NZ` or `AU`; sourced from JWT claim, account record, or event payload; `UNKNOWN` if unavailable
`event_type`	string	every entry	noun_verb with the verb in the past tense: `posting_committed`, `validation_failed`, `session_started`, etc.
`party_id`	uuid or null	every entry	`null` for system/platform events with no customer context
`account_id`	uuid or null	where applicable	`null` for non-account events
`duration_ms`	int	terminal entries	Wall-clock time from invocation start to function completion; omit on intermediate entries
`level`	string	every entry	`INFO` / `WARN` / `ERROR` / `DEBUG` (DEBUG suppressed in prod)
`error_code`	string or null	error entries	Matches the standard error envelope `error_code` defined in the error handling standard
`retryable`	bool or null	error entries	`true` for transient infrastructure failures; `false` for validation and business rule failures
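
As a rough illustration of the table above, the following sketch assembles one flat log entry with the mandatory fields and writes it to stdout as a single JSON line. The helper names `build_log_entry` and `emit` are hypothetical and not part of any platform library. Including `timestamp` is an assumption taken from the example entry, since the table does not list it.

```python
import json
import sys
from datetime import datetime, timezone
from typing import Any, Optional

MODULE_ID = "MOD-001"  # written as a constant in module code, per the table


def build_log_entry(
    *,
    trace_id: str,
    correlation_id: str,
    jurisdiction: str,
    event_type: str,
    level: str = "INFO",
    party_id: Optional[str] = None,
    account_id: Optional[str] = None,
    duration_ms: Optional[int] = None,
    error_code: Optional[str] = None,
    retryable: Optional[bool] = None,
    **extra: Any,
) -> dict:
    """Assemble one flat log entry (hypothetical helper, not a platform API)."""
    entry = {
        "trace_id": trace_id,
        "correlation_id": correlation_id,
        "module_id": MODULE_ID,
        "jurisdiction": jurisdiction or "UNKNOWN",
        "event_type": event_type,
        "party_id": party_id,
        "account_id": account_id,
        "level": level,
        "error_code": error_code,
        "retryable": retryable,
        "timestamp": datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z"),
    }
    if duration_ms is not None:
        # duration_ms belongs on terminal entries only, so it is omitted otherwise.
        entry["duration_ms"] = duration_ms
    entry.update(extra)  # flat step timings such as db_query_ms
    return entry


def emit(entry: dict) -> None:
    """Write the entry as exactly one line of JSON on stdout."""
    sys.stdout.write(json.dumps(entry) + "\n")
```

A terminal entry would then be written with something like `emit(build_log_entry(trace_id=t, correlation_id=c, jurisdiction="NZ", event_type="posting_committed", duration_ms=4, db_query_ms=2))`.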

Field guidance¶

event_type must come from the module's own declared event type registry — not free-text strings. Each module's design document defines the enumeration of valid event types for that module. Using undeclared strings breaks MOD-076 filtering rules.
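
One way to keep event types out of free text is to declare them as an enumeration in module code. The sketch below assumes a hypothetical `EventType` registry with three illustrative members; the real list comes from each module's design document.

```python
from enum import Enum


class EventType(str, Enum):
    """Hypothetical registry; real values come from the module design document."""

    POSTING_COMMITTED = "posting_committed"
    VALIDATION_FAILED = "validation_failed"
    TRACE_ID_MISSING = "trace_id_missing_from_upstream"


def event_type_value(name: str) -> str:
    """Return the registered string, or raise ValueError for undeclared names."""
    return EventType(name).value
```

Routing every log call through a lookup like `event_type_value` means an undeclared string fails at the call site instead of silently slipping past the MOD-076 filtering rules.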

PII must not appear in log field values. Log the party_id reference only — never name, date of birth, address, or national identifier. Log document type, not document number. Log amount range buckets (<$100, $100–$1,000, >$1,000) in INFO-level entries, not exact amounts. Exact amounts are written only to the audit trail (MOD-002), not to operational logs. This boundary is enforced by code review and automated secret scanning in CI.
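
The amount bucketing rule can be sketched as a small pure function. The function name is hypothetical, and the boundary handling is an assumption: the text does not say which bucket exactly $100 or exactly $1,000 falls into, so this sketch places both in the middle bucket.

```python
from decimal import Decimal


def amount_bucket(amount: Decimal) -> str:
    """Map an exact amount to the coarse bucket used in INFO-level logs."""
    if amount < Decimal("100"):
        return "<$100"
    if amount <= Decimal("1000"):  # assumption: both boundaries are inclusive here
        return "$100–$1,000"
    return ">$1,000"
```

The log entry then carries the bucket string, while the exact amount goes only to the audit trail.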

duration_ms is the wall-clock time from Lambda invocation start to function completion for the outermost handler. Internal step timings are written as additional fields in the same log entry (e.g. db_query_ms, external_api_ms). This avoids nested log structures while preserving per-step timing.
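
A minimal sketch of how flat step timings might be collected alongside the outer duration_ms. The `timed_step` context manager and the `handle` function are hypothetical, and the `time.sleep` call merely stands in for a real database query.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed_step(timings: Dict[str, int], field: str) -> Iterator[None]:
    """Record the elapsed wall-clock time of a step under a flat field name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[field] = int((time.perf_counter() - start) * 1000)


def handle(event: dict) -> Dict[str, int]:
    """Hypothetical handler body showing outer and per-step timing together."""
    invocation_start = time.perf_counter()
    timings: Dict[str, int] = {}
    with timed_step(timings, "db_query_ms"):
        time.sleep(0.002)  # stand-in for a database call
    timings["duration_ms"] = int((time.perf_counter() - invocation_start) * 1000)
    return timings  # merged into the terminal log entry as top-level fields
```

Because every timing is a top-level integer field, the terminal entry stays flat and each step can still be filtered or aggregated on its own.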

Trace propagation¶

A trace_id represents a single end-to-end flow, potentially spanning multiple Lambda invocations, EventBridge events, and external calls. A correlation_id is scoped to one Lambda invocation. The trace_id travels unchanged through the entire flow; each Lambda generates a fresh correlation_id.

Inbound: HTTP (API Gateway)¶

Read the X-Trace-Id request header. If absent, generate a UUID4, log a WARN entry with event_type: trace_id_missing_from_upstream, and continue. Always echo X-Trace-Id back in the HTTP response, regardless of whether it was received or generated.
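
The inbound HTTP rule can be sketched as two small helpers, assuming an API Gateway proxy event shape. Both function names are hypothetical. The `generated` flag is returned so the caller knows when to log the WARN entry.

```python
import uuid
from typing import Optional, Tuple


def resolve_http_trace_id(event: dict) -> Tuple[str, bool]:
    """Return (trace_id, generated) for an API Gateway proxy event."""
    # Header casing can differ between API Gateway API types, so compare lower-cased.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    upstream: Optional[str] = headers.get("x-trace-id")
    if upstream:
        return upstream, False
    return str(uuid.uuid4()), True


def with_trace_header(response: dict, trace_id: str) -> dict:
    """Echo X-Trace-Id on the response, whether it was received or generated."""
    headers = dict(response.get("headers") or {})
    headers["X-Trace-Id"] = trace_id
    return {**response, "headers": headers}
```

When `generated` is true, the handler logs a WARN entry with event_type `trace_id_missing_from_upstream` before continuing.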

Inbound: EventBridge¶

Read event["detail"]["trace_id"]. If absent, generate a UUID4 and log WARN trace_id_missing_from_upstream.

Outbound: Lambda invocation (intra-domain)¶

Pass both trace_id and correlation_id in the invocation payload under a _meta envelope:

{
  "_meta": {
    "trace_id": "a3f8c2d1-7e45-4b09-b6f2-91d3e0c5a847",
    "correlation_id": "e1b2a3c4-5d6e-7f80-91a2-b3c4d5e6f701"
  },
  ...
}
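
A minimal sketch of building that payload in Python. The function name is hypothetical; the returned bytes are the kind of value that can be passed as `Payload` to boto3's Lambda `invoke` call.

```python
import json


def build_invocation_payload(body: dict, trace_id: str, correlation_id: str) -> bytes:
    """Wrap a request body with the _meta envelope for a Lambda-to-Lambda call."""
    payload = {
        **body,
        # _meta is written last so a stray "_meta" key in body cannot override it.
        "_meta": {"trace_id": trace_id, "correlation_id": correlation_id},
    }
    return json.dumps(payload).encode("utf-8")
```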

Outbound: EventBridge event¶

Write trace_id into detail.trace_id. The correlation_id of the emitting Lambda is not forwarded — the receiving Lambda generates its own.
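
As a sketch of this rule, the helper below builds one entry in the shape accepted by the EventBridge PutEvents API. The function name and the example source string in the usage note are hypothetical.

```python
import json


def build_event_entry(source: str, detail_type: str, detail: dict, trace_id: str) -> dict:
    """Build one PutEvents entry that carries trace_id but not correlation_id."""
    outbound = {k: v for k, v in detail.items() if k != "correlation_id"}
    outbound["trace_id"] = trace_id
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps(outbound),  # PutEvents expects Detail as a JSON string
    }
```

An entry built with, say, `build_event_entry("bank.mod-001", "posting_committed", detail, trace_id)` can be passed in the `Entries` list of a `put_events` call.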

Outbound: HTTP (external provider)¶

Pass trace_id as the X-Trace-Id request header on all outbound HTTP calls to external providers (eIDV, sanctions screening, payment rails, etc.).
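
A minimal sketch using only the standard library, in case it helps to see the header being attached. The function name and the URL in the usage note are hypothetical, and a real module may well use a different HTTP client.

```python
import urllib.request


def build_provider_request(url: str, body: bytes, trace_id: str) -> urllib.request.Request:
    """Prepare an outbound POST that carries the flow's trace_id."""
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Trace-Id": trace_id},
    )
```

The request is only constructed here; sending it (for example with `urllib.request.urlopen(build_provider_request("https://provider.example/screen", payload, trace_id))`) is left to the caller so that timeouts and retries stay under the module's control.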

Immutability rule¶

A trace_id must never be dropped or reset mid-flow: it is preserved unchanged from the entry point to the terminal step. Generating a new trace_id part-way through a flow breaks traceability and is treated as a defect. Only the correlation_id changes, with a new one generated per Lambda invocation.

Reference implementation¶

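from uuid import uuid4

def extract_trace_context(event: dict, context) -> tuple[str, str]:
    """Extract or generate trace_id; always generate a fresh correlation_id."""
    # API Gateway may send "headers": null, and header casing varies by API type.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    trace_id = (
        headers.get("x-trace-id")  # inbound HTTP
        or (event.get("detail") or {}).get("trace_id")  # inbound EventBridge
        or (event.get("_meta") or {}).get("trace_id")  # intra-domain Lambda invocation
        or str(uuid4())  # absent upstream: caller must also log WARN trace_id_missing_from_upstream
    )
    correlation_id = str(uuid4())
    return trace_id, correlation_id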

Call this at the top of every Lambda handler, before any business logic. Pass the returned values into every log entry and every outbound call made during the invocation.

Distributed tracing (X-Ray)¶

X-Ray is provisioned at three levels:

Sampling rule — bank-platform-default-{env} at 5% fixed rate, reservoir 1. Provisioned by MOD-104 and exported to SSM at /bank/{env}/xray/sampling/arn. All Lambdas inherit this rule unless they declare a service-specific override.
ADOT layer — AWS Distro for OpenTelemetry Lambda layer. Provisioned by MOD-076 and exported to SSM at /bank/{env}/observability/adot-layer-arn. Attach this layer ARN to every Lambda function in the module's IaC.
X-Ray groups — One group per system domain (SD01–SD08), provisioned by MOD-076. Groups filter traces by module_id tag so the service map shows clean domain boundaries.
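
To make the sampling rule concrete: X-Ray first traces up to the reservoir size per second, then applies the fixed rate to the remaining requests. The sketch below estimates the expected number of sampled requests per second under that model. The function name is hypothetical, and the calculation assumes steady traffic and a single rule shared by all matching requests.

```python
def expected_sampled_per_second(
    requests_per_second: float, fixed_rate: float = 0.05, reservoir: int = 1
) -> float:
    """Expected number of traced requests per second under one sampling rule."""
    from_reservoir = min(requests_per_second, reservoir)
    beyond_reservoir = max(requests_per_second - reservoir, 0)
    return from_reservoir + fixed_rate * beyond_reservoir
```

At 100 requests per second this gives 1 + 0.05 × 99 = 5.95 traces per second, so roughly 6% of requests are traced. At very low traffic the reservoir dominates and almost every request is traced.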

Instrumenting a Lambda¶

In module IaC, attach the ADOT layer to every Lambda function. No code changes are required — the layer intercepts the Node/Python runtime and reports spans automatically:

// SST / Pulumi — resolve the layer ARN from SSM, then attach
const adotLayerArn = aws.ssm.getParameterOutput({
    name: `/bank/${stage}/observability/adot-layer-arn`,
}).value;

const fn = new aws.lambda.Function("my-handler", {
    layers: [adotLayerArn],
    environment: {
        variables: {
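            // Wrapper path for the Node.js layer; the ADOT Python layer documents /opt/otel-instrument instead.
            AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler",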
            OPENTELEMETRY_COLLECTOR_CONFIG_FILE: "/var/task/collector.yaml",
        },
    },
    // ... rest of function config
});

Trace ID correlation¶

The trace_id field in structured logs is set to the X-Ray trace ID for the invocation. This means a CloudWatch Logs Insights query on trace_id returns the same flow that X-Ray shows as a waterfall. Both views are correlated on the same identifier — use whichever is more appropriate for the task (X-Ray for visual timeline during incidents; CWL Insights for audit queries across the full retention window).
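
For the log side of that correlation, a CloudWatch Logs Insights query can be generated from a trace_id. This is a sketch only: the function name is hypothetical, and the selected fields are simply those from the mandatory fields table.

```python
def trace_query(trace_id: str) -> str:
    """Build a CloudWatch Logs Insights query for every entry in one flow."""
    # Reject anything that is not a plain identifier before embedding it in the query.
    if not trace_id.replace("-", "").isalnum():
        raise ValueError("unexpected characters in trace_id")
    return (
        "fields @timestamp, module_id, event_type, level, correlation_id\n"
        f'| filter trace_id = "{trace_id}"\n'
        "| sort @timestamp asc"
    )
```

The resulting string can be pasted into the Logs Insights console or passed as the query string to a programmatic query, scoped to the log groups of the modules involved in the flow.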

Trace retention¶

X-Ray retains traces for 30 days (FR-307). Beyond 30 days, the structured logs in CloudWatch (90 days hot, 7 years cold) are the only record. Do not rely on X-Ray for regulatory audit purposes — use the Postgres audit trail (MOD-002) or log archive.

Metrics¶

Standard Lambda metrics (invocations, errors, duration, concurrency, throttles) are captured automatically by CloudWatch and require no module-level instrumentation. The following custom metrics are required in addition, emitted via CloudWatch Embedded Metrics Format (EMF).

EMF works by emitting a specially structured JSON object to stdout alongside regular log entries. CloudWatch Logs parses it and publishes named metrics to CloudWatch Metrics without a separate SDK call. Dimensions and values are embedded in the JSON. This keeps the observability contract consistent: stdout is the only channel.

Required custom metrics¶

Metric	Emitted by	Method	Notes
`posting_committed_total`	SD01 modules	EMF	Counter; dimensions: `module_id`, `jurisdiction`, `currency`
`posting_rejected_total`	MOD-001	EMF	Counter; dimensions: `module_id`, `jurisdiction`, `error_code`
`balance_hold_active_gauge`	MOD-003	EMF	Gauge; dimensions: `jurisdiction`
`kyc_verification_duration_ms`	MOD-009	EMF	Histogram; dimensions: `jurisdiction`, `provider`, `outcome`
`sanctions_screen_duration_ms`	MOD-013	EMF	Histogram; dimensions: `jurisdiction`, `list_name`
`event_replication_lag_seconds`	MOD-042	EMF	Gauge; dimensions: `source_database`
`jwt_validation_failure_total`	MOD-044	EMF	Counter; dimensions: `pool`, `failure_reason`
`session_step_up_total`	MOD-068	EMF	Counter; dimensions: `jurisdiction`, `outcome`

EMF example¶

The following emits posting_committed_total from MOD-001. The _aws key signals EMF to CloudWatch; the rest of the object is a normal log entry and will be indexed by MOD-076 as a log record as well.

import json
from datetime import datetime, timezone

def emit_posting_committed_metric(
    trace_id: str,
    correlation_id: str,
    jurisdiction: str,
    currency: str,
) -> None:
    now = datetime.now(timezone.utc)  # read the clock once so both timestamps agree
    record = {
        "_aws": {
            "Timestamp": int(now.timestamp() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": "bank/modules",
                    "Dimensions": [["module_id", "jurisdiction", "currency"]],
                    "Metrics": [{"Name": "posting_committed_total", "Unit": "Count"}],
                }
            ],
        },
        "trace_id": trace_id,
        "correlation_id": correlation_id,
        "module_id": "MOD-001",
        "jurisdiction": jurisdiction,
        "currency": currency,
        "posting_committed_total": 1,
        "level": "INFO",
        "event_type": "posting_committed",
        # Same UTC "Z" format as the example log entry.
        "timestamp": now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
    }
    print(json.dumps(record))
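
The duration metrics in the table follow the same pattern with a different unit. The sketch below is a hypothetical generic builder, shown with `kyc_verification_duration_ms`. It covers only the metric portion of the record; the mandatory log fields would be merged in as in the example above. The `provider` and `outcome` values in the usage note are placeholders.

```python
import json
import time


def build_duration_metric(
    name: str, value_ms: float, dimensions: dict, namespace: str = "bank/modules"
) -> dict:
    """Build an EMF record for one duration observation in milliseconds."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    "Dimensions": [list(dimensions)],  # a single dimension set
                    "Metrics": [{"Name": name, "Unit": "Milliseconds"}],
                }
            ],
        },
        **dimensions,  # every dimension name must also appear as a top-level key
        name: value_ms,
    }


def emit_duration_metric(name: str, value_ms: float, dimensions: dict) -> None:
    """Print the record as one JSON line on stdout."""
    print(json.dumps(build_duration_metric(name, value_ms, dimensions)))
```

For example, `emit_duration_metric("kyc_verification_duration_ms", 812, {"jurisdiction": "NZ", "provider": "example-provider", "outcome": "verified"})` writes one observation; CloudWatch aggregates the individual values into percentile statistics.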

Service-level objectives¶

The following SLOs are monitored by MOD-076. Breach of the alert threshold triggers a PagerDuty notification to the on-call engineer. The alert payload includes: SLO name, current value, threshold, module_id, jurisdiction, and a link to the pre-built MOD-076 dashboard for that module.

SLO	Target	Alert threshold	Owner
API Gateway → posting success rate	≥ 99.9% over 5 min	< 99.5%	SD01
Posting p99 latency	≤ 10 ms	> 20 ms	SD01
Balance query p99 latency	≤ 5 ms	> 10 ms	SD01 / MOD-003
eIDV completion p99 latency	≤ 8 s	> 12 s	SD02 / MOD-009
Sanctions screening p99 latency	≤ 500 ms	> 750 ms	SD02 / MOD-013
JWT validation p99 latency	≤ 10 ms	> 25 ms	SD07 / MOD-044
CDC replication lag	≤ 5 min	> 5 min	SD07 / MOD-042
Payment API success rate	≥ 99.9% over 5 min	< 99.5%	SD04
MTTD for anomalous access	≤ 5 min	—	MOD-076 / NFR-023

SLO definitions are owned by the system domain listed. Changes to target or alert threshold require approval from the owning system's tech lead and a corresponding update to this page.
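
The gap between target and alert threshold can be illustrated with a small worked calculation for the success-rate SLOs. The function names are hypothetical, and treating an empty window as healthy is an assumption of this sketch, not a rule stated by MOD-076.

```python
def success_rate(successes: int, total: int) -> float:
    """Success rate as a percentage; an empty window counts as healthy."""
    return 100.0 if total == 0 else 100.0 * successes / total


def breaches_alert_threshold(
    successes: int, total: int, threshold_pct: float = 99.5
) -> bool:
    """True when the window's success rate is below the alert threshold."""
    return success_rate(successes, total) < threshold_pct
```

With 10,000 requests in a 5-minute window, 9,950 successes is exactly 99.5% and does not alert, although it is already below the 99.9% target. One more failure (9,949 successes) drops the rate below the threshold and would page the on-call engineer.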

Dashboard standard¶

Every module must have a MOD-076 dashboard provisioned as part of its build. The dashboard is defined as infrastructure code in the module's repo — a CloudWatch dashboard JSON in the IaC directory. It is deployed alongside the module and is a required build artefact; a module is not considered Built without it.

Dashboard naming convention: bank-{env}-{module_id}, for example bank-prod-MOD-001 or bank-staging-MOD-009.

Each dashboard must contain:

Error rate (5-minute rolling window)
p50 / p95 / p99 latency
Invocation count
Throttle count
Any module-specific custom metrics listed in the required custom metrics table above
DLQ depth (where the module writes to or reads from a queue)

The standard panel layout and widget configuration are defined in MOD-076's IaC templates. Module IaC imports the template and overrides with module-specific metrics. This ensures consistent layout across dashboards and reduces per-module dashboard work.
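
To give a sense of what a dashboard body looks like, the sketch below generates a CloudWatch dashboard JSON with a subset of the required panels from standard AWS/Lambda metrics. It is a standalone illustration, not the MOD-076 template: the function names, widget sizes, and the region in the usage note are assumptions, and it shows an error count, not the error rate, which would need a metric math expression.

```python
import json
from typing import List


def dashboard_name(env: str, module_id: str) -> str:
    """Apply the naming convention bank-{env}-{module_id}."""
    return f"bank-{env}-{module_id}"


def lambda_widget(
    title: str, metric: str, stat: str, function_name: str, region: str, y: int
) -> dict:
    """One CloudWatch metric widget for a standard AWS/Lambda metric."""
    return {
        "type": "metric",
        "x": 0,
        "y": y,
        "width": 12,
        "height": 6,
        "properties": {
            "title": title,
            "region": region,
            "period": 300,  # 5-minute window
            "stat": stat,
            "metrics": [["AWS/Lambda", metric, "FunctionName", function_name]],
        },
    }


def dashboard_body(function_name: str, region: str) -> str:
    """Return the dashboard body JSON for one Lambda function."""
    specs = [
        ("Errors (5 min)", "Errors", "Sum"),
        ("Latency p99", "Duration", "p99"),
        ("Invocations", "Invocations", "Sum"),
        ("Throttles", "Throttles", "Sum"),
    ]
    widgets: List[dict] = [
        lambda_widget(title, metric, stat, function_name, region, index * 6)
        for index, (title, metric, stat) in enumerate(specs)
    ]
    return json.dumps({"widgets": widgets})
```

For instance, `dashboard_body("my-handler", "ap-southeast-2")` yields a body that the module's IaC could deploy under the name returned by `dashboard_name("prod", "MOD-001")`.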

Log retention¶

Log type	Retention	Storage
Operational Lambda logs (hot)	90 days	CloudWatch Logs
Operational Lambda logs (cold archive)	7 years	S3 (KMS-encrypted, bank/operational key)
Security and auth events	7 years	CloudWatch Logs + S3 archive
Audit trail events	7 years	Postgres (append-only) + Snowflake replica

Hot retention of 90 days is the binding requirement (FR-307). CloudWatch Logs accepts only a fixed set of retention periods; 90 days is one of the supported values.

Log archival from CloudWatch Logs to S3 is handled by MOD-076 via CloudWatch Logs subscription → Kinesis Firehose → S3. Module code does not manage archival directly. Module code does not configure retention policies on log groups — that is owned by the MOD-076 IaC.

CloudWatch Logs subscription filter limit: AWS enforces a maximum of 2 subscription filters per log group. MOD-076 occupies one filter slot for archival (Firehose), which leaves one slot free. If a log group needs more than one further purpose on top of archival (e.g. error pattern routing plus another pattern), those purposes must share the remaining slot via a combined filter pattern. Module IaC must not add a filter without first checking the existing filter count on the target log group, because AWS rejects a third filter.
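
The pre-check on filter count can be sketched as follows. The function name is hypothetical, and `logs_client` is assumed to be a boto3 CloudWatch Logs client (`boto3.client("logs")`), passed in so the check can be exercised without AWS access.

```python
MAX_SUBSCRIPTION_FILTERS = 2  # per-log-group limit stated above


def can_add_subscription_filter(logs_client, log_group_name: str) -> bool:
    """True if the log group still has a free subscription filter slot."""
    response = logs_client.describe_subscription_filters(logGroupName=log_group_name)
    return len(response.get("subscriptionFilters", [])) < MAX_SUBSCRIPTION_FILTERS
```

A deployment script or IaC pre-flight step could call this before creating a filter and fall back to a combined filter pattern when it returns false.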

The Postgres audit trail (MOD-002) is append-only and not subject to CloudWatch retention rules. It is replicated to Snowflake for analytical access. Operational Lambda logs are not a substitute for the audit trail and must not be used as one.