Observability standard¶
Related: Delivery methodology · MOD-076 · Error handling standard · Interface contracts
Every Lambda in the platform emits structured logs to stdout. CloudWatch Logs captures those entries. MOD-076 ingests them via CloudWatch subscription filters and routes metrics, alerts, and dashboards from there. Module code has no direct dependency on observability tooling — the contract is stdout JSON.
AWS X-Ray is used for distributed trace visualisation — service maps, trace waterfalls, and latency analytics. All Lambdas are instrumented via the ADOT (AWS Distro for OpenTelemetry) layer provisioned by MOD-076. Sampling is controlled centrally by the rule provisioned in MOD-104; module code does not configure sampling directly.
Structured log trace propagation (the trace_id and correlation_id fields) complements X-Ray tracing; X-Ray does not replace it. Log fields are the audit trail: they are retained for 90 days hot and 7 years cold, and can be queried across all modules by trace_id. X-Ray traces are the investigation layer: they are retained for 30 days and provide the visual service map and waterfall timeline for incident analysis. Both mechanisms travel together: a trace_id in a log entry corresponds to an X-Ray trace with the same ID.
Structured log format¶
Each Lambda handler emits one JSON object per logical event to stdout. CloudWatch Logs treats each stdout line as a log record. MOD-076 ingests via subscription filter — module code does not call any observability SDK.
Example log entry¶
The following is a terminal (completion) entry from MOD-001, a payment posting in jurisdiction NZ:
```json
{
  "trace_id": "a3f8c2d1-7e45-4b09-b6f2-91d3e0c5a847",
  "correlation_id": "e1b2a3c4-5d6e-7f80-91a2-b3c4d5e6f701",
  "module_id": "MOD-001",
  "jurisdiction": "NZ",
  "event_type": "posting_committed",
  "party_id": "c7f1d3a2-8b56-4e90-a1f2-3d4e5b6c7d8e",
  "account_id": "f2e4d6c8-9a10-4b3c-8d7e-6f5a4b3c2d1e",
  "duration_ms": 4,
  "level": "INFO",
  "error_code": null,
  "retryable": null,
  "db_query_ms": 2,
  "timestamp": "2026-04-15T03:17:42.381Z"
}
```
Mandatory fields¶
| Field | Type | Required on | Notes |
|---|---|---|---|
| trace_id | uuid | every entry | Propagated from X-Trace-Id header or event.detail.trace_id; generated fresh if absent (log WARN trace_id_missing_from_upstream) |
| correlation_id | uuid | every entry | Scoped to a single Lambda invocation; new UUID per invocation |
| module_id | string | every entry | e.g. MOD-001; written as a constant in module code |
| jurisdiction | string | every entry | NZ or AU; sourced from JWT claim, account record, or event payload; UNKNOWN if unavailable |
| event_type | string | every entry | noun_verb with a past-tense verb: posting_committed, validation_failed, session_started, etc. |
| party_id | uuid or null | every entry | null for system/platform events with no customer context |
| account_id | uuid or null | where applicable | null for non-account events |
| duration_ms | int | terminal entries | Wall-clock time from invocation start to function completion; omit on intermediate entries |
| level | string | every entry | INFO / WARN / ERROR / DEBUG (DEBUG suppressed in prod) |
| error_code | string or null | error entries | Matches the standard error envelope error_code defined in the error handling standard |
| retryable | bool or null | error entries | true for transient infrastructure failures; false for validation and business rule failures |
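The table above can be turned into a small helper that fills in the fields required on every entry and writes one JSON line to stdout. This is a minimal sketch, not the platform's logging library: `MODULE_ID`, `build_log_entry`, and `emit` are hypothetical names, and the `Z`-suffixed timestamp simply mirrors the example entry.

```python
import json
import sys
from datetime import datetime, timezone
from typing import Any, Optional

MODULE_ID = "MOD-001"  # hypothetical: each module hard-codes its own ID


def build_log_entry(
    trace_id: str,
    correlation_id: str,
    jurisdiction: str,
    event_type: str,
    level: str = "INFO",
    party_id: Optional[str] = None,
    account_id: Optional[str] = None,
    **extra: Any,
) -> dict:
    """Assemble one entry containing the fields required on every entry."""
    if level not in {"INFO", "WARN", "ERROR", "DEBUG"}:
        raise ValueError(f"unsupported level: {level}")
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    entry = {
        "trace_id": trace_id,
        "correlation_id": correlation_id,
        "module_id": MODULE_ID,
        "jurisdiction": jurisdiction or "UNKNOWN",
        "event_type": event_type,
        "party_id": party_id,      # None serialises to JSON null
        "account_id": account_id,
        "level": level,
        "timestamp": now.replace("+00:00", "Z"),
    }
    # Conditional fields (duration_ms, error_code, retryable, step timings).
    entry.update(extra)
    return entry


def emit(entry: dict) -> None:
    """Write the entry as a single stdout line so it becomes one log record."""
    sys.stdout.write(json.dumps(entry, separators=(",", ":")) + "\n")
```

A terminal entry would then be written with something like `emit(build_log_entry(trace_id, correlation_id, "NZ", "posting_committed", duration_ms=4))`.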
Field guidance¶
event_type must come from the module's own declared event type registry — not free-text strings. Each module's design document defines the enumeration of valid event types for that module. Using undeclared strings breaks MOD-076 filtering rules.
PII must not appear in log field values. Log the party_id reference only — never name, date of birth, address, or national identifier. Log document type, not document number. Log amount range buckets (<$100, $100–$1,000, >$1,000) in INFO-level entries, not exact amounts. Exact amounts are written only to the audit trail (MOD-002), not to operational logs. This boundary is enforced by code review and automated secret scanning in CI.
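The amount bucketing rule can be sketched as a pure function. The function name is hypothetical, and the boundary handling (exactly $100 and exactly $1,000 both fall in the middle bucket) is an assumption, since the standard does not say which bucket owns the edges.

```python
from decimal import Decimal


def amount_bucket(amount: Decimal) -> str:
    """Map an exact amount to the coarse bucket allowed in INFO-level logs."""
    if amount < Decimal("100"):
        return "<$100"
    if amount <= Decimal("1000"):  # assumption: both edges are inclusive here
        return "$100–$1,000"
    return ">$1,000"
```

The handler logs `amount_bucket(amount)` and passes the exact value only to the audit trail writer.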
duration_ms is the wall-clock time from Lambda invocation start to function completion for the outermost handler. Internal step timings are written as additional fields in the same log entry (e.g. db_query_ms, external_api_ms). This avoids nested log structures while preserving per-step timing.
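One way to collect flat step timings is a small context manager that writes each elapsed time into a shared dict. This is a sketch under stated assumptions: `timed_step` and `handle` are hypothetical names, and the `time.sleep` call stands in for a real database query.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed_step(timings: Dict[str, int], field: str) -> Iterator[None]:
    """Record the elapsed wall-clock milliseconds of a step under `field`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[field] = int((time.perf_counter() - start) * 1000)


def handle(event: dict) -> dict:
    """Hypothetical handler body showing flat per-step timing fields."""
    started = time.perf_counter()
    timings: Dict[str, int] = {}
    with timed_step(timings, "db_query_ms"):
        time.sleep(0.002)  # stand-in for a database query
    # duration_ms covers the whole handler; step timings sit beside it.
    duration_ms = int((time.perf_counter() - started) * 1000)
    return {"duration_ms": duration_ms, **timings}
```

The returned dict can be merged straight into the terminal log entry, so no nested structure is needed.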
Trace propagation¶
A trace_id represents a single end-to-end flow, potentially spanning multiple Lambda invocations, EventBridge events, and external calls. A correlation_id is scoped to one Lambda invocation. The trace_id travels unchanged through the entire flow; each Lambda generates a fresh correlation_id.
Inbound: HTTP (API Gateway)¶
Read the X-Trace-Id request header. If absent, generate a UUID4, log a WARN entry with event_type: trace_id_missing_from_upstream, and continue. Always echo X-Trace-Id back in the HTTP response, regardless of whether it was received or generated.
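The echo requirement can be sketched as a wrapper applied to whatever response dict the handler has built. The helper name is hypothetical, and the response shape assumed here is the API Gateway proxy format (`statusCode`, `headers`, `body`).

```python
def with_trace_header(response: dict, trace_id: str) -> dict:
    """Return a copy of the response with X-Trace-Id set, received or generated."""
    headers = dict(response.get("headers") or {})
    headers["X-Trace-Id"] = trace_id
    return {**response, "headers": headers}
```

Applying it on every return path, including error responses, keeps the header present regardless of outcome.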
Inbound: EventBridge¶
Read event["detail"]["trace_id"]. If absent, generate a UUID4 and log WARN trace_id_missing_from_upstream.
Outbound: Lambda invocation (intra-domain)¶
Pass both trace_id and correlation_id in the invocation payload under a _meta envelope:
```json
{
  "_meta": {
    "trace_id": "a3f8c2d1-7e45-4b09-b6f2-91d3e0c5a847",
    "correlation_id": "e1b2a3c4-5d6e-7f80-91a2-b3c4d5e6f701"
  },
  ...
}
```
Outbound: EventBridge event¶
Write trace_id into detail.trace_id. The correlation_id of the emitting Lambda is not forwarded — the receiving Lambda generates its own.
Outbound: HTTP (external provider)¶
Pass trace_id as the X-Trace-Id request header on all outbound HTTP calls to external providers (eIDV, sanctions screening, payment rails, etc.).
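The three outbound rules can be sketched as small builders that only shape the data; the actual send (Lambda invoke, EventBridge PutEvents, HTTP client) is left to the caller. All function names are hypothetical. The EventBridge entry uses the `Source`, `DetailType`, and `Detail` keys of a PutEvents entry, where `Detail` is a JSON string.

```python
import json
from typing import Optional


def lambda_payload(body: dict, trace_id: str, correlation_id: str) -> dict:
    """Wrap an intra-domain invocation payload with the _meta envelope."""
    meta = {"trace_id": trace_id, "correlation_id": correlation_id}
    return {**body, "_meta": meta}


def eventbridge_entry(source: str, detail_type: str, detail: dict, trace_id: str) -> dict:
    """Build one PutEvents entry; correlation_id is deliberately not forwarded."""
    return {
        "Source": source,
        "DetailType": detail_type,
        "Detail": json.dumps({**detail, "trace_id": trace_id}),
    }


def outbound_http_headers(trace_id: str, base: Optional[dict] = None) -> dict:
    """Add X-Trace-Id to the headers of a call to an external provider."""
    return {**(base or {}), "X-Trace-Id": trace_id}
```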
Immutability rule¶
A trace_id must never be dropped or reset mid-flow. Generating a new trace_id part-way through a flow breaks traceability and is treated as a defect. A new correlation_id is generated per Lambda invocation. The trace_id is preserved unchanged from the entry point to the terminal step.
Reference implementation¶
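```python
from uuid import uuid4


def extract_trace_context(event: dict, context) -> tuple[str, str]:
    """Extract or generate trace_id; always generate a fresh correlation_id."""
    # API Gateway can send "headers": null, and header casing varies by API type,
    # so normalise to a lower-cased dict before looking up the header.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    detail = event.get("detail") or {}
    # Order: HTTP header, then EventBridge detail, then a freshly generated UUID4.
    trace_id = headers.get("x-trace-id") or detail.get("trace_id") or str(uuid4())
    correlation_id = str(uuid4())  # scoped to this invocation only
    return trace_id, correlation_id
```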
Call this at the top of every Lambda handler, before any business logic. Pass the returned values into every log entry and every outbound call made during the invocation.
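The reference implementation does not read the `_meta` envelope used for intra-domain Lambda invocations, and it does not tell the caller whether the trace_id was generated. A possible extension is sketched below; `TraceContext` and `extract_trace_context_ext` are hypothetical names, and checking `_meta` last is an assumption about precedence.

```python
from typing import NamedTuple
from uuid import uuid4


class TraceContext(NamedTuple):
    trace_id: str
    correlation_id: str
    generated: bool  # True when no upstream trace_id was found


def extract_trace_context_ext(event: dict) -> TraceContext:
    """Look for a trace_id in the header, EventBridge detail, then _meta."""
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    candidates = (
        headers.get("x-trace-id"),
        (event.get("detail") or {}).get("trace_id"),
        (event.get("_meta") or {}).get("trace_id"),
    )
    upstream = next((c for c in candidates if c), None)
    # correlation_id is always new, even if the caller sent one in _meta.
    return TraceContext(upstream or str(uuid4()), str(uuid4()), upstream is None)
```

When `generated` is true, the handler logs the WARN entry with event_type trace_id_missing_from_upstream before continuing.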
Distributed tracing (X-Ray)¶
X-Ray is provisioned at three levels:
- Sampling rule: bank-platform-default-{env} at 5% fixed rate, reservoir 1. Provisioned by MOD-104 and exported to SSM at /bank/{env}/xray/sampling/arn. All Lambdas inherit this rule unless they declare a service-specific override.
- ADOT layer: AWS Distro for OpenTelemetry Lambda layer. Provisioned by MOD-076 and exported to SSM at /bank/{env}/observability/adot-layer-arn. Attach this layer ARN to every Lambda function in the module's IaC.
- X-Ray groups: one group per system domain (SD01–SD08), provisioned by MOD-076. Groups filter traces by module_id tag so the service map shows clean domain boundaries.
Instrumenting a Lambda¶
In module IaC, attach the ADOT layer to every Lambda function. No code changes are required — the layer intercepts the Node/Python runtime and reports spans automatically:
```typescript
// SST / Pulumi: resolve the layer ARN from SSM, then attach
const adotLayerArn = aws.ssm.getParameterOutput({
  name: `/bank/${stage}/observability/adot-layer-arn`,
}).value;

const fn = new aws.lambda.Function("my-handler", {
  layers: [adotLayerArn],
  environment: {
    variables: {
      // Wrapper path for the Node.js ADOT layer; the Python layer uses /opt/otel-instrument
      AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler",
      OPENTELEMETRY_COLLECTOR_CONFIG_FILE: "/var/task/collector.yaml",
    },
  },
  // ... rest of function config
});
```
Trace ID correlation¶
The trace_id field in structured logs is set to the X-Ray trace ID for the invocation. This means a CloudWatch Logs Insights query on trace_id returns the same flow that X-Ray shows as a waterfall. Both views are correlated on the same identifier — use whichever is more appropriate for the task (X-Ray for visual timeline during incidents; CWL Insights for audit queries across the full retention window).
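For the audit-query side, a CloudWatch Logs Insights query for one flow might be built as follows. This is a sketch: the function name and the selected display fields are illustrative choices, and the resulting string is what would be passed as the query string when starting an Insights query against the relevant log groups.

```python
def trace_query(trace_id: str) -> str:
    """Build a Logs Insights query returning one flow in time order."""
    if '"' in trace_id or "\n" in trace_id:
        raise ValueError("trace_id must not contain quotes or newlines")
    return (
        "fields @timestamp, module_id, event_type, level, duration_ms\n"
        f'| filter trace_id = "{trace_id}"\n'
        "| sort @timestamp asc"
    )
```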
Trace retention¶
X-Ray retains traces for 30 days (FR-307). Beyond 30 days, the structured logs in CloudWatch (90 days hot, 7 years cold) are the only record. Do not rely on X-Ray for regulatory audit purposes — use the Postgres audit trail (MOD-002) or log archive.
Metrics¶
Standard Lambda metrics (invocations, errors, duration, concurrency, throttles) are captured automatically by CloudWatch and require no module-level instrumentation. The following custom metrics are required in addition, emitted via CloudWatch Embedded Metrics Format (EMF).
EMF works by emitting a specially structured JSON object to stdout alongside regular log entries. CloudWatch Logs parses it on ingestion and publishes named metrics to CloudWatch Metrics without a separate SDK call. Dimensions and values are embedded in the JSON. This keeps the observability contract consistent: stdout is the only channel.
Required custom metrics¶
| Metric | Emitted by | Method | Notes |
|---|---|---|---|
| posting_committed_total | SD01 modules | EMF | Counter; dimensions: module_id, jurisdiction, currency |
| posting_rejected_total | MOD-001 | EMF | Counter; dimensions: module_id, jurisdiction, error_code |
| balance_hold_active_gauge | MOD-003 | EMF | Gauge; dimensions: jurisdiction |
| kyc_verification_duration_ms | MOD-009 | EMF | Histogram; dimensions: jurisdiction, provider, outcome |
| sanctions_screen_duration_ms | MOD-013 | EMF | Histogram; dimensions: jurisdiction, list_name |
| event_replication_lag_seconds | MOD-042 | EMF | Gauge; dimensions: source_database |
| jwt_validation_failure_total | MOD-044 | EMF | Counter; dimensions: pool, failure_reason |
| session_step_up_total | MOD-068 | EMF | Counter; dimensions: jurisdiction, outcome |
EMF example¶
The following emits posting_committed_total from MOD-001. The _aws key signals EMF to CloudWatch; the rest of the object is a normal log entry and will be indexed by MOD-076 as a log record as well.
```python
import json
from datetime import datetime, timezone


def emit_posting_committed_metric(
    trace_id: str,
    correlation_id: str,
    jurisdiction: str,
    currency: str,
) -> None:
    record = {
        "_aws": {
            "Timestamp": int(datetime.now(timezone.utc).timestamp() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "bank/modules",
                    "Dimensions": [["module_id", "jurisdiction", "currency"]],
                    "Metrics": [{"Name": "posting_committed_total", "Unit": "Count"}],
                }
            ],
        },
        "trace_id": trace_id,
        "correlation_id": correlation_id,
        "module_id": "MOD-001",
        "jurisdiction": jurisdiction,
        "currency": currency,
        "posting_committed_total": 1,
        "level": "INFO",
        "event_type": "posting_committed",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(record))
```
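The same pattern generalises to the duration metrics in the table. The sketch below builds the EMF part of a record for any metric; `emf_record` is a hypothetical helper, the `bank/modules` namespace is taken from the example above, and the provider value in the usage line is a placeholder. The caller is still expected to merge in the mandatory log fields before printing.

```python
import json
import time


def emf_record(
    metric: str,
    value: float,
    unit: str,
    dimensions: dict,
    namespace: str = "bank/modules",
) -> dict:
    """Build the EMF envelope plus the dimension and metric values it refers to."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One dimension set containing every supplied dimension key.
                    "Dimensions": [list(dimensions)],
                    "Metrics": [{"Name": metric, "Unit": unit}],
                }
            ],
        },
        **dimensions,
        metric: value,
    }
```

For example, `emf_record("kyc_verification_duration_ms", 812, "Milliseconds", {"jurisdiction": "NZ", "provider": "example-provider", "outcome": "pass"})` yields a record whose metric value and dimension values sit at the top level, as EMF requires.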
Service-level objectives¶
The following SLOs are monitored by MOD-076. Breach of the alert threshold triggers a PagerDuty notification to the on-call engineer. The alert payload includes: SLO name, current value, threshold, module_id, jurisdiction, and a link to the pre-built MOD-076 dashboard for that module.
| SLO | Target | Alert threshold | Owner |
|---|---|---|---|
| API Gateway → posting success rate | ≥ 99.9% over 5 min | < 99.5% | SD01 |
| Posting p99 latency | ≤ 10 ms | > 20 ms | SD01 |
| Balance query p99 latency | ≤ 5 ms | > 10 ms | SD01 / MOD-003 |
| eIDV completion p99 latency | ≤ 8 s | > 12 s | SD02 / MOD-009 |
| Sanctions screening p99 latency | ≤ 500 ms | > 750 ms | SD02 / MOD-013 |
| JWT validation p99 latency | ≤ 10 ms | > 25 ms | SD07 / MOD-044 |
| CDC replication lag | ≤ 5 min | > 5 min | SD07 / MOD-042 |
| Payment API success rate | ≥ 99.9% over 5 min | < 99.5% | SD04 |
| MTTD for anomalous access | ≤ 5 min | — | MOD-076 / NFR-023 |
SLO definitions are owned by the system domain listed. Changes to target or alert threshold require approval from the owning system's tech lead and a corresponding update to this page.
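The gap between target and alert threshold is easiest to see with numbers. The sketch below is illustrative arithmetic only, not how MOD-076 evaluates SLOs; the function names are hypothetical, and treating an empty window as healthy is an assumption.

```python
def success_rate(invocations: int, errors: int) -> float:
    """Fraction of successful invocations in a window."""
    if invocations == 0:
        return 1.0  # assumption: an empty window counts as healthy
    return (invocations - errors) / invocations


def breaches_alert(rate: float, alert_below: float = 0.995) -> bool:
    """True when the rate falls below the alert threshold (99.5% by default)."""
    return rate < alert_below
```

With 10,000 invocations in a 5-minute window, 30 errors give a rate of 99.7%: below the 99.9% target but above the 99.5% alert threshold, so nobody is paged. At 60 errors the rate is 99.4% and the alert fires.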
Dashboard standard¶
Every module must have a MOD-076 dashboard provisioned as part of its build. The dashboard is defined as infrastructure code in the module's repo — a CloudWatch dashboard JSON in the IaC directory. It is deployed alongside the module and is a required build artefact; a module is not considered Built without it.
Dashboard name convention: bank-{env}-{module_id} — for example, bank-prod-MOD-001, bank-staging-MOD-009.
Each dashboard must contain:
- Error rate (5-minute rolling window)
- p50 / p95 / p99 latency
- Invocation count
- Throttle count
- Any module-specific custom metrics listed in the required custom metrics table above
- DLQ depth (where the module writes to or reads from a queue)
The standard panel layout and widget configuration are defined in MOD-076's IaC templates. Module IaC imports the template and overrides with module-specific metrics. This ensures consistent layout across dashboards and reduces per-module dashboard work.
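To make the shape of a dashboard definition concrete, the sketch below generates a CloudWatch dashboard body with four of the standard Lambda panels. It is not the MOD-076 template: the helper names, the 2×2 layout, and the choice of a raw error count instead of a computed error rate are all assumptions made for illustration.

```python
import json


def dashboard_name(env: str, module_id: str) -> str:
    """Apply the bank-{env}-{module_id} naming convention."""
    return f"bank-{env}-{module_id}"


def lambda_widget(title: str, metric: str, stat: str,
                  function_name: str, region: str, x: int, y: int) -> dict:
    """One metric widget on CloudWatch's 24-column dashboard grid."""
    return {
        "type": "metric",
        "x": x, "y": y, "width": 12, "height": 6,
        "properties": {
            "title": title,
            "metrics": [["AWS/Lambda", metric, "FunctionName", function_name]],
            "stat": stat,
            "period": 300,  # 5-minute window
            "region": region,
        },
    }


def dashboard_body(function_name: str, region: str) -> str:
    """Serialise a minimal dashboard body for one Lambda function."""
    widgets = [
        lambda_widget("Errors (5 min)", "Errors", "Sum", function_name, region, 0, 0),
        lambda_widget("p99 latency", "Duration", "p99", function_name, region, 12, 0),
        lambda_widget("Invocations", "Invocations", "Sum", function_name, region, 0, 6),
        lambda_widget("Throttles", "Throttles", "Sum", function_name, region, 12, 6),
    ]
    return json.dumps({"widgets": widgets})
```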
Log retention¶
| Log type | Retention | Storage |
|---|---|---|
| Operational Lambda logs (hot) | 90 days | CloudWatch Logs |
| Operational Lambda logs (cold archive) | 7 years | S3 (KMS-encrypted, bank/operational key) |
| Security and auth events | 7 years | CloudWatch Logs + S3 archive |
| Audit trail events | 7 years | Postgres (append-only) + Snowflake replica |
Hot retention of 90 days is the binding requirement (FR-307). CloudWatch Logs accepts only a fixed set of retention values; 90 days is one of them.
Log archival from CloudWatch Logs to S3 is handled by MOD-076 via CloudWatch Logs subscription → Kinesis Firehose → S3. Module code does not manage archival directly. Module code does not configure retention policies on log groups — that is owned by the MOD-076 IaC.
CloudWatch Logs subscription filter limit: AWS enforces a maximum of 2 subscription filters per log group. MOD-076 occupies one filter slot for archival (Firehose), which leaves one slot free. Any further purposes (e.g. error pattern routing) must share that remaining slot via a combined filter pattern. Module IaC must not add a filter without first checking the existing filter count on the target log group.
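A pre-deployment check for the filter count could look like the sketch below. It assumes a boto3 CloudWatch Logs client is passed in (for example `boto3.client("logs")`) and uses its `describe_subscription_filters` call; the function names and the constant are hypothetical.

```python
MAX_SUBSCRIPTION_FILTERS = 2  # per-log-group limit stated in this standard


def count_subscription_filters(logs_client, log_group: str) -> int:
    """Count the subscription filters on a log group, following pagination."""
    count, token = 0, None
    while True:
        kwargs = {"logGroupName": log_group}
        if token:
            kwargs["nextToken"] = token
        page = logs_client.describe_subscription_filters(**kwargs)
        count += len(page.get("subscriptionFilters", []))
        token = page.get("nextToken")
        if not token:
            return count


def can_add_filter(logs_client, log_group: str) -> bool:
    """True only if a free filter slot remains on the log group."""
    return count_subscription_filters(logs_client, log_group) < MAX_SUBSCRIPTION_FILTERS
```

Injecting the client keeps the check easy to exercise with a stub in CI, without AWS credentials.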
The Postgres audit trail (MOD-002) is append-only and not subject to CloudWatch retention rules. It is replicated to Snowflake for analytical access. Operational Lambda logs are not a substitute for the audit trail and must not be used as one.