Error handling and resilience standard

Error classification

Every error a module can encounter must be classified before deciding how to handle it. The four error classes are:

| Class | Description | Retryable | Examples |
| --- | --- | --- | --- |
| `TRANSIENT_INFRA` | Temporary infrastructure unavailability | Yes | DB connection timeout, Lambda cold start throttle, downstream Lambda 429, external API 503 |
| `PROVIDER_ERROR` | External provider returned an error response (not a timeout) | Conditionally: retry if the provider indicates a temporary condition; do not retry if it returns a definitive result | DVS/DIA 500, Onfido 422, sanctions list provider 503 |
| `VALIDATION_FAILURE` | Input does not conform to schema or business rules | No | Missing required field, invalid currency code, unbalanced posting legs, unsupported document type |
| `COMPLIANCE_BLOCK` | A regulatory gate prevented the action | No (the condition must be resolved first) | KYC not verified, active sanctions match, account not active, insufficient scope |

Every exception raised in a Lambda must be caught at the outermost handler and classified. The handler must then produce: (1) a structured error log entry per the observability standard, (2) the correct error envelope response or event routing, and (3) a metric increment via EMF.
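The classification step can be sketched as a small lookup from exception type to error class and retryability. This is a minimal illustration, not the modules' actual code: the exception types, the `Classification` record, and the `classify` function are hypothetical names, and the `temporary` flag on `ProviderError` is an assumed way of carrying the provider's "temporary or definitive" signal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorClass(Enum):
    TRANSIENT_INFRA = "TRANSIENT_INFRA"
    PROVIDER_ERROR = "PROVIDER_ERROR"
    VALIDATION_FAILURE = "VALIDATION_FAILURE"
    COMPLIANCE_BLOCK = "COMPLIANCE_BLOCK"


# Hypothetical exception types, one per error class.
class TransientError(Exception):
    pass


class ProviderError(Exception):
    def __init__(self, message: str, temporary: bool):
        super().__init__(message)
        self.temporary = temporary  # True if the provider signalled a temporary fault


class ValidationError(Exception):
    pass


class ComplianceBlock(Exception):
    pass


@dataclass(frozen=True)
class Classification:
    error_class: Optional[ErrorClass]  # None marks an unexpected, unclassified exception
    retryable: bool


def classify(exc: Exception) -> Classification:
    """Map an exception to its error class and retryability."""
    if isinstance(exc, TransientError):
        return Classification(ErrorClass.TRANSIENT_INFRA, True)
    if isinstance(exc, ProviderError):
        # Conditionally retryable: follow the provider's own signal.
        return Classification(ErrorClass.PROVIDER_ERROR, exc.temporary)
    if isinstance(exc, ValidationError):
        return Classification(ErrorClass.VALIDATION_FAILURE, False)
    if isinstance(exc, ComplianceBlock):
        return Classification(ErrorClass.COMPLIANCE_BLOCK, False)
    return Classification(None, False)
```

Returning a single record keeps the log entry, the envelope, and the metric consistent, because all three can be derived from the same classification result.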

Standard error envelope

The authoritative shape is defined in interface-contracts.md. It is reproduced here for reference:

{
  "error_code":      "ACCOUNT_NOT_ACTIVE",
  "error_message":   "Account a1b2c3d4 is in PENDING state — payment cannot be initiated.",
  "request_id":      "b12c3d4e-5f67-8901-abcd-ef1234567890",
  "idempotency_key": "pay-9988776655",
  "retryable":       false
}

error_code is drawn from each module's declared error code enumeration (documented in the module's design doc). It is a stable machine key — never a free-text string. error_message is for operator and developer consumption only; it must never be surfaced to the customer.
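A helper that builds this envelope might look like the sketch below. The function name `error_response` matches the call used in the synchronous handler pattern, but its body and the optional `request_id` and `idempotency_key` keyword arguments are assumptions made for illustration; interface-contracts.md remains the authoritative definition of the shape.

```python
import json


def error_response(status_code, error_code, error_message, retryable,
                   request_id=None, idempotency_key=None):
    """Build an API Gateway proxy response carrying the standard error envelope."""
    body = {
        "error_code": error_code,        # stable machine key from the module's enumeration
        "error_message": error_message,  # operator/developer text, never shown to customers
        "request_id": request_id,
        "idempotency_key": idempotency_key,
        "retryable": retryable,
    }
    return {"statusCode": status_code, "body": json.dumps(body)}
```

Keeping envelope construction in one helper means no handler can emit a response that is missing a field or that includes a stack trace.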

Synchronous error handling (API Gateway → Lambda)

The following pattern is the required structure for the outermost Lambda handler on any synchronous endpoint:

import json


def handler(event, context):
    # extract_trace_context, process, log_error and error_response are module helpers.
    trace_id, correlation_id = extract_trace_context(event, context)
    try:
        result = process(event, trace_id)
        return {"statusCode": 200, "body": json.dumps(result)}
    except ValidationError as e:
        log_error(trace_id, correlation_id, e, retryable=False)
        return error_response(422, e.error_code, str(e), retryable=False)
    except ComplianceBlock as e:
        # 422, not 403: 401 and 403 are reserved for the MOD-044 authorizer.
        log_error(trace_id, correlation_id, e, retryable=False)
        return error_response(422, e.error_code, str(e), retryable=False)
    except TransientError as e:
        log_error(trace_id, correlation_id, e, retryable=True)
        return error_response(503, "SERVICE_UNAVAILABLE", str(e), retryable=True)
    except Exception as e:
        # Unexpected: log the details, return only an opaque message to the caller.
        log_error(trace_id, correlation_id, e, retryable=False)
        return error_response(500, "INTERNAL_ERROR", "Unexpected error", retryable=False)

Rules:

HTTP 422 for validation failures and compliance blocks — the request was understood but rejected
HTTP 503 for transient infrastructure errors; the envelope carries retryable: true to signal that the caller should retry
HTTP 500 for unexpected exceptions — never retried; requires investigation
HTTP 401 and 403 are returned by the MOD-044 authorizer — module code never sets these directly
Never return a stack trace in the response body; log it and return only an opaque request_id to the caller

Asynchronous error handling (EventBridge consumer)

EventBridge rule targets are Lambda functions. On Lambda error — any unhandled exception or explicit failure — EventBridge retries per the retry policy configured on the target. After all retries are exhausted, the event is sent to the target's dead-letter SQS queue. DLQs are provisioned per system domain by MOD-104.

Retry policy

| Scenario | Max attempts | Backoff strategy | Notes |
| --- | --- | --- | --- |
| Transient infrastructure error | 3 | Exponential, 1 s base, max 30 s | Lambda must propagate the failure (raise, not return) to trigger the EventBridge retry |
| External provider rate limit | 5 | Exponential, 5 s base, max 120 s | Detect via HTTP 429 from provider |
| Business rule or compliance failure | 1 | None | The event will fail again, so route it to the DLQ immediately; do not consume retries on unresolvable failures |
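To make the backoff columns concrete, the sketch below computes the delay before each retry for a given policy. It assumes that "max attempts" includes the first attempt and that no jitter is applied; neither is stated in the table, and `backoff_delays` is a hypothetical helper name.

```python
def backoff_delays(max_attempts, base_seconds, cap_seconds):
    """Seconds to wait before each retry; the first attempt has no delay."""
    return [min(cap_seconds, base_seconds * 2 ** i) for i in range(max_attempts - 1)]


transient = backoff_delays(3, 1, 30)       # waits of 1 s, then 2 s
rate_limited = backoff_delays(5, 5, 120)   # waits of 5, 10, 20, 40 s
business_rule = backoff_delays(1, 0, 0)    # single attempt, no waits
```

With these policies the cap is never reached; it would only start to apply from the sixth retry of the rate-limit policy, where the uncapped delay would be 160 s.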

Business rule failures must be caught and converted to a dead-letter write before raising. The module should write a structured failure record directly to the DLQ SQS queue via boto3, rather than relying on the EventBridge retry chain for compliance blocks. This prevents consuming retry budget on events that cannot succeed.
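A minimal sketch of that direct write follows. `BusinessRuleFailure`, `consume`, and the record's field names are hypothetical, chosen to mirror the DLQ message body this standard describes. The `sqs_client` argument is expected to be a boto3 SQS client (for example `boto3.client("sqs")`); it is injected so the function can be exercised without AWS access.

```python
import json


class BusinessRuleFailure(Exception):
    """Hypothetical marker for validation failures and compliance blocks."""

    def __init__(self, error_class, error_code, message):
        super().__init__(message)
        self.error_class = error_class
        self.error_code = error_code


def consume(event, process, sqs_client, dlq_url, module_id, trace_id, attempt=1):
    """Run `process`; dead-letter business rule failures before re-raising them."""
    try:
        process(event)
    except BusinessRuleFailure as exc:
        record = {
            "original_event": event,
            "error_class": exc.error_class,
            "error_code": exc.error_code,
            "trace_id": trace_id,
            "module_id": module_id,
            "attempt_count": attempt,
        }
        # Structured failure record written straight to the domain DLQ.
        sqs_client.send_message(QueueUrl=dlq_url, MessageBody=json.dumps(record))
        raise
    # Any other exception (including transient ones) is not caught here, so it
    # propagates and the retry policy on the target applies.
```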

DLQ processing

Each system domain has one DLQ provisioned by MOD-104, named bank-{domain}-dlq-{env}.

DLQ processing follows this sequence:

1. MOD-076 monitors DLQ depth and fires a dlq.depth_breach alert when depth exceeds 0 for more than 5 minutes.
2. The on-call engineer inspects the dead-lettered message. The message body contains: original event, error class, error_code, trace_id, module_id, and attempt count.
3. Resolution paths: (a) fix the upstream data condition and replay the event; (b) mark as permanently failed and write a structured rejection record to the audit trail; (c) escalate to compliance if the DLQ message relates to a compliance block.

Poison pill handling

A poison pill is an event that will never succeed regardless of retries — it is structurally malformed, references a deleted entity, or carries an unresolvable business state.

After 3 retry failures for any event that is not a transient error, the consuming Lambda must:

1. Write a processing.poison_pill_detected audit log entry with the full event payload, error classification, and trace_id.
2. Move the event to the permanent-failure SQS queue (bank-{domain}-failed-{env}).
3. Return success (do not raise again). This prevents the event from cycling endlessly through the retry chain.
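The three steps can be sketched as a single helper that the consuming Lambda calls from its failure path. How the attempt count reaches the function is not specified in this standard, so `attempt` is simply passed in here; `park_poison_pill`, `audit_log`, and the record layout are hypothetical, and `sqs_client` is assumed to be a boto3 SQS client.

```python
import json

MAX_NON_TRANSIENT_FAILURES = 3


def park_poison_pill(event, error_class, attempt, trace_id,
                     audit_log, sqs_client, failed_queue_url):
    """Return True if the event was parked; the caller should then return success."""
    if error_class == "TRANSIENT_INFRA" or attempt < MAX_NON_TRANSIENT_FAILURES:
        return False  # not a poison pill (yet): let the normal failure path run

    # Step 1: audit entry with full payload, classification and trace_id.
    audit_log({
        "event_type": "processing.poison_pill_detected",
        "payload": event,
        "error_class": error_class,
        "trace_id": trace_id,
    })
    # Step 2: move the event to the permanent-failure queue.
    sqs_client.send_message(QueueUrl=failed_queue_url, MessageBody=json.dumps(event))
    # Step 3: signal the caller to return success instead of raising.
    return True
```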

Partial failure in multi-step flows

Multi-step flows — for example, payment initiation: validate → post → publish event → update state — must treat partial completion as a failure requiring rollback or compensation.

Rules:

Prefer atomic units. If two writes can be placed in the same Postgres transaction, do so rather than splitting them into a multi-step sequence.
Compensation over rollback. For steps that have already committed to durable stores (Postgres), issue a compensating action (reversal posting, state revert) rather than relying on distributed rollback.
Idempotency gates. Every step in a multi-step flow checks its idempotency key before executing — so replaying a partially completed flow does not re-execute completed steps. The idempotency_keys table is defined in methodology.md.
Publish-last. EventBridge event publication is always the final step — after all durable writes have committed. A published event signals that all upstream state is consistent.
Never publish a partially completed state. If the Postgres commit succeeded but the EventBridge publish failed, the module must retry the publish (using the same event payload reconstructed from the committed Postgres record) — not republish with a new event_id.

Payment initiation example:

1. Check idempotency_key → already processed? return stored result.
2. Validate payment (MOD-020 checks) → ValidationError? return 422, no side effects.
3. BEGIN TRANSACTION
   a. Write idempotency_key record (status=IN_PROGRESS)
   b. Post debit/credit legs (MOD-001)
   c. Write payment audit record (MOD-022)
   d. Update idempotency_key record (status=COMPLETE, result=payload)
   COMMIT — all or nothing
4. Publish bank.core.posting_completed to EventBridge.
   → If publish fails: retry up to 3 times (event payload is deterministic from committed record).
   → If all retries fail: write to DLQ with reconstruct=true flag; background job retries.
5. Return 200 with posting_id.
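The same flow can be sketched in runnable form. To stay self-contained, this illustration uses an in-memory SQLite database as a stand-in for Postgres, and reduces validation to a single amount check. The schema, the `initiate_payment` function, the injected `publish` and `dead_letter` callables, and the rule that derives `event_id` from the posting ID are all assumptions made for the example, not the modules' real interfaces.

```python
import json
import sqlite3

# Stand-in schema; the real idempotency_keys table is defined in methodology.md.
SCHEMA = """
CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, status TEXT NOT NULL, result TEXT);
CREATE TABLE postings (posting_id TEXT NOT NULL, leg TEXT NOT NULL, amount INTEGER NOT NULL);
"""


def initiate_payment(conn, publish, dead_letter, idempotency_key, amount,
                     max_publish_attempts=3):
    # Step 1: idempotency gate. A replay returns the stored result, no side effects.
    row = conn.execute(
        "SELECT result FROM idempotency_keys WHERE key = ? AND status = 'COMPLETE'",
        (idempotency_key,),
    ).fetchone()
    if row:
        return json.loads(row[0])

    # Step 2: validation happens before any write.
    if amount <= 0:
        raise ValueError("amount must be positive")

    result = {"posting_id": f"post-{idempotency_key}", "amount": amount}

    # Step 3: one transaction; `with conn` commits on success, rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO idempotency_keys (key, status) VALUES (?, 'IN_PROGRESS')",
            (idempotency_key,),
        )
        conn.executemany(
            "INSERT INTO postings (posting_id, leg, amount) VALUES (?, ?, ?)",
            [(result["posting_id"], "DEBIT", -amount),
             (result["posting_id"], "CREDIT", amount)],
        )
        conn.execute(
            "UPDATE idempotency_keys SET status = 'COMPLETE', result = ? WHERE key = ?",
            (json.dumps(result), idempotency_key),
        )

    # Step 4: publish last. The event is derived only from committed data, so
    # every retry sends the same payload and the same event_id.
    event = {"event_id": f"evt-{result['posting_id']}",
             "detail_type": "bank.core.posting_completed", "detail": result}
    for attempt in range(1, max_publish_attempts + 1):
        try:
            publish(event)
            break
        except Exception:
            if attempt == max_publish_attempts:
                dead_letter({"event": event, "reconstruct": True})

    # Step 5: the posting is committed regardless of the publish outcome.
    return result
```

Because the event is built after the commit and only from committed values, a background job could rebuild the identical event later, which is what makes the reconstruct flag workable.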

External provider resilience

For calls to external providers (DVS, DIA, Onfido, Equifax, sanctions list providers, SWIFT, Akahu):

| Pattern | Implementation | When to apply |
| --- | --- | --- |
| Timeout | Set explicit `timeout_seconds` on every HTTP call (default: 8 s for identity providers; 3 s for payment providers) | Always |
| Retry with backoff | 3 attempts, exponential backoff, 1 s base | `TRANSIENT_INFRA` and provider 5xx only |
| Provider fallback | Alternative provider preconfigured (e.g. Onfido → alternative liveness provider) | Where a fallback provider is registered in module config |
| Outcome degradation | For non-blocking checks: if provider unavailable, treat as `PENDING_EDD` (not failed) and schedule retry | eIDV only; never for sanctions screening, which must block on provider failure |
| Hard fail on sanctions | If the sanctions provider is unavailable after all retries, the payment or onboarding must be blocked; do not proceed without a screening result | MOD-013, MOD-020 |

Timeouts and retry counts are module-level config values, not hardcoded. They are documented in each module's design document and overridable via environment variables.
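As an illustration of config-driven timeouts and retries, the sketch below reads the values from environment variables and wraps a provider call in a bounded retry loop. The variable naming scheme (`{PREFIX}_TIMEOUT_SECONDS` and so on), the `TransientError` type, and both function names are hypothetical; each module's design document defines the real names and defaults.

```python
import os
import time


class TransientError(Exception):
    """Hypothetical marker for TRANSIENT_INFRA and provider 5xx failures."""


def load_provider_config(prefix, env=os.environ):
    """Read timeout and retry settings, falling back to illustrative defaults."""
    return {
        "timeout_seconds": float(env.get(f"{prefix}_TIMEOUT_SECONDS", "8")),
        "max_attempts": int(env.get(f"{prefix}_MAX_ATTEMPTS", "3")),
        "backoff_base_seconds": float(env.get(f"{prefix}_BACKOFF_BASE_SECONDS", "1")),
    }


def call_with_retry(call, config, sleep=time.sleep):
    """Invoke call(timeout_seconds), retrying only on TransientError."""
    for attempt in range(1, config["max_attempts"] + 1):
        try:
            return call(config["timeout_seconds"])
        except TransientError:
            if attempt == config["max_attempts"]:
                raise  # retries exhausted: the caller decides to block or degrade
            sleep(config["backoff_base_seconds"] * 2 ** (attempt - 1))
```

Only `TransientError` is retried; any other exception propagates on the first occurrence, which matches the rule that definitive provider results are not retried.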

Circuit breaker pattern

There is no circuit breaker library in use. The pattern is implemented manually in modules that call high-failure-risk external providers.

The circuit breaker is a three-state machine with shared state stored in DynamoDB:

Closed: all calls pass through normally.
Open: calls fail fast without hitting the provider and return a cached result, PENDING_EDD, or a block, depending on provider type. The circuit opens after N consecutive failures within a 60-second window (N is configurable, default 5).
Half-open: after a configurable cool-down (default 30 s), one probe call is allowed through. If it succeeds, the circuit closes. If it fails, the circuit re-opens.

The state record in DynamoDB uses the key {module_id}#{provider} with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `state` | `CLOSED` \| `OPEN` \| `HALF_OPEN` | Current circuit state |
| `failure_count` | int | Consecutive failures in the current window |
| `opened_at` | ISO 8601 timestamp | When the circuit last opened |
| `last_probe_at` | ISO 8601 timestamp | When the last half-open probe was attempted |

Circuit breaker state is global per Lambda fleet, not per Lambda instance. DynamoDB is the shared state store. The circuit breaker DynamoDB table is provisioned by MOD-104.
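The state machine can be sketched as follows. To keep the example runnable, a plain dict stands in for the DynamoDB table, so it does not show the conditional writes a real shared store would likely need to keep concurrent Lambda instances from racing on a transition. The `CircuitBreaker` class and its method names are hypothetical, and `window_started_at` is an extra field assumed here to track the 60-second window; it is not part of the documented record.

```python
from datetime import datetime, timedelta, timezone


class CircuitBreaker:
    """Three-state breaker; `store` is a dict standing in for the DynamoDB table."""

    def __init__(self, store, key, threshold=5, window_seconds=60,
                 cooldown_seconds=30, now=lambda: datetime.now(timezone.utc)):
        self.store, self.key, self.now = store, key, now
        self.threshold = threshold
        self.window = timedelta(seconds=window_seconds)
        self.cooldown = timedelta(seconds=cooldown_seconds)
        store.setdefault(key, {
            "state": "CLOSED", "failure_count": 0,
            "opened_at": None, "last_probe_at": None,
            "window_started_at": None,  # assumed field, not in the documented record
        })

    def allow(self):
        """Return True if a call may go to the provider right now."""
        rec, now = self.store[self.key], self.now()
        if rec["state"] == "CLOSED":
            return True
        if rec["state"] == "OPEN":
            opened = datetime.fromisoformat(rec["opened_at"])
            if now - opened >= self.cooldown:
                # Cool-down elapsed: let exactly one probe through.
                rec.update(state="HALF_OPEN", last_probe_at=now.isoformat())
                return True
        return False  # OPEN within cool-down, or HALF_OPEN with a probe in flight

    def record_success(self):
        self.store[self.key].update(
            state="CLOSED", failure_count=0, window_started_at=None)

    def record_failure(self):
        rec, now = self.store[self.key], self.now()
        if rec["state"] == "HALF_OPEN":
            rec.update(state="OPEN", opened_at=now.isoformat())  # probe failed
            return
        started = rec["window_started_at"]
        if started is None or now - datetime.fromisoformat(started) > self.window:
            rec.update(failure_count=0, window_started_at=now.isoformat())
        rec["failure_count"] += 1
        if rec["failure_count"] >= self.threshold:
            rec.update(state="OPEN", opened_at=now.isoformat())
```

A caller would check `allow()` before each provider call, then report the outcome with `record_success()` or `record_failure()`; when `allow()` returns False it applies the fail-fast behaviour for that provider type.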

Bulkhead

Lambda reserved concurrency is the bulkhead mechanism. Each system domain's Lambda functions have a reserved concurrency limit set in the module's IaC, preventing one domain's traffic spike from starving another domain's execution capacity.

Recommended reserved concurrency baseline (adjustable per load profile):

| Domain | Reserved concurrency |
| --- | --- |
| SD01 Core Banking | 100 |
| SD02 KYC | 50 |
| SD03 AML | 50 |
| SD04 Payments | 150 |
| SD05 Credit | 30 |
| SD07 Platform | 50 |
| SD08 App | 200 |

Reserved concurrency values are set in each module's IaC and reviewed quarterly. They are not managed centrally — each module owner is responsible for the value declared in their module's infrastructure configuration.
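Because the values are declared module by module, a quarterly review may find it useful to total them against the account's concurrency quota. The sketch below does that arithmetic for the baseline values above. The account limit of 1000 is an assumption for illustration only and should be replaced with the quota actually configured for the AWS account and region; `unreserved_capacity` is a hypothetical helper.

```python
# Baseline values from the table above.
RESERVED_CONCURRENCY = {
    "SD01 Core Banking": 100,
    "SD02 KYC": 50,
    "SD03 AML": 50,
    "SD04 Payments": 150,
    "SD05 Credit": 30,
    "SD07 Platform": 50,
    "SD08 App": 200,
}


def unreserved_capacity(reserved, account_limit):
    """Concurrency left in the shared pool after all reservations are deducted."""
    return account_limit - sum(reserved.values())


total_reserved = sum(RESERVED_CONCURRENCY.values())          # 630
remaining = unreserved_capacity(RESERVED_CONCURRENCY, 1000)  # 370 under the assumed limit
```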