Error handling and resilience standard

Related: Delivery methodology · Observability standard · Interface contracts · MOD-076


Error classification

Every error a module can encounter must be classified before deciding how to handle it. The four error classes are:

| Class | Description | Retryable | Examples |
|---|---|---|---|
| TRANSIENT_INFRA | Temporary infrastructure unavailability | Yes | DB connection timeout, Lambda cold start throttle, downstream Lambda 429, external API 503 |
| PROVIDER_ERROR | External provider returned an error response (not a timeout) | Conditionally | DVS/DIA 500, Onfido 422, sanctions list provider 503. Retry if the provider indicates a temporary condition; do not retry if the provider returns a definitive result |
| VALIDATION_FAILURE | Input does not conform to schema or business rules | No | Missing required field, invalid currency code, unbalanced posting legs, unsupported document type |
| COMPLIANCE_BLOCK | A regulatory gate prevented the action | No (the condition must be resolved first) | KYC not verified, active sanctions match, account not active, insufficient scope |

Every exception thrown in a Lambda must be caught at the outermost handler, classified, and produce: (1) a structured error log entry per the observability standard, (2) the correct error envelope response or event routing, and (3) a metric increment via EMF.
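The classification step can be sketched as a single function that the outermost handler calls on any caught exception. This is a minimal illustration, not the standard's implementation: the exception class names mirror those used in the handler pattern later in this document, while `ProviderError`, its `temporary` attribute, and the `classify` function itself are hypothetical names introduced here.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT_INFRA = "TRANSIENT_INFRA"
    PROVIDER_ERROR = "PROVIDER_ERROR"
    VALIDATION_FAILURE = "VALIDATION_FAILURE"
    COMPLIANCE_BLOCK = "COMPLIANCE_BLOCK"


# Hypothetical exception types; real modules declare their own.
class TransientError(Exception):
    pass


class ProviderError(Exception):
    def __init__(self, message, temporary=False):
        super().__init__(message)
        # Assumption: the caller sets this when the provider signals a temporary condition.
        self.temporary = temporary


class ValidationError(Exception):
    pass


class ComplianceBlock(Exception):
    pass


def classify(exc):
    """Return (error_class, retryable); (None, False) for unclassified exceptions."""
    if isinstance(exc, TransientError):
        return ErrorClass.TRANSIENT_INFRA, True
    if isinstance(exc, ProviderError):
        # Conditionally retryable: only when the provider indicates a temporary failure.
        return ErrorClass.PROVIDER_ERROR, exc.temporary
    if isinstance(exc, ValidationError):
        return ErrorClass.VALIDATION_FAILURE, False
    if isinstance(exc, ComplianceBlock):
        return ErrorClass.COMPLIANCE_BLOCK, False
    return None, False
```

An unclassified exception returns `(None, False)` here so that the caller can treat it as an unexpected internal error.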

Standard error envelope

The authoritative shape is defined in interface-contracts.md. It is reproduced here for reference:

{
  "error_code":      "ACCOUNT_NOT_ACTIVE",
  "error_message":   "Account a1b2c3d4 is in PENDING state — payment cannot be initiated.",
  "request_id":      "b12c3d4e-5f67-8901-abcd-ef1234567890",
  "idempotency_key": "pay-9988776655",
  "retryable":       false
}

error_code is drawn from each module's declared error code enumeration (documented in the module's design doc). It is a stable machine key — never a free-text string. error_message is for operator and developer consumption only; it must never be surfaced to the customer.
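A helper that builds this envelope might look like the sketch below. The name `error_response` matches the call used in the handler pattern in the next section, but the signature is an assumption: in particular, passing `request_id` and `idempotency_key` as optional keyword arguments is illustrative, and the response shape assumes an API Gateway Lambda proxy integration.

```python
import json


def error_response(status_code, error_code, error_message, retryable,
                   request_id=None, idempotency_key=None):
    """Build a proxy-style response whose body is the standard error envelope."""
    envelope = {
        "error_code": error_code,          # stable machine key from the module's enumeration
        "error_message": error_message,    # operator/developer text, never shown to customers
        "request_id": request_id,
        "idempotency_key": idempotency_key,
        "retryable": retryable,
    }
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(envelope),
    }
```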

Synchronous error handling (API Gateway → Lambda)

The following pattern is the required structure for the outermost Lambda handler on any synchronous endpoint:

import json

def handler(event, context):
    trace_id, correlation_id = extract_trace_context(event, context)
    try:
        result = process(event, trace_id)
        return {"statusCode": 200, "body": json.dumps(result)}
    except ValidationError as e:
        # VALIDATION_FAILURE: request understood but rejected; not retryable.
        log_error(trace_id, correlation_id, e, retryable=False)
        return error_response(422, e.error_code, str(e), retryable=False)
    except ComplianceBlock as e:
        # COMPLIANCE_BLOCK: also 422; 401 and 403 are reserved for the MOD-044 authorizer.
        log_error(trace_id, correlation_id, e, retryable=False)
        return error_response(422, e.error_code, str(e), retryable=False)
    except TransientError as e:
        # TRANSIENT_INFRA: the caller may retry.
        log_error(trace_id, correlation_id, e, retryable=True)
        return error_response(503, "SERVICE_UNAVAILABLE", str(e), retryable=True)
    except Exception as e:
        # Unclassified: log full detail, return only an opaque message.
        log_error(trace_id, correlation_id, e, retryable=False)
        return error_response(500, "INTERNAL_ERROR", "Unexpected error", retryable=False)

Rules:

  • HTTP 422 for validation failures and compliance blocks — the request was understood but rejected
  • HTTP 503 for transient infra errors; the response carries retryable: true, which signals that the caller should retry
  • HTTP 500 for unexpected exceptions — never retried; requires investigation
  • HTTP 401 and 403 are returned by the MOD-044 authorizer — module code never sets these directly
  • Never return a stack trace in the response body — log it, return an opaque request_id to the caller

Asynchronous error handling (EventBridge consumer)

EventBridge rule targets are Lambda functions. On Lambda error — any unhandled exception or explicit failure — EventBridge retries per the retry policy configured on the target. After all retries are exhausted, the event is sent to the target's dead-letter SQS queue. DLQs are provisioned per system domain by MOD-104.

Retry policy

| Scenario | Max attempts | Backoff strategy | Notes |
|---|---|---|---|
| Transient infrastructure error | 3 | Exponential, 1 s base, max 30 s | Lambda must propagate the failure (raise, not return) to trigger the EventBridge retry |
| External provider rate limit | 5 | Exponential, 5 s base, max 120 s | Detect via HTTP 429 from provider |
| Business rule or compliance failure | 1 | None | Will fail again, so route to DLQ immediately; do not consume retries on unresolvable failures |
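To make the backoff figures concrete, the sketch below computes capped exponential delays for the two retryable scenarios. It assumes plain doubling with no jitter, which the table does not specify, and `backoff_delay` is a hypothetical helper name used only for illustration.

```python
def backoff_delay(attempt, base_seconds, max_seconds):
    """Delay before retry number `attempt` (1-based): base * 2^(attempt - 1), capped."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(base_seconds * 2 ** (attempt - 1), max_seconds)


# Transient infrastructure error: 1 s base, 30 s cap.
transient_delays = [backoff_delay(n, 1, 30) for n in range(1, 4)]       # [1, 2, 4]

# External provider rate limit: 5 s base, 120 s cap.
rate_limit_delays = [backoff_delay(n, 5, 120) for n in range(1, 6)]     # [5, 10, 20, 40, 80]
```

With these attempt counts neither cap is reached; the caps only take effect if a module raises its maximum attempts.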

Business rule failures must be caught and converted to a dead-letter write before raising. The module should write a structured failure record directly to the DLQ SQS queue via boto3, rather than relying on the EventBridge retry chain for compliance blocks. This prevents consuming retry budget on events that cannot succeed.
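A direct dead-letter write could be sketched as follows. The record's key names are assumptions based on the fields listed under DLQ processing, and the SQS client is passed in as a parameter so the example runs without AWS access; in a real module it would typically be `boto3.client("sqs")`, whose `send_message` call takes `QueueUrl` and `MessageBody`.

```python
import json


def build_failure_record(event, error_class, error_code, trace_id, module_id, attempt_count):
    """Assemble the structured failure record (key names are illustrative)."""
    return {
        "original_event": event,
        "error_class": error_class,
        "error_code": error_code,
        "trace_id": trace_id,
        "module_id": module_id,
        "attempt_count": attempt_count,
    }


def write_to_dlq(sqs_client, queue_url, record):
    """Send the record to the domain DLQ; sqs_client is boto3.client('sqs') or compatible."""
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(record))
```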

DLQ processing

Each system domain has one DLQ provisioned by MOD-104, named bank-{domain}-dlq-{env}.

DLQ processing follows this sequence:

  1. MOD-076 monitors DLQ depth and fires a dlq.depth_breach alert when depth exceeds 0 for more than 5 minutes.
  2. The on-call engineer inspects the dead-lettered message. The message body contains: original event, error class, error_code, trace_id, module_id, and attempt count.
  3. Resolution paths: (a) fix the upstream data condition and replay the event; (b) mark as permanently failed and write a structured rejection record to the audit trail; (c) escalate to compliance if the DLQ message relates to a compliance block.

Poison pill handling

A poison pill is an event that will never succeed regardless of retries — it is structurally malformed, references a deleted entity, or carries an unresolvable business state.

After 3 retry failures for any event that is not a transient error, the consuming Lambda must:

  1. Write a processing.poison_pill_detected audit log entry with the full event payload, error classification, and trace_id.
  2. Move the event to the permanent-failure SQS queue (bank-{domain}-failed-{env}).
  3. Return success (do not raise again) — this prevents the event cycling endlessly through the retry chain.
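The three steps above can be sketched as a single guard function. How the failure count is tracked across invocations is not specified in this standard, so `failure_count` is treated as an input here; `audit_log` and `sqs_client` are injected stand-ins for the audit writer and a boto3 SQS client, and all names are hypothetical.

```python
import json

MAX_NON_TRANSIENT_FAILURES = 3  # threshold from the rule above


def quarantine_if_poison(event, error_class, error_code, trace_id, failure_count,
                         audit_log, sqs_client, failed_queue_url):
    """Return True if the event was quarantined and the handler should return success."""
    if error_class == "TRANSIENT_INFRA" or failure_count < MAX_NON_TRANSIENT_FAILURES:
        return False  # caller re-raises so the normal retry chain continues

    entry = {
        "event": event,
        "error_class": error_class,
        "error_code": error_code,
        "trace_id": trace_id,
    }
    # Step 1: audit log entry with full payload, classification, and trace_id.
    audit_log("processing.poison_pill_detected", entry)
    # Step 2: move the event to the permanent-failure queue.
    sqs_client.send_message(QueueUrl=failed_queue_url, MessageBody=json.dumps(entry))
    # Step 3: signal the handler to return success rather than raise again.
    return True
```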

Partial failure in multi-step flows

Multi-step flows — for example, payment initiation: validate → post → publish event → update state — must treat partial completion as a failure requiring rollback or compensation.

Rules:

  1. Prefer atomic units. If two writes can be placed in the same Postgres transaction, do so rather than splitting them into a multi-step sequence.
  2. Compensation over rollback. For steps that have already committed to durable stores (Postgres), issue a compensating action (reversal posting, state revert) rather than relying on distributed rollback.
  3. Idempotency gates. Every step in a multi-step flow checks its idempotency key before executing — so replaying a partially completed flow does not re-execute completed steps. The idempotency_keys table is defined in methodology.md.
  4. Publish-last. EventBridge event publication is always the final step — after all durable writes have committed. A published event signals that all upstream state is consistent.
  5. Never publish a partially completed state. If the Postgres commit succeeded but the EventBridge publish failed, the module must retry the publish (using the same event payload reconstructed from the committed Postgres record) — not republish with a new event_id.
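Rule 5 depends on the event being reproducible from the committed record. One way to achieve that, shown here purely as an assumption, is to derive the `event_id` deterministically from the record with `uuid.uuid5`; a module could equally store the `event_id` in the committed row. The record fields, `build_event`, and the injected `publish` callable are hypothetical.

```python
import uuid


def build_event(record):
    """Rebuild the event from the committed record; the same record gives the same event_id."""
    name = f"bank.core.posting_completed/{record['posting_id']}"
    return {
        "event_id": str(uuid.uuid5(uuid.NAMESPACE_URL, name)),
        "detail_type": "bank.core.posting_completed",
        "detail": {"posting_id": record["posting_id"], "amount": record["amount"]},
    }


def publish_with_retry(record, publish, max_attempts=3):
    """Attempt the publish up to max_attempts times; return True on success."""
    for _ in range(max_attempts):
        try:
            publish(build_event(record))  # payload is rebuilt, never regenerated with a new id
            return True
        except Exception:
            continue  # sketch only: a real module would log and back off here
    return False
```

Because every attempt carries the same `event_id`, a consumer that deduplicates on that field sees a retried publish as the same event.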

Payment initiation example:

1. Check idempotency_key → already processed? return stored result.
2. Validate payment (MOD-020 checks) → ValidationError? return 422, no side effects.
3. BEGIN TRANSACTION
   a. Write idempotency_key record (status=IN_PROGRESS)
   b. Post debit/credit legs (MOD-001)
   c. Write payment audit record (MOD-022)
   d. Update idempotency_key record (status=COMPLETE, result=payload)
   COMMIT — all or nothing
4. Publish bank.core.posting_completed to EventBridge.
   → If publish fails: retry up to 3 times (event payload is deterministic from committed record).
   → If all retries fail: write to DLQ with reconstruct=true flag; background job retries.
5. Return 200 with posting_id.
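Steps 1 and 3 of the example can be illustrated with a small runnable sketch. It uses an in-memory `sqlite3` database as a stand-in for Postgres, and the table layouts are simplified assumptions (the authoritative idempotency_keys schema is in methodology.md). Validation, the audit record, and event publication are omitted.

```python
import json
import sqlite3


def init_schema(conn):
    with conn:
        conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, status TEXT, result TEXT)")
        conn.execute("CREATE TABLE postings (posting_id TEXT PRIMARY KEY, amount INTEGER)")


def initiate_payment(conn, idempotency_key, posting_id, amount):
    # Step 1: idempotency gate. A completed key returns the stored result.
    row = conn.execute(
        "SELECT status, result FROM idempotency_keys WHERE key = ?", (idempotency_key,)
    ).fetchone()
    if row and row[0] == "COMPLETE":
        return json.loads(row[1])

    result = {"posting_id": posting_id}
    # Step 3: one transaction. `with conn` commits on success, rolls back on any exception.
    with conn:
        conn.execute(
            "INSERT INTO idempotency_keys VALUES (?, 'IN_PROGRESS', NULL)", (idempotency_key,)
        )
        conn.execute("INSERT INTO postings VALUES (?, ?)", (posting_id, amount))
        conn.execute(
            "UPDATE idempotency_keys SET status = 'COMPLETE', result = ? WHERE key = ?",
            (json.dumps(result), idempotency_key),
        )
    return result
```

Replaying the same idempotency key returns the stored result without writing a second posting, and a failure inside the transaction leaves no idempotency record behind.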

External provider resilience

For calls to external providers (DVS, DIA, Onfido, Equifax, sanctions list providers, SWIFT, Akahu):

| Pattern | Implementation | When to apply |
|---|---|---|
| Timeout | Set explicit timeout_seconds on every HTTP call (default: 8 s for identity providers; 3 s for payment providers) | Always |
| Retry with backoff | 3 attempts, exponential backoff, 1 s base | TRANSIENT_INFRA and provider 5xx only |
| Provider fallback | Alternative provider preconfigured (e.g. Onfido → alternative liveness provider) | Where a fallback provider is registered in module config |
| Outcome degradation | For non-blocking checks: if provider unavailable, treat as PENDING_EDD (not failed) and schedule retry | eIDV only; never for sanctions screening, which must block on provider failure |
| Hard fail on sanctions | If the sanctions provider is unavailable after all retries, the payment or onboarding must be blocked; do not proceed without a screening result | MOD-013, MOD-020 |
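The difference between outcome degradation and the sanctions hard fail can be expressed as a small decision function. The check-type labels and the `BLOCKED` return value are hypothetical; only `PENDING_EDD` comes from the table above.

```python
def outcome_when_provider_unavailable(check_type):
    """Decide the outcome once all retries against the provider are exhausted."""
    if check_type == "sanctions":
        # Hard fail: never proceed without a screening result.
        return "BLOCKED"
    if check_type == "eidv":
        # Degrade: not a failure; the check is rescheduled.
        return "PENDING_EDD"
    # Any other check type has no degradation rule defined here.
    raise ValueError(f"no degradation rule for check type: {check_type}")
```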

Timeouts and retry counts are module-level config values, not hardcoded. They are documented in each module's design document and overridable via environment variables.
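A minimal sketch of config loading with environment overrides is shown below. The environment variable naming scheme (`<PROVIDER>_<SETTING>`) and the default keys are assumptions made for illustration; each module's design document defines the real names.

```python
import os

# Illustrative defaults; real values live in each module's design document.
DEFAULTS = {"timeout_seconds": 8.0, "max_attempts": 3, "backoff_base_seconds": 1.0}


def load_provider_config(provider, defaults=DEFAULTS, environ=None):
    """Return provider settings, letting environment variables override the defaults."""
    environ = os.environ if environ is None else environ
    prefix = provider.upper() + "_"
    config = {}
    for name, default in defaults.items():
        raw = environ.get(prefix + name.upper())
        # Cast the override to the default's type (float or int in this sketch).
        config[name] = type(default)(raw) if raw is not None else default
    return config
```

For example, setting `ONFIDO_TIMEOUT_SECONDS=3` would lower the timeout for that provider while leaving the retry settings at their defaults.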

Circuit breaker pattern

There is no circuit breaker library in use. The pattern is implemented manually in modules that call high-failure-risk external providers.

The circuit breaker is a three-state machine with shared state stored in DynamoDB:

  • Closed: all calls pass through normally.
  • Open: calls fail fast without hitting the provider — returns a cached result or PENDING_EDD/block depending on provider type. The circuit opens after N consecutive failures within a 60-second window (N is configurable, default 5).
  • Half-open: after a configurable cool-down (default 30 s), one probe call is allowed through. If it succeeds, the circuit closes. If it fails, the circuit re-opens.

The state record in DynamoDB uses the key {module_id}#{provider} with the following fields:

| Field | Type | Description |
|---|---|---|
| state | CLOSED, OPEN, or HALF_OPEN | Current circuit state |
| failure_count | int | Consecutive failures in the current window |
| opened_at | ISO 8601 timestamp | When the circuit last opened |
| last_probe_at | ISO 8601 timestamp | When the last half-open probe was attempted |

Circuit breaker state is global per Lambda fleet, not per Lambda instance. DynamoDB is the shared state store. The circuit breaker DynamoDB table is provisioned by MOD-104.
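The state transitions can be sketched as two pure functions over the state record. This is a simplified illustration, not the modules' implementation: the record is a plain dict standing in for the DynamoDB item, the 60-second failure window and the conditional writes needed for safe concurrent updates are omitted, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FAILURE_THRESHOLD = 5                 # default N
COOL_DOWN = timedelta(seconds=30)     # default cool-down


def new_record():
    return {"state": "CLOSED", "failure_count": 0, "opened_at": None, "last_probe_at": None}


def allow_call(record, now):
    """Return True if a call may reach the provider; may move OPEN to HALF_OPEN."""
    if record["state"] == "CLOSED":
        return True
    if record["state"] == "OPEN":
        if now - datetime.fromisoformat(record["opened_at"]) >= COOL_DOWN:
            record["state"] = "HALF_OPEN"
            record["last_probe_at"] = now.isoformat()
            return True  # the single probe call
    return False  # OPEN within cool-down, or HALF_OPEN with a probe already out


def record_result(record, success, now):
    """Update the record after a call that allow_call let through."""
    if success:
        record.update(state="CLOSED", failure_count=0)
        return
    record["failure_count"] += 1
    if record["state"] == "HALF_OPEN" or record["failure_count"] >= FAILURE_THRESHOLD:
        record.update(state="OPEN", opened_at=now.isoformat())
```

When `allow_call` returns False the caller fails fast and applies the provider-specific fallback described above.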

Bulkhead

Lambda reserved concurrency is the bulkhead mechanism. Reserved concurrency is a setting on the Lambda function itself, not on its execution role: each system domain's Lambda functions have a reserved concurrency limit set in the module's IaC, preventing one domain's traffic spike from starving another domain's execution capacity.

Recommended reserved concurrency baseline (adjustable per load profile):

| Domain | Reserved concurrency |
|---|---|
| SD01 Core Banking | 100 |
| SD02 KYC | 50 |
| SD03 AML | 50 |
| SD04 Payments | 150 |
| SD05 Credit | 30 |
| SD07 Platform | 50 |
| SD08 App | 200 |

Reserved concurrency values are set in each module's IaC and reviewed quarterly. They are not managed centrally — each module owner is responsible for the value declared in their module's infrastructure configuration.