Technical design — MOD-062 Workflow orchestration engine

Module: MOD-062 (Workflow orchestration engine)
System: SD07 (Data Platform & Governance Infrastructure)
Repo: bank-platform
FR scope: FR-293, FR-294, FR-295, FR-296
Policies satisfied: none declared
Author: AI agent (Claude Opus 4.7)
Date: 2026-04-19
Stage covered: dev; not deployed this pass (tests run locally; deploy gated on orchestrator review)


Objective

MOD-062 is the durable runtime that executes multi-step business processes (account opening, payment, KYC uplift, complaint handling). It replaces ad-hoc cross-module glue with a declarative graph of steps, transitions and decision gates that can pause for human approval, retry on failure, and escalate when stuck.

Each workflow is a YAML file under src/workflows/*.yaml. At deploy time the compiler projects each YAML into an Amazon States Language (ASL) definition and instantiates an AWS Step Functions standard state machine. The state machine invokes a shared transition-handler Lambda for every step; the handler records the transition in DynamoDB, publishes a workflow_step_completed event to the owning domain's EventBridge bus, and (for approval gates) enqueues a durable ticket that resumes the workflow on response.

The module covers FR-293 through FR-296:

  1. FR-293 — Every workflow declares its step sequence, decision gates, and SLAs in YAML. The compiler validates the shape at deploy time; any missing SLA or undefined next fails the build.
  2. FR-294 — Every transition publishes bank.platform.workflow_step_completed to the owning domain's bus with workflow_id, step_name, step_result, and actor.
  3. FR-295 — Human-approval steps enqueue a ticket carrying the Step Functions task token onto the approval queue. Step Functions holds the execution paused until a SendTaskSuccess / SendTaskFailure token callback arrives — pause is durable across Lambda restarts.
  4. FR-296 — Each TASK step retries 3× with exponential backoff inside Step Functions; on exhaustion the ASL Catch routes to the retry queue. The retry queue redrives up to its own maxReceiveCount and then lands in the retry DLQ, which triggers the escalation-notifier → MOD-076 alerts topic.
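
To make the FR-294 contract concrete, here is a minimal sketch of how a transition might be shaped into an EventBridge PutEvents entry. The four detail fields come from FR-294. The bus naming convention (`<owning_domain>-<env>`), the split of the event name into `Source` and `DetailType`, and the helper name `buildStepCompletedEntry` are assumptions made for illustration, not the module's confirmed implementation.

```typescript
// Subset of the fields accepted by an EventBridge PutEvents entry.
interface PutEventsEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Detail: string; // JSON-encoded payload
}

// The four fields FR-294 requires on every transition event.
interface Transition {
  workflow_id: string;
  step_name: string;
  step_result: string;
  actor: string;
}

// Hypothetical helper. The "<owning_domain>-<env>" bus name and the
// Source / DetailType split are assumptions, not confirmed conventions.
function buildStepCompletedEntry(
  owningDomain: string,
  env: string,
  t: Transition,
): PutEventsEntry {
  return {
    EventBusName: `${owningDomain}-${env}`,
    Source: "bank.platform",
    DetailType: "workflow_step_completed",
    Detail: JSON.stringify({
      workflow_id: t.workflow_id,
      step_name: t.step_name,
      step_result: t.step_result,
      actor: t.actor,
    }),
  };
}
```

Keeping the builder pure (no SDK call inside) would let the FR-294 integration test assert on the payload without touching AWS.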

Execution model

Aspect | Decision
IaC tool | SST v3 Ion + raw @pulumi/aws resources (ADR-025)
Orchestrator | AWS Step Functions STANDARD: durable, long-running, with token callbacks
Lambda runtime | Node.js 20 on arm64; 256 MB; timeouts 10–60 s
State store | DynamoDB PAY_PER_REQUEST with TTL + PITR + stream
Queues | SQS with KMS (alias/aws/sqs); approval 12 h visibility, retry 60 s
Schema format | YAML workflow definitions validated at compile time
Region | ap-southeast-2
Tagging | module_id=MOD-062, system_id=SD07, cost_center=sd07-bank-platform

Stack layout

MOD-062-workflow-orchestration/
├── sst.config.ts
├── scripts/build-lambdas.mjs
├── src/
│   ├── workflow-compiler.ts         — YAML → ASL transform (pure)
│   ├── logger.ts                    — structured log helper
│   ├── outputs.ts                   — SSM parameter publication
│   ├── stacks/
│   │   ├── state-store.ts           — DynamoDB table bank-workflow-state
│   │   ├── approval-queue.ts        — SQS approval queue + DLQ
│   │   ├── retry-queue.ts           — SQS retry queue + DLQ (maxReceiveCount 3)
│   │   ├── workflow-lambdas.ts      — five Lambdas + shared IAM role
│   │   ├── state-machines.ts        — Step Functions state machines from YAML
│   │   └── event-wiring.ts          — inbound approval_recorded rule
│   ├── lambdas/
│   │   ├── workflow-starter/        — POST /workflow/start
│   │   ├── transition-handler/      — core: record + publish + gate
│   │   ├── approval-recorder/       — resume on approver decision
│   │   ├── retry-processor/         — apply backoff, re-emit or exhaust
│   │   └── escalation-notifier/     — DLQ → SNS alerts
│   └── workflows/
│       ├── account-opening.yaml
│       └── kyc-uplift.yaml
└── __tests__/
    ├── unit/
    │   ├── workflow-compiler.test.ts
    │   ├── retry-backoff.test.ts
    │   ├── structured-log.test.ts
    │   └── idempotency.test.ts
    └── integration/
        ├── fr-293-workflow-definition.test.ts
        ├── fr-294-event-publication.test.ts
        ├── fr-295-approval-pause.test.ts
        └── fr-296-retry-escalation.test.ts

Workflow definition format

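A workflow YAML has the following shape. The example is abbreviated: collect-application, credit-check and provision-account are referenced, but their step entries are omitted here.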

id: account-opening            # state machine name root
version: "1.0.0"               # tagged on the state machine for versioning
owning_domain: bank-core       # where transition events publish (FR-294)
start_at: collect-application
steps:
  - name: run-kyc
    type: TASK                 # TASK | APPROVAL | DECISION | WAIT | SUCCESS | FAIL
    sla_seconds: 120           # FR-293 — required for TASK + APPROVAL
    next: kyc-outcome          # terminal steps omit this
  - name: kyc-outcome
    type: DECISION
    branches:
      - when: CLEAR
        goto: credit-check
      - when: REFER
        goto: manual-approval
    default: abort
  - name: manual-approval
    type: APPROVAL             # pauses via Step Functions task token
    sla_seconds: 43200
    next: provision-account
  - name: completed
    type: SUCCESS
  - name: abort
    type: FAIL
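
The FR-293 shape check on a definition like the one above could look roughly like the sketch below. The field names mirror the YAML. The function name `validateWorkflow`, the error message wording, and the choice to require `next` on WAIT steps and `default` on DECISION steps are assumptions; the source only states that a missing SLA or an undefined `next` fails the build.

```typescript
type StepType = "TASK" | "APPROVAL" | "DECISION" | "WAIT" | "SUCCESS" | "FAIL";

interface Branch { when: string; goto: string }

interface Step {
  name: string;
  type: StepType;
  sla_seconds?: number;
  next?: string;
  branches?: Branch[];
  default?: string;
}

interface Workflow {
  id: string;
  version: string;
  owning_domain: string;
  start_at: string;
  steps: Step[];
}

// Returns a list of problems; an empty list means the shape is acceptable.
function validateWorkflow(wf: Workflow): string[] {
  const errors: string[] = [];
  const names = new Set(wf.steps.map((s) => s.name));

  // A transition target must be present and must name a declared step.
  const checkTarget = (from: string, field: string, target?: string): void => {
    if (target === undefined) errors.push(`${from}: missing ${field}`);
    else if (!names.has(target)) errors.push(`${from}: ${field} points at undefined step "${target}"`);
  };

  if (!names.has(wf.start_at)) errors.push(`start_at points at undefined step "${wf.start_at}"`);

  for (const s of wf.steps) {
    const needsSla = s.type === "TASK" || s.type === "APPROVAL";
    if (needsSla && !(typeof s.sla_seconds === "number" && s.sla_seconds > 0)) {
      errors.push(`${s.name}: missing sla_seconds`);
    }
    if (s.type === "TASK" || s.type === "APPROVAL" || s.type === "WAIT") {
      checkTarget(s.name, "next", s.next);
    }
    if (s.type === "DECISION") {
      for (const b of s.branches ?? []) checkTarget(s.name, "goto", b.goto);
      checkTarget(s.name, "default", s.default);
    }
  }
  return errors;
}
```

Run against the abbreviated example above, a validator of this kind would report the omitted steps (collect-application, credit-check, provision-account) as undefined targets, which is the build failure FR-293 describes.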

The compiler emits a Step Functions ASL definition where every TASK / APPROVAL step invokes the shared transition-handler Lambda, every DECISION becomes a Choice state keyed on $.decision, and every TASK has a Retry block (MaxAttempts: 3, BackoffRate: 2) plus a Catch that routes to a __RetryEscalation__ state.
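
The projection described above can be sketched as a per-step mapping. This is an illustration under stated assumptions, not the contents of workflow-compiler.ts: the `IntervalSeconds` value, the `States.ALL` error matcher, the payload keys passed to the handler, and the use of the `lambda:invoke.waitForTaskToken` integration for APPROVAL steps are assumed. WAIT is left out because the source does not describe its parameters.

```typescript
type CompilableType = "TASK" | "APPROVAL" | "DECISION" | "SUCCESS" | "FAIL";

interface Step {
  name: string;
  type: CompilableType;
  next?: string;
  branches?: { when: string; goto: string }[];
  default?: string;
}

type AslState = Record<string, unknown>;

// Maps one workflow step to one ASL state. handlerArn is the ARN of the
// shared transition-handler Lambda.
function compileStep(step: Step, handlerArn: string): AslState {
  switch (step.type) {
    case "TASK":
      return {
        Type: "Task",
        Resource: handlerArn,
        // MaxAttempts and BackoffRate are from the design; IntervalSeconds is assumed.
        Retry: [{ ErrorEquals: ["States.ALL"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2 }],
        Catch: [{ ErrorEquals: ["States.ALL"], Next: "__RetryEscalation__" }],
        Next: step.next,
      };
    case "APPROVAL":
      return {
        Type: "Task",
        // Callback pattern: the execution pauses until the token is returned.
        Resource: "arn:aws:states:::lambda:invoke.waitForTaskToken",
        Parameters: {
          FunctionName: handlerArn,
          Payload: { step_name: step.name, "task_token.$": "$$.Task.Token", "input.$": "$" },
        },
        Next: step.next,
      };
    case "DECISION":
      return {
        Type: "Choice",
        Choices: (step.branches ?? []).map((b) => ({
          Variable: "$.decision",
          StringEquals: b.when,
          Next: b.goto,
        })),
        Default: step.default,
      };
    case "SUCCESS":
      return { Type: "Succeed" };
    case "FAIL":
      return { Type: "Fail" };
  }
}
```

Because the function is pure, the unit tests can compare its output against expected ASL fragments without deploying anything.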


AWS resources provisioned (dev stage — planned)

State store (durable pause/resume)

Resource | Name | Notes
aws.dynamodb.Table | bank-workflow-state-dev | PK pk=workflow_id, SK sk. GSI by-type-status. TTL attribute ttl. PITR on. Stream NEW_AND_OLD_IMAGES. SSE enabled
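
As an illustration of the key layout above, a transition record might be assembled as follows. Only `pk`, `sk` and `ttl` are named in the design. The sort-key format (`STEP#<timestamp>#<step>`), the 90-day retention default, and the helper name are assumptions. DynamoDB TTL expects the attribute to hold a Unix epoch time in seconds, which is why the timestamp is divided by 1000.

```typescript
interface TransitionItem {
  pk: string;          // workflow_id
  sk: string;          // assumed format: STEP#<ISO timestamp>#<step name>
  step_result: string;
  actor: string;
  ttl: number;         // epoch seconds, as DynamoDB TTL requires
}

// Hypothetical builder for one transition row in the state table.
function buildTransitionItem(
  workflowId: string,
  stepName: string,
  stepResult: string,
  actor: string,
  at: Date,
  retentionDays = 90, // assumed retention; not stated in the design
): TransitionItem {
  return {
    pk: workflowId,
    sk: `STEP#${at.toISOString()}#${stepName}`,
    step_result: stepResult,
    actor,
    ttl: Math.floor(at.getTime() / 1000) + retentionDays * 86400,
  };
}
```

An ISO timestamp in the sort key keeps a workflow's transitions in chronological order when queried by `pk`.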

Queues (FR-295 + FR-296)

Resource | Name | Retention | Redrive
Approval queue | bank-workflow-approval-dev | 14 d | maxReceiveCount: 5 → approval DLQ
Approval DLQ | bank-workflow-approval-dlq-dev | 14 d | (none)
Retry queue | bank-workflow-retry-dev | 4 d | maxReceiveCount: 3 → retry DLQ
Retry DLQ | bank-workflow-retry-dlq-dev | 14 d | → escalation-notifier Lambda
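
For the retry queue, the backoff that retry-processor applies before re-emitting a message could be computed along these lines. The base delay of 5 seconds and the function name are assumptions; the doubling rate matches the BackoffRate used inside Step Functions, and the 900-second cap reflects the maximum per-message DelaySeconds that SQS accepts.

```typescript
// SQS rejects a per-message DelaySeconds above 900 (15 minutes).
const SQS_MAX_DELAY_SECONDS = 900;

// Exponential backoff: base * rate^(attempt - 1), capped at the SQS limit.
// attempt is 1-based. baseSeconds is an assumed default.
function backoffDelaySeconds(attempt: number, baseSeconds = 5, rate = 2): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return Math.min(baseSeconds * rate ** (attempt - 1), SQS_MAX_DELAY_SECONDS);
}
```

With these defaults the first three attempts wait 5, 10 and 20 seconds, so a message would reach the retry DLQ well inside the queue's 4-day retention.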

Lambdas

Name | Purpose | Timeout
bank-workflow-workflow-starter-dev | Starts a new workflow execution | 10 s
bank-workflow-transition-handler-dev | Step task: record + publish + gate | 30 s
bank-workflow-approval-recorder-dev | EventBridge → SendTaskSuccess / SendTaskFailure | 15 s
bank-workflow-retry-processor-dev | SQS retry queue consumer | 60 s
bank-workflow-escalation-notifier-dev | Retry DLQ consumer → SNS alerts | 10 s

Each Lambda writes to its own CloudWatch log group with 90-day retention (observability standard) and has X-Ray tracing enabled (ADR-031). All five share the role bank-workflow-lambdas-dev, which grants scoped access to DynamoDB, the two SQS queues, the alerts SNS topic, Step Functions token callbacks, and events:PutEvents on the 8 domain buses.
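
The approval-recorder's core decision, which token callback to issue, can be sketched as a pure mapping. The inbound detail shape (`task_token`, `decision`, `approver`, `reason`), the `ApprovalRejected` error code, and the function name are assumptions. The `taskToken`, `output`, `error` and `cause` fields are the inputs the Step Functions SendTaskSuccess and SendTaskFailure APIs take.

```typescript
// Assumed shape of the approval_recorded event detail.
interface ApprovalDetail {
  task_token: string;
  decision: "APPROVED" | "REJECTED";
  approver: string;
  reason?: string;
}

type TokenCallback =
  | { api: "SendTaskSuccess"; input: { taskToken: string; output: string } }
  | { api: "SendTaskFailure"; input: { taskToken: string; error: string; cause: string } };

// Chooses the callback and builds its input; the caller performs the SDK call.
function toTokenCallback(d: ApprovalDetail): TokenCallback {
  if (d.decision === "APPROVED") {
    return {
      api: "SendTaskSuccess",
      input: {
        taskToken: d.task_token,
        // output must be a JSON string; it becomes the paused state's result
        output: JSON.stringify({ decision: d.decision, actor: d.approver }),
      },
    };
  }
  return {
    api: "SendTaskFailure",
    input: {
      taskToken: d.task_token,
      error: "ApprovalRejected", // hypothetical error code
      cause: d.reason ?? `rejected by ${d.approver}`,
    },
  };
}
```

An alternative design would report a rejection through SendTaskSuccess with a REJECTED decision and let a downstream DECISION step branch on it; which of the two the module uses is not stated in the source.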

State machines

Each YAML workflow gets one Step Functions state machine, named bank-workflow-<id>-<env>. Each is STANDARD type (long duration) with ALL-level logging to /aws/vendedlogs/states/…, X-Ray tracing enabled, and a workflow_version tag stamped from the YAML version field for traceability. FR-740 workflow versioning is deferred: the tag is informational only, and no in-flight version pinning is implemented yet.

Inbound event wiring

Resource | Value
aws.cloudwatch.EventRule | bank-workflow-approval-recorded-dev: matches detail-type approval_recorded / workflow_approval_recorded on any bank.* source on the bank-platform bus
Target | approval-recorder Lambda, RetryPolicy.MaximumRetryAttempts=3, DLQ = MOD-104 platform DLQ

Outbound events (FR-294) are published directly by the transition-handler Lambda via events:PutEvents — no rule is required on the publish side.
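
The inbound rule's match condition can be written as an EventBridge event pattern using prefix matching on `source`. The pattern below is a sketch of what the rule described above might carry; the constant and function names are hypothetical, and the matcher is only a local approximation for unit tests, since real evaluation happens inside EventBridge.

```typescript
// Event pattern: prefix match on source, exact match on detail-type.
const approvalRecordedPattern = {
  source: [{ prefix: "bank." }],
  "detail-type": ["approval_recorded", "workflow_approval_recorded"],
};

interface IncomingEvent {
  source: string;
  "detail-type": string;
}

// Local approximation of the pattern evaluation, for tests only.
function matchesApprovalRecorded(event: IncomingEvent): boolean {
  const sourceOk = approvalRecordedPattern.source.some((m) =>
    event.source.startsWith(m.prefix),
  );
  const typeOk = approvalRecordedPattern["detail-type"].some(
    (t) => t === event["detail-type"],
  );
  return sourceOk && typeOk;
}
```

Serialising the pattern with `JSON.stringify` would give the string form that an event rule's pattern property expects.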


SSM outputs table (consumer contract)

All parameters are published under /bank/{env}/mod062/...

SSM path | Value | Consumed by
/bank/{env}/mod062/state-table/name | DynamoDB table name | Modules that query workflow state
/bank/{env}/mod062/state-table/arn | Table ARN | IAM policies in downstream modules
/bank/{env}/mod062/approval-queue/arn | Approval queue ARN | Staff app / ops console (ticket producer)
/bank/{env}/mod062/approval-queue/url | Approval queue URL | Staff app consumer
/bank/{env}/mod062/approval-dlq/arn | Approval DLQ ARN | MOD-076 alarms
/bank/{env}/mod062/retry-queue/arn | Retry queue ARN | Producers outside Step Functions (rare)
/bank/{env}/mod062/retry-queue/url | Retry queue URL | Ops tooling
/bank/{env}/mod062/retry-dlq/arn | Retry DLQ ARN | MOD-076 alarms
/bank/{env}/mod062/lambda/workflow-starter/arn | Lambda ARN | MOD-075 API gateway target
/bank/{env}/mod062/lambda/transition-handler/arn | Lambda ARN | State machines, ops
/bank/{env}/mod062/lambda/approval-recorder/arn | Lambda ARN | Event rule target, ops
/bank/{env}/mod062/lambda/retry-processor/arn | Lambda ARN | Ops
/bank/{env}/mod062/lambda/escalation-notifier/arn | Lambda ARN | Ops
/bank/{env}/mod062/state-machine/{workflow-id}/arn | State machine ARN | Workflow-starter callers; versioning audit
/bank/{env}/mod062/rule/approval-recorded/arn | EventBridge rule ARN | MOD-076 dashboards
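
Consumers that resolve these parameters may find it convenient to build the paths from one helper rather than hard-coding strings. The sketch below follows the path layout in the table; the function name `mod062SsmPath` is hypothetical.

```typescript
// Builds /bank/<env>/mod062/<segment>/<segment>/... from its parts.
function mod062SsmPath(env: string, ...segments: string[]): string {
  return ["", "bank", env, "mod062", ...segments].join("/");
}

// Example: the state machine ARN parameter for a given workflow id.
const accountOpeningArnPath = mod062SsmPath("dev", "state-machine", "account-opening", "arn");
```

Here `accountOpeningArnPath` evaluates to `/bank/dev/mod062/state-machine/account-opening/arn`.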

Acceptance criteria status (local run, 2026-04-19)

pnpm test (from MOD-062-workflow-orchestration/):

FR / Gate | Tests | Pass | Fail | Status
FR-293: documented step sequence + decision gates + SLAs | 4 | 4 | 0 | PASS
FR-294: transitions publish events to owning domain bus | 2 | 2 | 0 | PASS
FR-295: approval pause + durable resume | 3 | 3 | 0 | PASS
FR-296: 3 retries then escalate | 3 | 3 | 0 | PASS
Unit: workflow-compiler | 8 | 8 | 0 | PASS
Unit: retry-backoff | 6 | 6 | 0 | PASS
Unit: structured-log | 4 | 4 | 0 | PASS
Unit: idempotency | 2 | 2 | 0 | PASS
Total | 32 | 32 | 0 | 100%

Lambda module quality gates (methodology.md §Quality gates): unit ≥80% coverage, one test per FR, structured-log format test, idempotency test — all satisfied.


Operational notes

  • Deploy: AWS_PROFILE=bank-dev pnpm -F @bank-platform/mod-062-workflow-orchestration run deploy --stage <env>
  • The deploy script runs pnpm build:lambda (esbuild + workflow YAML copy) before sst deploy.
  • Remove: AWS_PROFILE=bank-dev pnpm -F @bank-platform/mod-062-workflow-orchestration run remove --stage <env>
  • Cost envelope: DynamoDB PAY_PER_REQUEST + 5 Lambdas + 4 SQS queues + 2 state machines ≈ $5–10/month at dev volumes.

Stubs / deferred work

Stub | Owner | Replace with
MOD-075 API Gateway → workflow-starter | MOD-075 | HTTP API integration once Phase 3 lands
FR-740 workflow versioning at the execution layer | MOD-062 Phase 2 (deferred) | Today the version tag is stamped on the state machine but in-flight executions always resolve to the currently-aliased state machine. Version pinning at execution start (the FR-740 scope) is a follow-up.
Visibility timeout tuning for the approval queue | Operations | 12 h is a safe default; real SLA bands will need profiling

  • Wiki spec: bank-wiki/source/entities/modules/MOD-062.{yaml,md}
  • Handoff: docs/handoffs/MOD-062-complete.handoff.md
  • Methodology: https://bank-wiki.pages.dev/delivery/methodology/
  • Event catalogue: https://bank-wiki.pages.dev/design/system/event-catalogue/ (see bank.platform.workflow_step_completed)
  • Error handling: https://bank-wiki.pages.dev/design/system/error-handling-standard/
  • ADRs in effect: ADR-023, ADR-025, ADR-029, ADR-031