Technical design: MOD-062 Workflow orchestration engine
Module: MOD-062 — Workflow orchestration engine
System: SD07 — Data Platform & Governance Infrastructure
Repo: bank-platform
FR scope: FR-293, FR-294, FR-295, FR-296
Policies satisfied: none declared
Author: AI agent (Claude Opus 4.7)
Date: 2026-04-19
Stage covered: dev — not deployed this pass (tests run locally; deploy gated on orchestrator review)
Objective
MOD-062 is the durable runtime that executes multi-step business processes (account opening, payment, KYC uplift, complaint handling). It replaces ad-hoc cross-module glue with a declarative graph of steps, transitions and decision gates that can pause for human approval, retry on failure, and escalate when stuck.
Each workflow is a YAML file under src/workflows/*.yaml. At deploy time the compiler projects each YAML into an Amazon States Language (ASL) definition and instantiates an AWS Step Functions standard state machine. The state machine invokes a shared transition-handler Lambda for every step; the handler records the transition in DynamoDB, publishes a workflow_step_completed event to the owning domain's EventBridge bus, and (for approval gates) enqueues a durable ticket that resumes the workflow on response.
The module covers FR-293 through FR-296:
- FR-293: Every workflow declares its step sequence, decision gates, and SLAs in YAML. The compiler validates the shape at deploy time; any missing SLA or undefined `next` fails the build.
- FR-294: Every transition publishes `bank.platform.workflow_step_completed` to the owning domain's bus with `workflow_id`, `step_name`, `step_result`, and `actor`.
- FR-295: Human-approval steps enqueue a ticket carrying the Step Functions task token onto the approval queue. Step Functions holds the execution paused until a `SendTaskSuccess` / `SendTaskFailure` token callback arrives, so the pause is durable across Lambda restarts.
- FR-296: Each TASK step retries 3× with exponential backoff inside Step Functions; on exhaustion the ASL `Catch` routes to the retry queue. The retry queue redrives up to its own `maxReceiveCount` and then lands in the retry DLQ, which triggers the escalation-notifier → MOD-076 alerts topic.
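The FR-293 build-time check can be sketched roughly as follows. This is a minimal illustration and not the module's actual compiler: the `StepDef` and `WorkflowDef` types and the `validateWorkflow` function are hypothetical names, and the sketch only covers the rules named above (missing SLA, unresolved `next`) plus branch targets.

```typescript
// Hypothetical types mirroring the YAML fields shown in this document.
type StepKind = "TASK" | "APPROVAL" | "DECISION" | "WAIT" | "SUCCESS" | "FAIL";

interface StepDef {
  name: string;
  type: StepKind;
  sla_seconds?: number;
  next?: string;
  branches?: { when: string; goto: string }[];
  default?: string;
}

interface WorkflowDef {
  id: string;
  start_at: string;
  steps: StepDef[];
}

// Returns human-readable problems; an empty array means the shape is valid.
function validateWorkflow(wf: WorkflowDef): string[] {
  const errors: string[] = [];
  const known = new Set(wf.steps.map((s) => s.name));

  const requireTarget = (owner: string, field: string, target?: string): void => {
    if (target === undefined || !known.has(target)) {
      errors.push(`${owner}: ${field} references undefined step "${target ?? ""}"`);
    }
  };

  requireTarget(wf.id, "start_at", wf.start_at);

  for (const step of wf.steps) {
    // FR-293: TASK and APPROVAL steps must declare an SLA.
    const needsSla = step.type === "TASK" || step.type === "APPROVAL";
    if (needsSla && (step.sla_seconds === undefined || step.sla_seconds <= 0)) {
      errors.push(`${step.name}: missing sla_seconds`);
    }

    if (step.type === "DECISION") {
      for (const branch of step.branches ?? []) {
        requireTarget(step.name, `branch ${branch.when}`, branch.goto);
      }
      if (step.default !== undefined) {
        requireTarget(step.name, "default", step.default);
      }
    } else if (step.type !== "SUCCESS" && step.type !== "FAIL") {
      // Non-terminal, non-decision steps need a resolvable `next`.
      requireTarget(step.name, "next", step.next);
    }
  }
  return errors;
}
```

A build script could call `validateWorkflow` on each parsed YAML file and exit non-zero when the returned array is non-empty, which would give the "fails the build" behaviour described for FR-293.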
Execution model
| Aspect | Decision |
|---|---|
| IaC tool | SST v3 Ion + raw @pulumi/aws resources (ADR-025) |
| Orchestrator | AWS Step Functions STANDARD — durable, long-running, with token callbacks |
| Lambda runtime | Node.js 20 on arm64; 256 MB; timeouts 10–60 s |
| State store | DynamoDB PAY_PER_REQUEST with TTL + PITR + stream |
| Queues | SQS with KMS (alias/aws/sqs); approval 12h visibility, retry 60s |
| Schema format | YAML workflow definitions validated at compile time |
| Region | ap-southeast-2 |
| Tagging | module_id=MOD-062, system_id=SD07, cost_center=sd07-bank-platform |
Stack layout
```text
MOD-062-workflow-orchestration/
├── sst.config.ts
├── scripts/build-lambdas.mjs
├── src/
│   ├── workflow-compiler.ts      # YAML → ASL transform (pure)
│   ├── logger.ts                 # structured log helper
│   ├── outputs.ts                # SSM parameter publication
│   ├── stacks/
│   │   ├── state-store.ts        # DynamoDB table bank-workflow-state
│   │   ├── approval-queue.ts     # SQS approval queue + DLQ
│   │   ├── retry-queue.ts        # SQS retry queue + DLQ (maxReceiveCount 3)
│   │   ├── workflow-lambdas.ts   # five Lambdas + shared IAM role
│   │   ├── state-machines.ts     # Step Functions state machines from YAML
│   │   └── event-wiring.ts       # inbound approval_recorded rule
│   ├── lambdas/
│   │   ├── workflow-starter/     # POST /workflow/start
│   │   ├── transition-handler/   # core: record + publish + gate
│   │   ├── approval-recorder/    # resume on approver decision
│   │   ├── retry-processor/      # apply backoff, re-emit or exhaust
│   │   └── escalation-notifier/  # DLQ → SNS alerts
│   └── workflows/
│       ├── account-opening.yaml
│       └── kyc-uplift.yaml
└── __tests__/
    ├── unit/
    │   ├── workflow-compiler.test.ts
    │   ├── retry-backoff.test.ts
    │   ├── structured-log.test.ts
    │   └── idempotency.test.ts
    └── integration/
        ├── fr-293-workflow-definition.test.ts
        ├── fr-294-event-publication.test.ts
        ├── fr-295-approval-pause.test.ts
        └── fr-296-retry-escalation.test.ts
```
Workflow definition format
A workflow YAML has the following shape (abridged: `collect-application`, `credit-check` and `provision-account` are referenced but not shown):
```yaml
id: account-opening            # state machine name root
version: "1.0.0"               # tagged on the state machine for versioning
owning_domain: bank-core       # where transition events publish (FR-294)
start_at: collect-application
steps:
  - name: run-kyc
    type: TASK                 # TASK | APPROVAL | DECISION | WAIT | SUCCESS | FAIL
    sla_seconds: 120           # FR-293: required for TASK + APPROVAL
    next: kyc-outcome          # terminal steps omit this
  - name: kyc-outcome
    type: DECISION
    branches:
      - when: CLEAR
        goto: credit-check
      - when: REFER
        goto: manual-approval
    default: abort
  - name: manual-approval
    type: APPROVAL             # pauses via Step Functions task token
    sla_seconds: 43200
    next: provision-account
  - name: completed
    type: SUCCESS
  - name: abort
    type: FAIL
```
The compiler emits a Step Functions ASL definition where every TASK / APPROVAL step invokes the shared transition-handler Lambda, every DECISION becomes a Choice state keyed on $.decision, and every TASK has a Retry block (MaxAttempts: 3, BackoffRate: 2) plus a Catch that routes to a __RetryEscalation__ state.
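To make the projection concrete, here is a minimal sketch of how a single TASK or APPROVAL step might be turned into an ASL `Task` state. The `MaxAttempts: 3`, `BackoffRate: 2` and `__RetryEscalation__` values come from the description above; everything else is an assumption for illustration, including the function name `compileInvokeState`, the payload field names, `IntervalSeconds: 2`, the `States.ALL` matcher, and the use of the `lambda:invoke.waitForTaskToken` integration for approvals.

```typescript
// Hypothetical input shape: a step that moves linearly to `next`.
interface LinearStep {
  name: string;
  type: "TASK" | "APPROVAL";
  next: string;
}

type AslState = Record<string, unknown>;

function compileInvokeState(step: LinearStep, handlerArn: string): AslState {
  const isApproval = step.type === "APPROVAL";

  // Payload handed to the shared transition-handler Lambda (field names assumed).
  const payload: Record<string, unknown> = {
    step_name: step.name,
    "input.$": "$",
  };
  if (isApproval) {
    // The context object exposes the task token for callback integrations.
    payload["task_token.$"] = "$$.Task.Token";
  }

  const state: AslState = {
    Type: "Task",
    Resource: isApproval
      ? "arn:aws:states:::lambda:invoke.waitForTaskToken"
      : "arn:aws:states:::lambda:invoke",
    Parameters: { FunctionName: handlerArn, Payload: payload },
    Next: step.next,
  };

  if (!isApproval) {
    // Retry three times with doubling intervals, then hand off to escalation.
    state["Retry"] = [
      { ErrorEquals: ["States.ALL"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2 },
    ];
    state["Catch"] = [
      { ErrorEquals: ["States.ALL"], ResultPath: "$.error", Next: "__RetryEscalation__" },
    ];
  }
  return state;
}
```

Keeping this transform a pure function of the parsed step and a handler ARN is what would let it be unit-tested without any AWS calls, in line with the "(pure)" note on `workflow-compiler.ts`.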
AWS resources provisioned (dev stage, planned)
State store (durable pause/resume)
| Resource | Name | Notes |
|---|---|---|
| `aws.dynamodb.Table` | `bank-workflow-state-dev` | PK pk=workflow_id, SK sk. GSI by-type-status. TTL attribute ttl. PITR on. Stream NEW_AND_OLD_IMAGES. SSE enabled |
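As an illustration of how a transition record could map onto this key schema, the sketch below builds one item. Only `pk` (the workflow id), `sk` and `ttl` come from the table above; the `STEP#<iso-time>#<step>` sort-key layout, the `step_result` attribute, the retention parameter and the function name `buildTransitionItem` are assumptions made for the example.

```typescript
interface TransitionItem {
  pk: string;          // workflow_id (partition key)
  sk: string;          // sort key; this layout is hypothetical
  step_result: string; // assumed attribute name
  ttl: number;         // DynamoDB TTL expects epoch seconds, not milliseconds
}

function buildTransitionItem(
  workflowId: string,
  stepName: string,
  stepResult: string,
  nowMs: number,
  retentionDays: number,
): TransitionItem {
  const recordedAt = new Date(nowMs).toISOString();
  return {
    pk: workflowId,
    sk: `STEP#${recordedAt}#${stepName}`,
    step_result: stepResult,
    // Convert to whole seconds before adding the retention window.
    ttl: Math.floor(nowMs / 1000) + retentionDays * 86400,
  };
}
```

Passing the clock in as `nowMs` keeps the function deterministic, which makes the TTL arithmetic easy to check in a unit test.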
Queues (FR-295 + FR-296)
| Resource | Name | Retention | Redrive |
|---|---|---|---|
| Approval queue | `bank-workflow-approval-dev` | 14 d | maxReceiveCount: 5 → approval DLQ |
| Approval DLQ | `bank-workflow-approval-dlq-dev` | 14 d | n/a |
| Retry queue | `bank-workflow-retry-dev` | 4 d | maxReceiveCount: 3 → retry DLQ |
| Retry DLQ | `bank-workflow-retry-dlq-dev` | 14 d | → escalation-notifier Lambda |
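The retry-processor is described as applying backoff before re-emitting a message. The document does not give its schedule, so the following is only a sketch of one plausible exponential backoff: the function name `backoffDelaySeconds` and the 2-second base are assumptions, while the 900-second cap reflects the SQS limit on a per-message `DelaySeconds`.

```typescript
// SQS caps the per-message DelaySeconds at 15 minutes.
const SQS_MAX_DELAY_SECONDS = 900;

// attempt is 1-based: attempt 1 waits baseSeconds, each later attempt multiplies by rate.
function backoffDelaySeconds(attempt: number, baseSeconds = 2, rate = 2): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError(`attempt must be a positive integer, got ${attempt}`);
  }
  const raw = baseSeconds * Math.pow(rate, attempt - 1);
  return Math.min(Math.ceil(raw), SQS_MAX_DELAY_SECONDS);
}
```

With the assumed defaults, attempts 1 to 3 wait 2, 4 and 8 seconds, which mirrors how Step Functions applies `BackoffRate` to `IntervalSeconds` on successive retries.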
Lambdas
| Name | Purpose | Timeout |
|---|---|---|
| `bank-workflow-workflow-starter-dev` | Starts a new workflow execution | 10 s |
| `bank-workflow-transition-handler-dev` | Step task: record + publish + gate | 30 s |
| `bank-workflow-approval-recorder-dev` | EventBridge → `SendTaskSuccess` / `SendTaskFailure` | 15 s |
| `bank-workflow-retry-processor-dev` | SQS retry queue consumer | 60 s |
| `bank-workflow-escalation-notifier-dev` | Retry DLQ consumer → SNS alerts | 10 s |
Each Lambda writes to its own CloudWatch log group with 90-day retention (observability standard) and has X-Ray tracing enabled (ADR-031). All five share the role bank-workflow-lambdas-dev, which grants scoped access to DynamoDB, the two SQS queues, the alerts SNS topic, Step Functions token callbacks, and events:PutEvents on the 8 domain buses.
State machines
One Step Functions state machine per YAML workflow, named bank-workflow-<id>-<env>. STANDARD type (long duration), ALL-level logging to /aws/vendedlogs/states/…, X-Ray tracing enabled, and a workflow_version tag stamped from the YAML version field for traceability (FR-740 workflow versioning is deferred — the tag is informational only; no in-flight version pinning is implemented yet).
Inbound event wiring
| Resource | Value |
|---|---|
| `aws.cloudwatch.EventRule` | `bank-workflow-approval-recorded-dev`: matches detail-type approval_recorded / workflow_approval_recorded on any bank.* source on the bank-platform bus |
| Target | approval-recorder Lambda, RetryPolicy.MaximumRetryAttempts=3, DLQ = MOD-104 platform DLQ |
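The approval-recorder's job of turning an approver decision into a token callback can be sketched as a pure mapping. The input field names (`task_token`, `decision`, `approver`), the `APPROVED` / `REJECTED` values, the error code and the function name `toTokenCallback` are all assumptions; the document does not specify the event detail schema. The output shapes follow the Step Functions `SendTaskSuccess` (`taskToken`, `output`) and `SendTaskFailure` (`taskToken`, `error`, `cause`) request parameters.

```typescript
// Hypothetical detail payload of an approval_recorded event.
interface ApprovalDetail {
  task_token: string;
  decision: "APPROVED" | "REJECTED";
  approver: string;
}

type TokenCallback =
  | { kind: "SendTaskSuccess"; input: { taskToken: string; output: string } }
  | { kind: "SendTaskFailure"; input: { taskToken: string; error: string; cause: string } };

function toTokenCallback(detail: ApprovalDetail): TokenCallback {
  if (detail.decision === "APPROVED") {
    return {
      kind: "SendTaskSuccess",
      input: {
        taskToken: detail.task_token,
        // `output` must be a JSON string, not an object.
        output: JSON.stringify({ decision: detail.decision, approver: detail.approver }),
      },
    };
  }
  return {
    kind: "SendTaskFailure",
    input: {
      taskToken: detail.task_token,
      error: "ApprovalRejected", // assumed error code
      cause: `Rejected by ${detail.approver}`,
    },
  };
}
```

A handler could then switch on `kind` and pass `input` to the matching Step Functions client call. Whether a rejection should fail the task or succeed with a `REJECTED` result is a design choice this sketch does not settle.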
Outbound events (FR-294) are published directly by the transition-handler Lambda via events:PutEvents — no rule is required on the publish side.
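A sketch of the entry the transition-handler might pass to `PutEvents` is shown below. The four detail fields are the ones FR-294 lists; splitting `bank.platform.workflow_step_completed` into a `bank.platform` source and a `workflow_step_completed` detail-type is an assumption, as are the function name `buildStepCompletedEntry` and the way the bus name is supplied. The entry field names (`EventBusName`, `Source`, `DetailType`, `Detail`) match the EventBridge `PutEvents` request entry.

```typescript
// Detail fields required by FR-294.
interface StepCompletedDetail {
  workflow_id: string;
  step_name: string;
  step_result: string;
  actor: string;
}

// Subset of the EventBridge PutEvents request entry used here.
interface PutEventsEntry {
  EventBusName: string;
  Source: string;
  DetailType: string;
  Detail: string;
}

function buildStepCompletedEntry(busName: string, detail: StepCompletedDetail): PutEventsEntry {
  return {
    EventBusName: busName,                 // bus of the workflow's owning_domain
    Source: "bank.platform",               // assumed source/detail-type split
    DetailType: "workflow_step_completed",
    Detail: JSON.stringify(detail),        // Detail must be a JSON string
  };
}
```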
SSM outputs table (consumer contract)
All parameters live under `/bank/{env}/mod062/...`.
| SSM path | Value | Consumed by |
|---|---|---|
| `/bank/{env}/mod062/state-table/name` | DynamoDB table name | Modules that query workflow state |
| `/bank/{env}/mod062/state-table/arn` | Table ARN | IAM policies in downstream modules |
| `/bank/{env}/mod062/approval-queue/arn` | Approval queue ARN | Staff app / ops console (ticket producer) |
| `/bank/{env}/mod062/approval-queue/url` | Approval queue URL | Staff app consumer |
| `/bank/{env}/mod062/approval-dlq/arn` | Approval DLQ ARN | MOD-076 alarms |
| `/bank/{env}/mod062/retry-queue/arn` | Retry queue ARN | Producers outside Step Functions (rare) |
| `/bank/{env}/mod062/retry-queue/url` | Retry queue URL | Ops tooling |
| `/bank/{env}/mod062/retry-dlq/arn` | Retry DLQ ARN | MOD-076 alarms |
| `/bank/{env}/mod062/lambda/workflow-starter/arn` | Lambda ARN | MOD-075 API gateway target |
| `/bank/{env}/mod062/lambda/transition-handler/arn` | Lambda ARN | State machines, ops |
| `/bank/{env}/mod062/lambda/approval-recorder/arn` | Lambda ARN | Event rule target, ops |
| `/bank/{env}/mod062/lambda/retry-processor/arn` | Lambda ARN | Ops |
| `/bank/{env}/mod062/lambda/escalation-notifier/arn` | Lambda ARN | Ops |
| `/bank/{env}/mod062/state-machine/{workflow-id}/arn` | State machine ARN | Workflow-starter callers; versioning audit |
| `/bank/{env}/mod062/rule/approval-recorded/arn` | EventBridge rule ARN | MOD-076 dashboards |
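Consumers that resolve these parameters may find a small path helper useful to avoid hand-built strings. The sketch below is illustrative only; `mod062Param` is a hypothetical name and not part of the module's published API.

```typescript
// Builds a parameter path under /bank/{env}/mod062/ from its trailing segments.
function mod062Param(env: string, ...segments: string[]): string {
  if (segments.length === 0 || segments.some((s) => s.length === 0 || s.includes("/"))) {
    throw new Error("each path segment must be non-empty and contain no '/'");
  }
  return ["", "bank", env, "mod062", ...segments].join("/");
}
```

For example, `mod062Param("dev", "state-machine", "account-opening", "arn")` yields `/bank/dev/mod062/state-machine/account-opening/arn`, matching the per-workflow row in the table above.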
Acceptance criteria status (local run, 2026-04-19)
pnpm test (from MOD-062-workflow-orchestration/):
| FR / Gate | Tests | Pass | Fail | Status |
|---|---|---|---|---|
| FR-293 — documented step sequence + decision gates + SLAs | 4 | 4 | 0 | PASS |
| FR-294 — transitions publish events to owning domain bus | 2 | 2 | 0 | PASS |
| FR-295 — approval pause + durable resume | 3 | 3 | 0 | PASS |
| FR-296 — 3 retries then escalate | 3 | 3 | 0 | PASS |
| Unit: workflow-compiler | 8 | 8 | 0 | PASS |
| Unit: retry-backoff | 6 | 6 | 0 | PASS |
| Unit: structured-log | 4 | 4 | 0 | PASS |
| Unit: idempotency | 2 | 2 | 0 | PASS |
| Total | 32 | 32 | 0 | 100% |
Lambda module quality gates (methodology.md §Quality gates): unit ≥80% coverage, one test per FR, structured-log format test, idempotency test — all satisfied.
Operational notes
- Deploy: `AWS_PROFILE=bank-dev pnpm -F @bank-platform/mod-062-workflow-orchestration run deploy --stage <env>`. The `deploy` script runs `pnpm build:lambda` (esbuild + workflow YAML copy) before `sst deploy`.
- Remove: `AWS_PROFILE=bank-dev pnpm -F @bank-platform/mod-062-workflow-orchestration run remove --stage <env>`
- Cost envelope: DynamoDB PAY_PER_REQUEST + 5 Lambdas + 4 SQS queues + 2 state machines ≈ $5–10/month at dev volumes.
Stubs / deferred work
| Stub | Owner | Replace with |
|---|---|---|
| MOD-075 API Gateway → workflow-starter | MOD-075 | HTTP API integration once Phase 3 lands |
| FR-740 workflow versioning at the execution layer | MOD-062 Phase 2 (deferred) | Today the version tag is stamped on the state machine but in-flight executions always resolve to the currently-aliased state machine. Version pinning at execution start (the FR-740 scope) is a follow-up. |
| Visibility timeout tuning for the approval queue | Operations | 12 h is a safe default; real SLA bands will need profiling |
Related artefacts
- Wiki spec: `bank-wiki/source/entities/modules/MOD-062.{yaml,md}`
- Handoff: `docs/handoffs/MOD-062-complete.handoff.md`
- Methodology: https://bank-wiki.pages.dev/delivery/methodology/
- Event catalogue: https://bank-wiki.pages.dev/design/system/event-catalogue/ (see `bank.platform.workflow_step_completed`)
- Error handling: https://bank-wiki.pages.dev/design/system/error-handling-standard/
- ADRs in effect: ADR-023, ADR-025, ADR-029, ADR-031