ADR-061: COB partitioning for portfolio-scale batch processing
| Status | Accepted |
| Date | 2026-05-09 |
| Deciders | CTO, Head of Architecture, Head of Platform Engineering |
| Affects repos | bank-core |
Status: Accepted — 2026-05-09
Context
MOD-005 (daily accrual calculator) is currently implemented as a single-Lambda nightly job. It queries all active interest-bearing accounts, calculates the day's accrual for each, and posts accrual entries via MOD-001. This is the correct v1 implementation — simple, auditable, and sufficient for the early account book.
AWS Lambda has a hard execution timeout of 15 minutes. At an assumed rate of ~1,000 accounts per second (conservative estimate for a Neon Postgres query + calculation + MOD-001 posting per account), a single-Lambda COB pass can process approximately 900,000 accounts before hitting the timeout. That is a large but not infinite ceiling. More practically, a single-threaded sequential pass over a growing loan and savings book creates increasing end-to-end completion risk: if the run is slow on one night, the next night's run starts at risk of overrunning the accrual window, creating a compounding problem.
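The ceiling quoted above is simple arithmetic. A minimal sketch reproduces it so the estimate can be re-run with a measured rate. The function names are illustrative, and the 1,000 accounts per second figure is the assumed rate from the text, not a measurement.

```python
LAMBDA_TIMEOUT_SECONDS = 15 * 60      # AWS Lambda hard execution limit
ASSUMED_ACCOUNTS_PER_SECOND = 1_000   # assumed rate from the text, not measured


def single_lambda_ceiling(rate_per_second: int,
                          timeout_seconds: int = LAMBDA_TIMEOUT_SECONDS) -> int:
    """Largest account count one sequential pass can finish before the timeout."""
    return rate_per_second * timeout_seconds


def run_duration_seconds(account_count: int, rate_per_second: int) -> float:
    """Wall-clock seconds for one sequential pass over account_count accounts."""
    return account_count / rate_per_second


if __name__ == "__main__":
    print(single_lambda_ceiling(ASSUMED_ACCOUNTS_PER_SECOND))  # 900000
```

If the real per-account rate turns out to be lower than assumed, the ceiling drops in direct proportion.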
Apache Fineract's Close-of-Business (COB) engine was studied as a design reference (see fineract-design-reference.md). Fineract partitions the loan portfolio across configurable segments and executes each segment in a separate Spring Batch job thread, with a coordinator collecting results. The pattern is well-understood and has been proven across portfolios of millions of accounts in production. The same partitioning principle applies to the wiki's Lambda/Step Functions architecture.
Two approaches were evaluated:
(a) Single-Lambda with chunked processing — continue the current shape but add chunked pagination. Extend Lambda timeout if needed (max 15 min). Reduces risk within the single-Lambda constraint but does not eliminate the fundamental ceiling.
(b) Step Functions parallel partitioning — a coordinator Lambda fans out the account population across N parallel Step Functions branches. Each branch processes one account partition. The coordinator collects completion signals and publishes bank.core.accrual_run_completed when all branches finish.
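As a rough illustration of approach (b), the fan-out could be expressed as an Amazon States Language Map state, built here as a Python dictionary. This is a sketch under assumptions, not the agreed design: the state names, the `$.buckets` input path, the retry values, and the Lambda ARN argument are all hypothetical.

```python
import json


def build_partition_map_state(partition_lambda_arn: str, partition_count: int) -> dict:
    """Return a Map state fragment that runs one branch per hash bucket."""
    return {
        "Type": "Map",
        "ItemsPath": "$.buckets",           # hypothetical input: [0, 1, ..., P-1]
        "MaxConcurrency": partition_count,  # allow every partition to run at once
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "INLINE"},
            "StartAt": "ProcessPartition",
            "States": {
                "ProcessPartition": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {
                        "FunctionName": partition_lambda_arn,
                        "Payload": {"bucket.$": "$"},
                    },
                    # Illustrative retry policy with exponential backoff.
                    "Retry": [{
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 5,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }],
                    "End": True,
                }
            },
        },
        "Next": "Aggregate",  # hypothetical follow-on state, not defined here
    }


if __name__ == "__main__":
    state = build_partition_map_state("arn:aws:lambda:REGION:ACCOUNT:function:partition", 10)
    print(json.dumps(state, indent=2))
```

The fragment would sit inside a larger state machine definition that also defines the aggregation step.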
Decision
1. MOD-005 v2 implements partitioned COB via Step Functions
MOD-005 v2 (a future build, not an immediate CI task) will replace the single-Lambda sequential pass with a Step Functions Map state executing partitions in parallel. The existing v1 single-Lambda implementation remains in place until v2 is tested and promoted.
The partitioned execution shape:
accrual-scheduler (EventBridge cron)
└─ coordinator.handler
   ├─ queries total active account count
   ├─ if count < COB_PARTITION_THRESHOLD → single-partition path (v1 behaviour preserved)
   └─ if count ≥ COB_PARTITION_THRESHOLD → Step Functions partition fan-out
      ├─ partition-1.handler (account_id hash bucket 0-N/P)
      ├─ partition-2.handler (account_id hash bucket N/P - 2N/P)
      ├─ ...
      └─ partition-P.handler (account_id hash bucket (P-1)N/P - N)
         └─ aggregate.handler → publishes accrual_run_completed
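The coordinator's branching in the shape above might look like the following sketch. The `plan_run` function and its return shape are hypothetical. The environment variable names and defaults are the ones this ADR defines.

```python
import os
from typing import Mapping


def plan_run(active_count: int, env: Mapping[str, str] = os.environ) -> dict:
    """Decide between the v1 single-partition path and the partition fan-out."""
    threshold = int(env.get("COB_PARTITION_THRESHOLD", "50000"))
    partitions = int(env.get("COB_PARTITION_COUNT", "10"))

    if active_count < threshold:
        # v1 behaviour preserved: one Lambda processes every account.
        return {"path": "single", "partition_count": 1, "buckets": [0]}

    # At or above the threshold: one branch per hash bucket.
    return {
        "path": "partitioned",
        "partition_count": partitions,
        "buckets": list(range(partitions)),
    }


if __name__ == "__main__":
    print(plan_run(12_000, {}))
    print(plan_run(75_000, {}))
```

Passing the environment as an argument keeps the decision testable without touching the real process environment.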
2. Partition key and threshold
- Partition key: account_id hash bucket (consistent hash of account_id mod P). This distributes accounts evenly across partitions without requiring a sort index and is stable across runs.
- Threshold env var: COB_PARTITION_THRESHOLD (default: 50000). Below this count, the coordinator uses the v1 single-Lambda path. At or above it, partitioning activates automatically with no configuration change required.
- Partition count env var: COB_PARTITION_COUNT (default: 10). Each partition Lambda is allocated the same reserved concurrency budget.
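A minimal sketch of the bucket assignment, assuming SHA-256 as the hash (this ADR does not name a hash function) and string account identifiers. Python's built-in `hash()` is unsuitable here because it is randomised per process for strings, so it would not be stable across runs.

```python
import hashlib


def partition_bucket(account_id: str, partition_count: int) -> int:
    """Map an account to a bucket in [0, partition_count) deterministically."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce mod P.
    return int.from_bytes(digest[:8], "big") % partition_count


if __name__ == "__main__":
    for account_id in ("ACC-0001", "ACC-0002", "ACC-0003"):
        print(account_id, partition_bucket(account_id, 10))
```

An account keeps its bucket only while the partition count is unchanged. Changing COB_PARTITION_COUNT between runs reassigns accounts, which is harmless for correctness as long as posting stays idempotent.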
3. Failure handling and idempotency
Each partition posts accruals via MOD-001 using an idempotency key derived from (accrual_date, account_id). If a partition Lambda fails and is retried (Step Functions built-in retry with exponential backoff), MOD-001 deduplicates on the idempotency key and the partition completes cleanly. The coordinator detects partition failure via Step Functions error state and:
- Raises an SNS alarm for the ops team
- Does NOT publish accrual_run_completed unless every partition succeeds, whether on the first attempt or on a retry
- On total failure, publishes bank.core.accrual_run_failed with the partition error summary for incident response
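The retry behaviour above can be illustrated with an in-memory stand-in for MOD-001. Everything here is hypothetical scaffolding: the `InMemoryLedger` class, the `partition_errors` payload field, and the reading of "total failure" as "at least one partition still failing once retries are exhausted" are assumptions, not the MOD-001 API.

```python
from decimal import Decimal


class InMemoryLedger:
    """Stand-in for MOD-001: keeps the first posting per idempotency key."""

    def __init__(self) -> None:
        self.entries: dict = {}

    def post_accrual(self, accrual_date: str, account_id: str, amount: Decimal) -> bool:
        key = (accrual_date, account_id)  # idempotency key from the ADR
        if key in self.entries:
            return False                  # duplicate: already posted, ignore
        self.entries[key] = amount
        return True


def run_partition(ledger: InMemoryLedger, accrual_date: str, accruals: dict) -> int:
    """Post every accrual in the partition; return how many were newly posted."""
    return sum(
        ledger.post_accrual(accrual_date, account_id, amount)
        for account_id, amount in accruals.items()
    )


def final_event(partition_errors: dict) -> dict:
    """Choose the event to publish once all partitions have finished or given up."""
    if not partition_errors:
        return {"type": "bank.core.accrual_run_completed"}
    return {"type": "bank.core.accrual_run_failed",
            "partition_errors": partition_errors}
```

A retried partition can therefore replay its whole account list: postings that landed before the failure are skipped and only the missing ones are written.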
4. Monitoring
The coordinator records the partition count, per-partition duration, and total duration to CloudWatch. The existing accrual_run_completed event gains two new fields in v2: partition_count (integer, 1 for single-path) and slowest_partition_ms (integer). The alarm threshold on total COB duration remains: completion by 06:00 local time in the respective jurisdiction.
5. v1 compatibility
The v1 bank.core.accrual_run_completed event schema is extended additively in v2. Consumers of the v1 schema receive the same fields they rely on. The new partition_count and slowest_partition_ms fields are optional in the schema. No consumer migration is required.
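A sketch of the additive extension, assuming the event payload is a flat dictionary. The `accrual_date` field used in the usage example is a placeholder, since this ADR does not list the v1 fields. Only `partition_count` and `slowest_partition_ms` come from the text.

```python
from typing import List, Optional, Tuple


def extend_to_v2(v1_payload: dict, partition_durations_ms: List[int]) -> dict:
    """Return a copy of the v1 payload with the two optional v2 fields added."""
    event = dict(v1_payload)  # leave the caller's v1 payload untouched
    event["partition_count"] = len(partition_durations_ms)  # 1 on the single path
    event["slowest_partition_ms"] = max(partition_durations_ms)
    return event


def read_partition_metrics(event: dict) -> Tuple[Optional[int], Optional[int]]:
    """Read the v2 fields, tolerating v1 events that do not carry them."""
    return event.get("partition_count"), event.get("slowest_partition_ms")


if __name__ == "__main__":
    v1_event = {"accrual_date": "2026-05-09"}  # placeholder v1 field
    print(extend_to_v2(v1_event, [1200, 3400, 2100]))
    print(read_partition_metrics(v1_event))    # (None, None)
```

A consumer that reads only the keys it already knows is unaffected by the extra fields, which is what makes the change additive.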
Consequences
- MOD-005 is flagged for a v2 partition build. The v2 build is triggered when the active account count in any environment exceeds COB_PARTITION_THRESHOLD for three consecutive nightly runs; an automated CloudWatch alarm notifies the platform team.
- The bank.core.accrual_run_completed event schema is versioned to "2" in the event catalogue when MOD-005 v2 ships. v1 consumers are unaffected by the additive fields.
- No immediate action required on MOD-001: its idempotency key design already supports parallel posting from multiple partition Lambdas.
- This pattern applies to other portfolio-scale batch jobs beyond accruals (e.g., MOD-008 dormancy assessment, MOD-031 ECL recalculation) and should be adopted consistently when those modules encounter the same scale threshold.
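The build trigger in the first consequence above reduces to a check over recent nightly counts. The sketch below assumes the counts are available as a chronological list and that "exceeds" means strictly greater than the threshold. The function name is hypothetical, and in practice the condition would be evaluated by the CloudWatch alarm rather than by application code.

```python
from typing import List


def v2_build_triggered(nightly_counts: List[int],
                       threshold: int = 50_000,
                       required_runs: int = 3) -> bool:
    """True when the most recent required_runs counts all exceed the threshold."""
    if len(nightly_counts) < required_runs:
        return False
    recent = nightly_counts[-required_runs:]
    return all(count > threshold for count in recent)


if __name__ == "__main__":
    print(v2_build_triggered([48_000, 51_000, 52_500, 53_100]))  # True
    print(v2_build_triggered([51_000, 49_000, 52_500, 53_100]))  # False
```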
Compiled 2026-05-22 from source/entities/adrs/ADR-061.yaml