Technical design — MOD-097 Usage event collector¶
Module: MOD-097 — Usage event collector
System: SD07 — Data Platform & Governance Infrastructure
Repo: bank-platform
FR scope: FR-389, FR-390, FR-391, FR-392
Policies satisfied: REP-001 (LOG)
Author: AI agent (Claude Opus 4.7)
Date: 2026-04-19
Stage covered: dev (account 647751526084, region ap-southeast-2)
SST permalink: pending first deploy
Objective¶
MOD-097 is the platform's usage-event pipeline. Every system domain emits a
bank.platform.usage_event onto the MOD-104-provisioned bank-platform
EventBridge bus whenever a billable action occurs (login, API call, ML
inference, Snowflake query, enrichment call, notification send, document
store, CDC record, decision publication, payment initiated, KYC check).
MOD-097 adds:
- A canonical JSON Schema (draft-04) for
bank.platform.usage_event, registered in MOD-043'sbank-events-{env}schema registry — FR-390. - A normaliser Lambda subscribed to the
bank-platformbus filtered ondetail-type = "usage_event"that normalises each record and PUT_RECORDs to MOD-104'sbank-usage-eventsFirehose — FR-389 / FR-391. - S3 bucket policy on the MOD-104 iceberg bucket denying
DeleteObject/DeleteObjectVersion/BypassGovernanceRetentionon theusage/andusage-agg/prefixes — REP-001 LOG immutability. - A daily aggregator Lambda triggered by EventBridge Scheduler
(cron 01:00 UTC) that runs Athena DDL + UNLOAD queries over the
iceberg
usage/prefix and writes a partitioned Parquet aggregate tousage-agg/date=<YYYY-MM-DD>/— FR-392. - Health alarms on normaliser error rate, normaliser DLQ depth, and Firehose delivery freshness — FR-391 no-loss signal.
- A client-lib helper (
src/lambdas/client-lib/) that other module Lambdas import to emit schema-conformant usage events with a singleemitUsageEvent(...)call.
Execution model¶
| Aspect | Decision |
|---|---|
| IaC tool | SST v3 Ion + raw @pulumi/aws resources (ADR-025) |
| Lambda runtime | Node.js 20 on arm64; normaliser 256 MB / 10s, aggregator 512 MB / 300s |
| Lambda packaging | Local esbuild bundle (CJS, Node-20 target), uploaded via pulumi.asset.FileArchive — same approach as MOD-043 |
| Schema format | JSON Schema draft-04 (matches schema-registry.md) |
| Firehose | Reuses MOD-104's bank-usage-events stream (300s / 128MB buffer, GZIP, iceberg bucket usage/ prefix) |
| Aggregator query | Athena UNLOAD → Parquet/SNAPPY, pay-per-scan |
| Region | ap-southeast-2 |
| Tagging | Provider defaultTags: tenant_id, module_id=MOD-097, environment, system_id=SD07, cost_center=sd07-bank-platform, managed_by=sst |
Stack layout¶
MOD-097-usage-event-collector/
├── sst.config.ts
├── src/
│ ├── stacks/
│ │ ├── usage-schema.ts — registers bank.platform.usage_event schema in MOD-043's registry
│ │ ├── normaliser-lambda.ts — Lambda + role + log group + DLQ + EventBridge rule/target
│ │ ├── daily-aggregator.ts — Lambda + Athena workgroup + Glue DB + Scheduler + results bucket
│ │ ├── s3-immutability.ts — iceberg bucket policy Deny on usage/* + usage-agg/*
│ │ └── health-alarms.ts — normaliser errors / DLQ depth / firehose freshness alarms + SNS topic
│ ├── lambdas/
│ │ ├── normaliser/ — handler: parse → validate → normalise → PutRecord to Firehose
│ │ ├── aggregator/ — handler: Athena DDL + UNLOAD + poll
│ │ └── client-lib/ — emitUsageEvent(...) helper for other modules
│ └── outputs.ts — SSM parameters under /bank/{env}/mod097/...
├── schemas/
│ └── bank.platform.usage_event.json
├── scripts/build-lambdas.mjs
└── __tests__/{unit,integration}
Data flow¶
Any module Lambda
└─ (client-lib) emitUsageEvent() ← FR-389 (≤1s; typical 10-50ms)
└─ EventBridge PutEvents → bus "bank-platform"
├─ MOD-043 governance rule → delivery-logger (validates against registry; FR-390 enforcement)
└─ MOD-097 normaliser rule (detail-type = "usage_event") → normaliser Lambda
└─ Firehose PutRecord → bank-usage-events
└─ iceberg s3://bank-iceberg-{env}/usage/year=../month=../day=../ (FR-391 ≤5min)
├─ REP-001 bucket policy blocks DeleteObject/DeleteBucket
└─ MOD-097 daily aggregator (cron 01:00 UTC)
└─ Athena UNLOAD → s3://bank-iceberg-{env}/usage-agg/date=../ (FR-392)
└─ MOD-099 (downstream) consumes the daily agg for billing
FR / policy coverage¶
| Requirement | How satisfied | Test |
|---|---|---|
| FR-389 (≤1s emission to bus) | PutEvents from emitUsageEvent(...) — single API call |
fr-389-emission-latency.test.ts |
| FR-390 (schema validated) | Schema registered in MOD-043 registry; MOD-043 delivery-logger runs the authoritative validation; MOD-097 normaliser runs shallow validation + DLQs invalid records | fr-390-schema-validation.test.ts, unit normaliser.test.ts |
| FR-391 (≤5min to iceberg; no loss) | Firehose 300s buffer; EventBridge retry(5) + DLQ; normaliser DLQ; alarms on DLQ depth + firehose freshness | fr-391-no-loss.test.ts |
| FR-392 (daily aggregation) | Scheduler-triggered Athena UNLOAD produces usage-agg/date=<d>/*.parquet grouped by tenant/event_type |
fr-392-daily-aggregation.test.ts, unit aggregator-query.test.ts |
| REP-001 LOG (immutability) | Iceberg bucket policy Deny on DeleteObject/DeleteObjectVersion/DeleteBucket for usage/* + usage-agg/* |
pol-rep-001.test.ts |
SSM outputs published¶
All under /bank/{env}/mod097/...:
| Path | Purpose |
|---|---|
usage-schema/name, usage-schema/arn |
Schema registered in MOD-043 registry |
normaliser/arn, normaliser/name |
Lambda identity |
normaliser-dlq/arn, normaliser-dlq/url |
DLQ for async failures |
normaliser-rule/arn |
EventBridge rule on bank-platform bus |
aggregator/arn, aggregator/name |
Aggregator Lambda |
aggregator-schedule/arn |
Daily scheduler |
athena-workgroup/name |
Workgroup with result location |
glue-database/name |
bank_usage_{env} |
athena-results/name |
Results bucket (7d lifecycle) |
alerts-topic/arn |
Health-alarm SNS topic |
alarm/normaliser-errors/arn, alarm/normaliser-dlq/arn, alarm/firehose-freshness/arn |
Alarms |
Cross-module dependencies¶
| Input | Source | Consumed via |
|---|---|---|
bank-platform EventBridge bus ARN |
MOD-104 | mod104.eventbridge.busArn(env, "bank-platform") |
bank-platform DLQ ARN |
MOD-104 | mod104.eventbridge.dlqArn(env, "bank-platform") |
bank-usage-events Firehose ARN |
MOD-104 | mod104.firehose.usageArn(env) |
| iceberg bucket ARN + name | MOD-104 | mod104.s3.icebergArn(env), mod104.s3.name(env, "iceberg") |
| alerts SNS topic | MOD-104 | mod104.sns.alertsArn(env) (reference only) |
bank-events-{env} schema registry |
MOD-043 | hard-coded name (matches schema-registry.md convention) |
Design choices that required judgement¶
-
Schema registration in MOD-043's registry (not a new registry).
schema-registry.mdexplicitly says there is one platform-wide registry. MOD-097 adds its schema into that shared registry so MOD-043's delivery-logger can pick it up automatically on thesource.detail-typename match. -
Immutability via bucket policy, not Object Lock. MOD-104 provisions the iceberg bucket without Object Lock because Iceberg table rewrites and the 7-year Glacier lifecycle transition both require PutObject on existing keys. Adding Object Lock on the bucket would break MOD-104's storage contract. The bucket-policy Deny approach is prefix-scoped (
usage/*+usage-agg/*only) and still satisfies REP-001 LOG immutability obligations for this module's records without affecting other iceberg tenants (e.g. CDC data undercdc/*). The design doc notes this and flags the trade-off: BypassGovernanceRetention is explicitly denied, so a future attempt to add Object Lock on this prefix can coexist with the bucket policy. -
Normaliser shallow validation + DLQ instead of silent drop. MOD-043's delivery-logger is the authoritative JSON-Schema validator. The normaliser does a narrower shallow check (required fields present, enum known, quantity non-negative) and throws on failure so the Lambda retries and DLQs — honouring FR-391 "no event loss" even for malformed payloads. The only records that don't make it to iceberg are the ones the schema registry AND the normaliser both reject, and those are preserved in the DLQ for 14 days.
-
Athena UNLOAD + partition projection instead of Glue Crawler. Glue Crawlers are both a cost drag (crawl per run) and eventually-consistent (the day's partition may not exist at 01:00). Partition projection on year/month/day fields lets Athena query new partitions immediately without a crawl. UNLOAD to
usage-agg/date=<d>/produces Parquet/SNAPPY files that Snowflake can ingest directly (MOD-099 path). -
Aggregator invokable with a
dateoverride. The scheduler runs at 01:00 UTC for "yesterday", but tests need to seed events now and run aggregation against the same day immediately.event.date = "YYYY-MM-DD"gives us that without special-casing prod behaviour. -
Shared
client-libsource undersrc/lambdas/client-lib/. Consumed by future modules viaworkspace:*import once they're packaged (or copy-paste today — keeps the client dependency surface free of external registry pulls). Not bundled by this module's SST deploy — it's a library, not a deployable Lambda.
Wiki corrections recommended¶
- Add
bank.platform.usage_eventtobank-wiki/source/pages/design/system/event-catalogue.md. The catalogue currently lists only 2bank.platform.*events (notification_sent+ the unpublishedsystem_decision_logged). The canonical shape is duplicated inMOD-097.mdbut not on the catalogue. Suggested insert under SD07 section using the schema inMOD-097-usage-event-collector/schemas/bank.platform.usage_event.jsonas the binding reference.
Build gotchas worth institutional memory¶
- EventBridge Scheduler requires a dedicated IAM role with
scheduler.amazonaws.comprincipal. EventBridge Rules useevents.amazonaws.com; the Scheduler is a distinct service. - Athena UNLOAD with
PARTITIONED BYat the CTAS level does not work inside WITH clauses. UNLOAD is a separate statement type — you write to a single output location per run and partition outside via the folder structure (date=<d>/). - Pulumi
aws.s3.BucketPolicyis singular per bucket. MOD-097 has to merge the MOD-104 baseline policy statements (DenyInsecureTransport etc.) into the MOD-097 policy rather than creating a second BucketPolicy — the second would silently replace the first on apply. - Firehose
DeliveryStreamName(not ARN) for PutRecord. The normaliser stores the name (bank-usage-events) in an env var, not the ARN.