Skip to content

Technical design — MOD-097 Usage event collector

Module: MOD-097 — Usage event collector System: SD07 — Data Platform & Governance Infrastructure Repo: bank-platform FR scope: FR-389, FR-390, FR-391, FR-392 Policies satisfied: REP-001 (LOG) Author: AI agent (Claude Opus 4.7) Date: 2026-04-19 Stage covered: dev (account 647751526084, region ap-southeast-2) SST permalink: pending first deploy


Objective

MOD-097 is the platform's usage-event pipeline. Every system domain emits a bank.platform.usage_event onto the MOD-104-provisioned bank-platform EventBridge bus whenever a billable action occurs (login, API call, ML inference, Snowflake query, enrichment call, notification send, document store, CDC record, decision publication, payment initiated, KYC check). MOD-097 adds:

  1. A canonical JSON Schema (draft-04) for bank.platform.usage_event, registered in MOD-043's bank-events-{env} schema registry — FR-390.
  2. A normaliser Lambda subscribed to the bank-platform bus filtered on detail-type = "usage_event" that normalises each record and PUT_RECORDs to MOD-104's bank-usage-events Firehose — FR-389 / FR-391.
  3. S3 bucket policy on the MOD-104 iceberg bucket denying DeleteObject / DeleteObjectVersion / BypassGovernanceRetention on the usage/ and usage-agg/ prefixes — REP-001 LOG immutability.
  4. A daily aggregator Lambda triggered by EventBridge Scheduler (cron 01:00 UTC) that runs Athena DDL + UNLOAD queries over the iceberg usage/ prefix and writes a partitioned Parquet aggregate to usage-agg/date=<YYYY-MM-DD>/ — FR-392.
  5. Health alarms on normaliser error rate, normaliser DLQ depth, and Firehose delivery freshness — FR-391 no-loss signal.
  6. A client-lib helper (src/lambdas/client-lib/) that other module Lambdas import to emit schema-conformant usage events with a single emitUsageEvent(...) call.

Execution model

Aspect Decision
IaC tool SST v3 Ion + raw @pulumi/aws resources (ADR-025)
Lambda runtime Node.js 20 on arm64; normaliser 256 MB / 10s, aggregator 512 MB / 300s
Lambda packaging Local esbuild bundle (CJS, Node-20 target), uploaded via pulumi.asset.FileArchive — same approach as MOD-043
Schema format JSON Schema draft-04 (matches schema-registry.md)
Firehose Reuses MOD-104's bank-usage-events stream (300s / 128MB buffer, GZIP, iceberg bucket usage/ prefix)
Aggregator query Athena UNLOAD → Parquet/SNAPPY, pay-per-scan
Region ap-southeast-2
Tagging Provider defaultTags: tenant_id, module_id=MOD-097, environment, system_id=SD07, cost_center=sd07-bank-platform, managed_by=sst

Stack layout

MOD-097-usage-event-collector/
├── sst.config.ts
├── src/
│   ├── stacks/
│   │   ├── usage-schema.ts         — registers bank.platform.usage_event schema in MOD-043's registry
│   │   ├── normaliser-lambda.ts    — Lambda + role + log group + DLQ + EventBridge rule/target
│   │   ├── daily-aggregator.ts     — Lambda + Athena workgroup + Glue DB + Scheduler + results bucket
│   │   ├── s3-immutability.ts      — iceberg bucket policy Deny on usage/* + usage-agg/*
│   │   └── health-alarms.ts        — normaliser errors / DLQ depth / firehose freshness alarms + SNS topic
│   ├── lambdas/
│   │   ├── normaliser/             — handler: parse → validate → normalise → PutRecord to Firehose
│   │   ├── aggregator/             — handler: Athena DDL + UNLOAD + poll
│   │   └── client-lib/             — emitUsageEvent(...) helper for other modules
│   └── outputs.ts                  — SSM parameters under /bank/{env}/mod097/...
├── schemas/
│   └── bank.platform.usage_event.json
├── scripts/build-lambdas.mjs
└── __tests__/{unit,integration}

Data flow

Any module Lambda
  └─ (client-lib) emitUsageEvent()  ← FR-389 (≤1s; typical 10-50ms)
        └─ EventBridge PutEvents → bus "bank-platform"
              ├─ MOD-043 governance rule → delivery-logger (validates against registry; FR-390 enforcement)
              └─ MOD-097 normaliser rule (detail-type = "usage_event") → normaliser Lambda
                    └─ Firehose PutRecord → bank-usage-events
                          └─ iceberg s3://bank-iceberg-{env}/usage/year=../month=../day=../  (FR-391 ≤5min)
                                ├─ REP-001 bucket policy blocks DeleteObject/DeleteBucket
                                └─ MOD-097 daily aggregator (cron 01:00 UTC)
                                      └─ Athena UNLOAD → s3://bank-iceberg-{env}/usage-agg/date=../  (FR-392)
                                            └─ MOD-099 (downstream) consumes the daily agg for billing

FR / policy coverage

Requirement How satisfied Test
FR-389 (≤1s emission to bus) PutEvents from emitUsageEvent(...) — single API call fr-389-emission-latency.test.ts
FR-390 (schema validated) Schema registered in MOD-043 registry; MOD-043 delivery-logger runs the authoritative validation; MOD-097 normaliser runs shallow validation + DLQs invalid records fr-390-schema-validation.test.ts, unit normaliser.test.ts
FR-391 (≤5min to iceberg; no loss) Firehose 300s buffer; EventBridge retry(5) + DLQ; normaliser DLQ; alarms on DLQ depth + firehose freshness fr-391-no-loss.test.ts
FR-392 (daily aggregation) Scheduler-triggered Athena UNLOAD produces usage-agg/date=<d>/*.parquet grouped by tenant/event_type fr-392-daily-aggregation.test.ts, unit aggregator-query.test.ts
REP-001 LOG (immutability) Iceberg bucket policy Deny on DeleteObject/DeleteObjectVersion/DeleteBucket for usage/* + usage-agg/* pol-rep-001.test.ts

SSM outputs published

All under /bank/{env}/mod097/...:

Path Purpose
usage-schema/name, usage-schema/arn Schema registered in MOD-043 registry
normaliser/arn, normaliser/name Lambda identity
normaliser-dlq/arn, normaliser-dlq/url DLQ for async failures
normaliser-rule/arn EventBridge rule on bank-platform bus
aggregator/arn, aggregator/name Aggregator Lambda
aggregator-schedule/arn Daily scheduler
athena-workgroup/name Workgroup with result location
glue-database/name bank_usage_{env}
athena-results/name Results bucket (7d lifecycle)
alerts-topic/arn Health-alarm SNS topic
alarm/normaliser-errors/arn, alarm/normaliser-dlq/arn, alarm/firehose-freshness/arn Alarms

Cross-module dependencies

Input Source Consumed via
bank-platform EventBridge bus ARN MOD-104 mod104.eventbridge.busArn(env, "bank-platform")
bank-platform DLQ ARN MOD-104 mod104.eventbridge.dlqArn(env, "bank-platform")
bank-usage-events Firehose ARN MOD-104 mod104.firehose.usageArn(env)
iceberg bucket ARN + name MOD-104 mod104.s3.icebergArn(env), mod104.s3.name(env, "iceberg")
alerts SNS topic MOD-104 mod104.sns.alertsArn(env) (reference only)
bank-events-{env} schema registry MOD-043 hard-coded name (matches schema-registry.md convention)

Design choices that required judgement

  1. Schema registration in MOD-043's registry (not a new registry). schema-registry.md explicitly says there is one platform-wide registry. MOD-097 adds its schema into that shared registry so MOD-043's delivery-logger can pick it up automatically on the source.detail-type name match.

  2. Immutability via bucket policy, not Object Lock. MOD-104 provisions the iceberg bucket without Object Lock because Iceberg table rewrites and the 7-year Glacier lifecycle transition both require PutObject on existing keys. Adding Object Lock on the bucket would break MOD-104's storage contract. The bucket-policy Deny approach is prefix-scoped (usage/* + usage-agg/* only) and still satisfies REP-001 LOG immutability obligations for this module's records without affecting other iceberg tenants (e.g. CDC data under cdc/*). The design doc notes this and flags the trade-off: BypassGovernanceRetention is explicitly denied, so a future attempt to add Object Lock on this prefix can coexist with the bucket policy.

  3. Normaliser shallow validation + DLQ instead of silent drop. MOD-043's delivery-logger is the authoritative JSON-Schema validator. The normaliser does a narrower shallow check (required fields present, enum known, quantity non-negative) and throws on failure so the Lambda retries and DLQs — honouring FR-391 "no event loss" even for malformed payloads. The only records that don't make it to iceberg are the ones the schema registry AND the normaliser both reject, and those are preserved in the DLQ for 14 days.

  4. Athena UNLOAD + partition projection instead of Glue Crawler. Glue Crawlers are both a cost drag (crawl per run) and eventually-consistent (the day's partition may not exist at 01:00). Partition projection on year/month/day fields lets Athena query new partitions immediately without a crawl. UNLOAD to usage-agg/date=<d>/ produces Parquet/SNAPPY files that Snowflake can ingest directly (MOD-099 path).

  5. Aggregator invokable with a date override. The scheduler runs at 01:00 UTC for "yesterday", but tests need to seed events now and run aggregation against the same day immediately. event.date = "YYYY-MM-DD" gives us that without special-casing prod behaviour.

  6. Shared client-lib source under src/lambdas/client-lib/. Consumed by future modules via workspace:* import once they're packaged (or copy-paste today — keeps the client dependency surface free of external registry pulls). Not bundled by this module's SST deploy — it's a library, not a deployable Lambda.


  • Add bank.platform.usage_event to bank-wiki/source/pages/design/system/event-catalogue.md. The catalogue currently lists only 2 bank.platform.* events (notification_sent + the unpublished system_decision_logged). The canonical shape is duplicated in MOD-097.md but not on the catalogue. Suggested insert under SD07 section using the schema in MOD-097-usage-event-collector/schemas/bank.platform.usage_event.json as the binding reference.

Build gotchas worth institutional memory

  1. EventBridge Scheduler requires a dedicated IAM role with scheduler.amazonaws.com principal. EventBridge Rules use events.amazonaws.com; the Scheduler is a distinct service.
  2. Athena UNLOAD with PARTITIONED BY at the CTAS level does not work inside WITH clauses. UNLOAD is a separate statement type — you write to a single output location per run and partition outside via the folder structure (date=<d>/).
  3. Pulumi aws.s3.BucketPolicy is singular per bucket. MOD-097 has to merge the MOD-104 baseline policy statements (DenyInsecureTransport etc.) into the MOD-097 policy rather than creating a second BucketPolicy — the second would silently replace the first on apply.
  4. Firehose DeliveryStreamName (not ARN) for PutRecord. The normaliser stores the name (bank-usage-events) in an env var, not the ARN.