MOD-041 — Categorisation & merchant enrichment model¶
Module: MOD-041 (categorisation-model) System: SD06 (risk-platform) Repo: bank-risk-platform Phase: 1 ADR: ADR-017 — In-house ML categorisation model
Purpose¶
Snowflake-native ML pipeline that takes every customer transaction
(CDC-replicated from SD01 accounts.postings and SD04
payments.payments) and produces an enriched record with normalised
merchant name, location-level merchant identity, MCC, spend category
L1/L2, and per-location static map image URL. Two-model architecture
per ADR-017:
- Merchant normaliser — rule engine first (regex on top-30 NZ/AU v1 location patterns; expanding to top 100–150 over time), then embedding fallback against canonical chain names for unknown descriptors.
- Category classifier — XGBoost on tabular features (MCC, amount, hour-of-day, day-of-week, channel, country, chain token).
Confidence-routed: ≥0.85 silent auto, 0.60–0.84 nudge, <0.60 "Other". Retrained weekly via Snowflake Cortex ML on customer corrections (CDC'd back from SD01). Champion/challenger governance per DT-005.
The merchant directory is location-level (per scope ruling k-4): "Caltex Parnell" and "Caltex Takapuna" are different rows, each with their own lat/lng and pre-generated static map image. The map image gives the customer a sense-of-place on the payment-detail screen of the mobile app.
Functional + non-functional scope¶
| Code | Requirement |
|---|---|
| FR-237 | ≥85% transactions categorised within 60s of CDC arrival in the Snowflake raw layer (clock starts post-CDC, NOT from payment_created in Postgres — per scope ruling k-1). |
| FR-238 | Enrich with normalised merchant_name + MCC + logo URL (and per-location merchant_location_map_url where available — k-4); store linked to posting_id. |
| FR-239 | Customer re-categorisation captured as labelled training signal. SD01-side write path is forward-looking; this module's stg_customer_corrections is bootstrap-resilient until that ships (k-3). |
| FR-240 | Aggregated category spending visible to insight feed within 60s of categorisation, via bank.transactions.categorised event published by the EB-publisher Lambda (k-2). |
| NFR-022 | Auto-rate ≥85% at 0.85 confidence, monitored via v_category_metrics_daily.auto_rate_pct and the alert_categorisation_model_drift SNS alarm. |
| NFR-004 | Insight card refresh ≤60s — covered by FR-240 EB event timing. |
| NFR-013 | Postgres read p99 ≤5ms — not directly binding here (no Postgres write-back; SD01 owns enrichment write-back). |
| Policy | Mode | How satisfied |
|---|---|---|
| CON-005 | AUTO | Categorised transactions surface a normalised merchant_name (not raw acquirer string) and a meaningful category_l1 — verified by tests/policy/CON-005-auto.test.ts (structural) + tests/integration/fr-237-fr-238-enrichment.test.ts (runtime). |
| DT-005 | LOG | Model versioned in MOD_041_MODEL_VERSIONS (champion/challenger lineage); retrained weekly via Cortex; performance monitored via v_category_metrics_daily and the drift alert. Append-only — tests/policy/DT-005-log.test.ts (structural) + tests/integration/snowflake-immutability.test.ts (runtime grant verification). |
Architecture¶
Module type¶
Hybrid: DCM v2 declarative + dbt models + Python UDFs (out-of-band per ADR-054 §"DCM v2 limitations") + Lambda for EventBridge publication + S3/CloudFront for merchant media assets.
ADR-056 readiness (greenfield from day 1)¶
has_dbt_project: truefrom initial deploy —snow dbt deploy/executeruns in CI.- profiles.yml hybrid
env_var('NAME', 'placeholder-default')form works for both Snowflake-native + legacy dbt-core paths. - All seed data via dbt seed CSVs (COPY-INTO lineage) — no MERGE...VALUES; the 091905 saga from MOD-035 is the precedent.
- Per-table grants only; no
GRANT ON ALL(RISK_CUSTOMER co-residency with MOD-039 + MOD-040). target.warehouseon the DT (no hardcodedPROD_DBT_WH).- Both confidence columns (
categorisation_confidence,normalisation_confidence) typed as NUMBER(5,4) not FLOAT — pre-emptive per the FLOAT-vs-NUMBER ruling on this module's scope.
Data flow¶
SD01 accounts.postings ──┐
├─ CDC (MOD-042) → BANK_{ENV}_CORE.RAW / BANK_{ENV}_PAYMENTS.RAW
SD04 payments.payments ──┘ │
▼
stg_postings + stg_payments (bootstrap-resilient via adapter.get_relation)
│
▼
merchant_directory ◀── stg_merchants ◀──┘
(dbt seed CSV) (CTAS table)
│
▼
int_normalised_merchant (calls NORMALISE_MERCHANT UDF)
│
▼
int_features (joins directory for chain_name + logo_url + map_image_url)
│
▼
┌── categorised_transactions ───┐ (DT — INCREMENTAL, target_lag = 1 minute, calls CATEGORISE_TRANSACTION UDF)
│ (Snowflake DT) │
▼ ▼
v_categorised_current v_category_summary_daily
v_category_metrics_daily (per-customer × category × day spend)
│
│ ↑ drift alert reads metrics view (MOD-076 SNS — k-7)
▼
MOD_041_EB_PUBLISH_CURSOR ◀── 1-min schedule ── EB Publisher Lambda
│
▼
bank-risk-platform EventBridge bus
(bank.transactions.categorised events)
│
┌─────────────────────────────────────────┴──┐
▼ ▼
SD08 dashboard insight feed SD01 enrichment write-back
(FR-240) (FR-238 — populates merchant_canonical, etc.)
Refresh cadence rationale (k-1)¶
The DT's target_lag = '1 minute' directly contradicts the SD06 wiki
data model spec's 30 minutes. FR-237 requires ≥85% of transactions
categorised within 60 seconds of CDC arrival; a 30-minute lag fails
this by definition. This module's design doc takes precedence over
the SD06 wiki text; see the
MOD-041-data-model-gap.handoff.md outbound handoff for the wiki
amendment request.
The cost implication of 1-minute vs 30-minute refresh is real: ~30×
more frequent DT refresh on NONPROD_WH (dev) and PROD_DBT_WH
(prod). Sized into the warehouse capacity plan. CTAS-style daily
spend aggregation views (v_category_summary_daily) are evaluated
on-demand by consumers — not materialised — so the refresh cost is
the DT only.
Snowflake objects owned¶
| Object | Type | Owner | Notes |
|---|---|---|---|
RISK_CUSTOMER schema |
(consumed) | MOD-039 | Co-resident; define schema NOT redeclared (DCM v2 single-project ownership rule, MOD-035 lesson). MOD-041's dcm/pre-dbt.sql only does CREATE SCHEMA IF NOT EXISTS defensively. |
CATEGORISATION_TRAINING |
TABLE (DCM) | this module | Append-only training labels (FR-239 + bootstrap labels). DT-005 LOG — no UPDATE/DELETE/TRUNCATE grants. |
MOD_041_MODEL_VERSIONS |
TABLE (DCM) | this module | Champion/challenger lineage. Append-only. BANK_DBT_ROLE: SELECT + INSERT (the retrain task writes promotion records). |
MOD_041_EB_PUBLISH_CURSOR |
TABLE (DCM) | this module | EB publisher cursor — operational, not audit. Owner role has UPDATE; BANK_DBT_ROLE has SELECT only. |
MOD_041_MODEL_ARTEFACTS |
STAGE (DCM) | this module | Internal stage for pickled XGBoost + merchant directory snapshot. SSE-KMS. BANK_DBT_ROLE: READ. |
CATEGORISE_TRANSACTION(VARCHAR, NUMBER, NUMBER, NUMBER, VARCHAR, VARCHAR, VARCHAR) |
FUNCTION (Python — out-of-band) | this module | XGBoost UDF. CREATE OR REPLACE on each pnpm snowflake:apply --phase=post-model. |
NORMALISE_MERCHANT(VARCHAR, VARCHAR, VARCHAR) |
FUNCTION (Python — out-of-band) | this module | Rule + embedding UDF. Reads merchant directory snapshot pinned via IMPORTS. |
TASK_CATEGORISATION_EVENT_DRIVEN |
TASK (DCM) | this module | 1-minute schedule; bumps cursor's last_run_at. |
ALERT_CATEGORISATION_MODEL_DRIFT |
ALERT (DCM, post-dbt) | this module | Routes to MOD-076 SNS alarm-intake (k-7) when daily auto_rate_pct < 85. |
CATEGORISED_TRANSACTIONS |
DYNAMIC TABLE (dbt) | this module | INCREMENTAL, target_lag = 1 minute, cluster = (customer_id, transaction_date). |
V_CATEGORISED_CURRENT |
VIEW (dbt) | this module | Published contract per ADR-046 §3 — k-8 column set. |
V_CATEGORY_SUMMARY_DAILY |
VIEW (dbt) | this module | Per-customer × category × day spend (auto-confidence rows only). |
V_CATEGORY_METRICS_DAILY |
VIEW (dbt) | this module | DT-005 monitoring source for the drift alert. |
MERCHANT_DIRECTORY |
TABLE (dbt seed) | this module | Location-level (~30 rows v1, expanding to 100–150). COPY-INTO lineage breaks the 091905 trap. |
MCC_CATEGORY_MAPPING |
TABLE (dbt seed) | this module | ISO 18245 → bank category default mapping. |
STG_MERCHANTS |
TABLE (dbt CTAS) | this module | Materialised TABLE not VIEW so the DT-feeding chain doesn't trip Snowflake-native dbt's view-over-DT constraint (MOD-035 lesson). |
Out-of-band Snowflake objects (legacy infra/snowflake/ apply)¶
| Object | Reason |
|---|---|
| Python UDFs (categorise_transaction, normalise_merchant) | DCM v2 doesn't manage non-SQL UDFs (002041). Same MOD-039 score_customer pattern. |
MOD_041_MODEL_ARTEFACTS stage (bootstrap copy) |
DCM owns the canonical declaration in dcm/sources/definitions/stage_model_artefacts.sql but DCM deploy is gated by DCM plan, and plan fails on grants.sql referencing the Python UDFs (which need the stage to PUT artefacts to). infra/snowflake/stages/mod_041_model_artefacts.sql runs CREATE STAGE IF NOT EXISTS ahead of DCM plan via the pre-DCM-plan hook. Idempotent against subsequent DCM plans. Per bank-wiki issue #22. |
| (No streams) | The MOD-039 pattern uses streams; MOD-041 uses a 1-minute scheduled task instead — stream not required. |
Pre-DCM-plan deploy chain (per bank-wiki issue #22)¶
scripts/pre-dcm-plan.sh runs before snow dcm plan to deploy the
Python UDFs the DCM plan references. The bank-platform CI template
detects this script and calls it (Part B of issue #22) instead of
the generic pnpm snowflake:apply fallback. Chain:
pip install -r scripts/requirements-train.txt— numpy + xgboost + scikit-learn.pnpm snowflake:apply --phase=pre-model— creates the bootstrap stage.pnpm model:train— deterministic XGBoost training, writesdist/model-<v>.pkl+.model_card.json.pnpm model:upload— PUTs both artefacts, MERGEsMODEL_VERSIONSrow.pnpm snowflake:apply --phase=post-model— deploys the two Python UDFs withMODEL_VERSIONread from the model card.
Step 3+4 are idempotent on the model_version hash (same code → same
artefact bytes → no-op re-deploy). The chain is also the local-dev
path for forcing a fresh model: STAGE=dev bash scripts/pre-dcm-plan.sh.
SSM outputs¶
| Path | Value | Consumer |
|---|---|---|
/bank/{env}/risk-platform/categorisation/categorised-current-view |
RISK_CUSTOMER.V_CATEGORISED_CURRENT |
SD08 dashboard insight feed (FR-240); SD01 enrichment write-back (FR-238); future MOD-077 correction flow (FR-239). |
/bank/{env}/risk-platform/categorisation/category-summary-view |
RISK_CUSTOMER.V_CATEGORY_SUMMARY_DAILY |
SD08 dashboard insight feed spend-by-category (FR-240). |
/bank/{env}/risk-platform/categorisation/publisher-lambda-arn |
Lambda ARN | Ops dashboards + CloudWatch alarms. |
/bank/{env}/risk-platform/merchant-assets/cdn-base-url |
https://<cf-id>.cloudfront.net |
Publisher Lambda (prefixes logo_url / merchant_location_map_url paths); mobile app dashboard; back office. k-5. |
/bank/{env}/risk-platform/merchant-assets/s3-bucket |
bank-merchant-assets-{env} |
scripts/generate-merchant-maps.ts data-ops tool; manual logo upload. Bucket provisioned by MOD-104 bootstrap post-2026-05-13 (GOV-005 SCP blocks s3:CreateBucket for bank-platform-cicd). |
SSM consumed¶
| Path | Origin |
|---|---|
/bank/{env}/snowflake/databases/risk |
MOD-102 |
/bank/{env}/snowflake/account-locator |
MOD-102 |
/bank/{env}/snowflake/warehouses/etl |
MOD-102 |
/bank/{env}/snowflake/roles/domain-{nonprod\|prod} |
MOD-102 |
/bank/{env}/iam/lambda/bank-risk-platform/arn |
MOD-104 |
/bank/{env}/observability/adot-layer-arn |
MOD-104 |
/bank/{env}/eventbridge/bank-risk-platform/arn |
MOD-104 |
/bank/{env}/eventbridge/bank-risk-platform/dlq-arn |
MOD-104 |
/bank/{env}/sns/alarm-intake/arn |
MOD-076 — referenced by alert_categorisation_model_drift (k-7) |
/bank/{env}/kms/operational/arn |
MOD-104 — for SSE-KMS on the merchant-assets bucket |
Events¶
Published¶
bank.transactions.categorised (Source: bank.risk-platform,
DetailType: transactions_categorised)
Per scope rulings k-2 + k-8. JSON Schema authoritative source:
docs/event-schemas/bank.transactions.categorised.v1.schema.json.
{
// Standard envelope (ADR-051)
"event_id": "uuid",
"event_time": "2026-05-08T01:00:01.234Z",
"schema_version": "1.0",
"trace_id": "uuid",
"idempotency_key": "sha256(posting_id|model_version)",
// Categorisation payload
"posting_id": "p_001",
"customer_id": "c_001",
"jurisdiction": "NZ",
"merchant_name": "Caltex – Parnell", // nullable (k-6)
"logo_url": "https://<cdn>/logos/caltex.png", // nullable (k-6); CDN-prefixed
"merchant_location_map_url": "https://<cdn>/maps/caltex-parnell.jpg", // nullable (k-6)
"mcc": "5541", // nullable
"category_l1": "TRANSPORT",
"category_l2": "FUEL", // nullable (k-6)
"categorisation_confidence": "0.9700", // decimal-string (NUMBER(5,4))
"model_version": "v2026-05-08-abc123"
}
Subscribers (defined in their own modules' Pulumi projects):
- SD08 dashboard insight feed — FR-240 spend-by-category refresh
- SD01 enrichment write-back — FR-238 populate merchant_canonical,
merchant_logo_id, category on the source posting
Consumed¶
None. FR-239 customer corrections come via MOD-042 CDC, NOT EventBridge.
Operational¶
Warehouse sizing¶
The 1-minute target_lag (k-1) means the DT refreshes every minute
on NONPROD_WH (dev/uat) or PROD_DBT_WH (prod). Per dbt-snowflake
Dynamic Table semantics, an INCREMENTAL refresh that finds zero
upstream changes is sub-second; a refresh with new rows scales with
the change volume.
Rough capacity estimate at 50 transactions/minute steady-state: - Per-refresh featurise + categorise: ~50 UDF calls × ~10ms = ~500ms - Per-refresh DT MERGE: ~100ms write - Total compute time: ~30s per hour ≈ 0.04% of one X-Small warehouse-hour
Even at 10× that (peak/burst), the DT cost is negligible against the existing dbt build cost.
Drift alert response (per k-7)¶
The alert_categorisation_model_drift alert fires hourly when
yesterday's auto_rate_pct < 85.00 (NFR-022 floor). When it fires:
- Inspect the
v_category_metrics_dailyrow that fired the alert. Single-day blip vs multi-day trend? - Check
correction_rate_pctin the same view — drift in auto-rate often correlates with a spike in customer corrections (the leading indicator). - Look at
categorisation_trainingfor recent label distribution shift (e.g. unusual concentration in new merchant patterns or a new MCC code seen for the first time). - Trigger a manual retrain via
STAGE=<env> bash scripts/pre-dcm-plan.sh(or equivalentlypnpm pre-dcm-planwith STAGE exported) if the trend is sustained — don't wait for the weekly schedule. The script runs the full train + upload + post-model chain.
If the drift is caused by upstream data quality (e.g. CDC paused,
MOD-042 stream behind) the alert is misleading; check
v_category_metrics_daily.total_transactions first.
Curating the merchant directory¶
Add a new location (per scope ruling k-4):
- Edit
MOD-041-categorisation-model/seeds/merchant_directory.csv— add the new location row with lat/lng populated andmap_image_urlblank. - Run
MAPBOX_ACCESS_TOKEN=… pnpm merchant:generate-maps --stage=dev --provider=mapbox(operator's local environment). - Review
seeds/merchant_directory.patch.csv— copy the newmap_image_urlvalues back intoseeds/merchant_directory.csv. - Commit. Next CI dispatch re-seeds the directory and the new location is live.
Logo uploads are manual — the merchant-curation team uploads
logos/<chain>.png directly to the S3 bucket
(/bank/{env}/risk-platform/merchant-assets/s3-bucket SSM points at
bank-merchant-assets-{env}). Logos rarely change so no CI
automation.
Why the drift alert routes to MOD-076 specifically¶
Per scope ruling k-7: same pattern as MOD-030 / MOD-065 DLQ depth alarms. MOD-076 owns the SNS alarm-intake topic that fans out to the on-call PagerDuty + Slack channels. Per-module SNS topics would fragment the on-call experience.