MOD-041 — Categorisation & merchant enrichment model¶

Module: MOD-041 (categorisation-model) System: SD06 (risk-platform) Repo: bank-risk-platform Phase: 1 ADR: ADR-017 — In-house ML categorisation model

Purpose¶

Snowflake-native ML pipeline that takes every customer transaction (CDC-replicated from SD01 accounts.postings and SD04 payments.payments) and produces an enriched record with normalised merchant name, location-level merchant identity, MCC, spend category L1/L2, and per-location static map image URL. Two-model architecture per ADR-017:

Merchant normaliser — rule engine first (regex on top-30 NZ/AU v1 location patterns; expanding to top 100–150 over time), then embedding fallback against canonical chain names for unknown descriptors.
Category classifier — XGBoost on tabular features (MCC, amount, hour-of-day, day-of-week, channel, country, chain token).

Confidence-routed: ≥0.85 silent auto, 0.60–0.84 nudge, <0.60 "Other". Retrained weekly via Snowflake Cortex ML on customer corrections (CDC'd back from SD01). Champion/challenger governance per DT-005.

The merchant directory is location-level (per scope ruling k-4): "Caltex Parnell" and "Caltex Takapuna" are different rows, each with their own lat/lng and pre-generated static map image. The map image gives the customer a sense-of-place on the payment-detail screen of the mobile app.

Functional + non-functional scope¶

Code	Requirement
FR-237	≥85% transactions categorised within 60s of CDC arrival in the Snowflake raw layer (clock starts post-CDC, NOT from payment_created in Postgres — per scope ruling k-1).
FR-238	Enrich with normalised merchant_name + MCC + logo URL (and per-location merchant_location_map_url where available — k-4); store linked to posting_id.
FR-239	Customer re-categorisation captured as labelled training signal. SD01-side write path is forward-looking; this module's `stg_customer_corrections` is bootstrap-resilient until that ships (k-3).
FR-240	Aggregated category spending visible to insight feed within 60s of categorisation, via `bank.transactions.categorised` event published by the EB-publisher Lambda (k-2).
NFR-022	Auto-rate ≥85% at 0.85 confidence, monitored via `v_category_metrics_daily.auto_rate_pct` and the `alert_categorisation_model_drift` SNS alarm.
NFR-004	Insight card refresh ≤60s — covered by FR-240 EB event timing.
NFR-013	Postgres read p99 ≤5ms — not directly binding here (no Postgres write-back; SD01 owns enrichment write-back).

Policy	Mode	How satisfied
CON-005	AUTO	Categorised transactions surface a normalised `merchant_name` (not raw acquirer string) and a meaningful `category_l1` — verified by `tests/policy/CON-005-auto.test.ts` (structural) + `tests/integration/fr-237-fr-238-enrichment.test.ts` (runtime).
DT-005	LOG	Model versioned in `MOD_041_MODEL_VERSIONS` (champion/challenger lineage); retrained weekly via Cortex; performance monitored via `v_category_metrics_daily` and the drift alert. Append-only — `tests/policy/DT-005-log.test.ts` (structural) + `tests/integration/snowflake-immutability.test.ts` (runtime grant verification).

Architecture¶

Module type¶

Hybrid: DCM v2 declarative + dbt models + Python UDFs (out-of-band per ADR-054 §"DCM v2 limitations") + Lambda for EventBridge publication + S3/CloudFront for merchant media assets.

ADR-056 readiness (greenfield from day 1)¶

has_dbt_project: true from initial deploy — snow dbt deploy/execute runs in CI.
profiles.yml hybrid env_var('NAME', 'placeholder-default') form works for both Snowflake-native + legacy dbt-core paths.
All seed data via dbt seed CSVs (COPY-INTO lineage) — no MERGE...VALUES; the 091905 saga from MOD-035 is the precedent.
Per-table grants only; no GRANT ON ALL (RISK_CUSTOMER co-residency with MOD-039 + MOD-040).
target.warehouse on the DT (no hardcoded PROD_DBT_WH).
Both confidence columns (categorisation_confidence, normalisation_confidence) typed as NUMBER(5,4) not FLOAT — pre-emptive per the FLOAT-vs-NUMBER ruling on this module's scope.

Data flow¶

SD01 accounts.postings ──┐
                         ├─ CDC (MOD-042) → BANK_{ENV}_CORE.RAW / BANK_{ENV}_PAYMENTS.RAW
SD04 payments.payments ──┘                              │
                                                         ▼
                                  stg_postings + stg_payments (bootstrap-resilient via adapter.get_relation)
                                                         │
                                                         ▼
            merchant_directory  ◀──  stg_merchants   ◀──┘
            (dbt seed CSV)            (CTAS table)
                                                         │
                                                         ▼
                                  int_normalised_merchant   (calls NORMALISE_MERCHANT UDF)
                                                         │
                                                         ▼
                                  int_features              (joins directory for chain_name + logo_url + map_image_url)
                                                         │
                                                         ▼
                          ┌── categorised_transactions ───┐  (DT — INCREMENTAL, target_lag = 1 minute, calls CATEGORISE_TRANSACTION UDF)
                          │       (Snowflake DT)          │
                          ▼                                ▼
              v_categorised_current             v_category_summary_daily
              v_category_metrics_daily          (per-customer × category × day spend)
                          │
                          │   ↑    drift alert reads metrics view (MOD-076 SNS — k-7)
                          ▼
              MOD_041_EB_PUBLISH_CURSOR ◀── 1-min schedule ── EB Publisher Lambda
                                                                    │
                                                                    ▼
                                                            bank-risk-platform EventBridge bus
                                                            (bank.transactions.categorised events)
                                                                    │
                          ┌─────────────────────────────────────────┴──┐
                          ▼                                            ▼
                    SD08 dashboard insight feed             SD01 enrichment write-back
                    (FR-240)                                (FR-238 — populates merchant_canonical, etc.)

Refresh cadence rationale (k-1)¶

The DT's target_lag = '1 minute' directly contradicts the SD06 wiki data model spec's 30 minutes. FR-237 requires ≥85% of transactions categorised within 60 seconds of CDC arrival; a 30-minute lag fails this by definition. This module's design doc takes precedence over the SD06 wiki text; see the MOD-041-data-model-gap.handoff.md outbound handoff for the wiki amendment request.

The cost implication of 1-minute vs 30-minute refresh is real: ~30× more frequent DT refresh on NONPROD_WH (dev) and PROD_DBT_WH (prod). Sized into the warehouse capacity plan. CTAS-style daily spend aggregation views (v_category_summary_daily) are evaluated on-demand by consumers — not materialised — so the refresh cost is the DT only.

Snowflake objects owned¶

Object	Type	Owner	Notes
`RISK_CUSTOMER` schema	(consumed)	MOD-039	Co-resident; `define schema` NOT redeclared (DCM v2 single-project ownership rule, MOD-035 lesson). MOD-041's `dcm/pre-dbt.sql` only does `CREATE SCHEMA IF NOT EXISTS` defensively.
`CATEGORISATION_TRAINING`	TABLE (DCM)	this module	Append-only training labels (FR-239 + bootstrap labels). DT-005 LOG — no UPDATE/DELETE/TRUNCATE grants.
`MOD_041_MODEL_VERSIONS`	TABLE (DCM)	this module	Champion/challenger lineage. Append-only. BANK_DBT_ROLE: SELECT + INSERT (the retrain task writes promotion records).
`MOD_041_EB_PUBLISH_CURSOR`	TABLE (DCM)	this module	EB publisher cursor — operational, not audit. Owner role has UPDATE; BANK_DBT_ROLE has SELECT only.
`MOD_041_MODEL_ARTEFACTS`	STAGE (DCM)	this module	Internal stage for pickled XGBoost + merchant directory snapshot. SSE-KMS. BANK_DBT_ROLE: READ.
`CATEGORISE_TRANSACTION(VARCHAR, NUMBER, NUMBER, NUMBER, VARCHAR, VARCHAR, VARCHAR)`	FUNCTION (Python — out-of-band)	this module	XGBoost UDF. CREATE OR REPLACE on each `pnpm snowflake:apply --phase=post-model`.
`NORMALISE_MERCHANT(VARCHAR, VARCHAR, VARCHAR)`	FUNCTION (Python — out-of-band)	this module	Rule + embedding UDF. Reads merchant directory snapshot pinned via IMPORTS.
`TASK_CATEGORISATION_EVENT_DRIVEN`	TASK (DCM)	this module	1-minute schedule; bumps cursor's last_run_at.
`ALERT_CATEGORISATION_MODEL_DRIFT`	ALERT (DCM, post-dbt)	this module	Routes to MOD-076 SNS alarm-intake (k-7) when daily auto_rate_pct < 85.
`CATEGORISED_TRANSACTIONS`	DYNAMIC TABLE (dbt)	this module	INCREMENTAL, target_lag = 1 minute, cluster = (customer_id, transaction_date).
`V_CATEGORISED_CURRENT`	VIEW (dbt)	this module	Published contract per ADR-046 §3 — k-8 column set.
`V_CATEGORY_SUMMARY_DAILY`	VIEW (dbt)	this module	Per-customer × category × day spend (auto-confidence rows only).
`V_CATEGORY_METRICS_DAILY`	VIEW (dbt)	this module	DT-005 monitoring source for the drift alert.
`MERCHANT_DIRECTORY`	TABLE (dbt seed)	this module	Location-level (~30 rows v1, expanding to 100–150). COPY-INTO lineage breaks the 091905 trap.
`MCC_CATEGORY_MAPPING`	TABLE (dbt seed)	this module	ISO 18245 → bank category default mapping.
`STG_MERCHANTS`	TABLE (dbt CTAS)	this module	Materialised TABLE not VIEW so the DT-feeding chain doesn't trip Snowflake-native dbt's view-over-DT constraint (MOD-035 lesson).

Out-of-band Snowflake objects (legacy `infra/snowflake/` apply)¶

Object	Reason
Python UDFs (categorise_transaction, normalise_merchant)	DCM v2 doesn't manage non-SQL UDFs (002041). Same MOD-039 score_customer pattern.
`MOD_041_MODEL_ARTEFACTS` stage (bootstrap copy)	DCM owns the canonical declaration in `dcm/sources/definitions/stage_model_artefacts.sql` but DCM deploy is gated by DCM plan, and plan fails on grants.sql referencing the Python UDFs (which need the stage to PUT artefacts to). `infra/snowflake/stages/mod_041_model_artefacts.sql` runs `CREATE STAGE IF NOT EXISTS` ahead of DCM plan via the pre-DCM-plan hook. Idempotent against subsequent DCM plans. Per bank-wiki issue #22.
(No streams)	The MOD-039 pattern uses streams; MOD-041 uses a 1-minute scheduled task instead — stream not required.

Pre-DCM-plan deploy chain (per bank-wiki issue #22)¶

scripts/pre-dcm-plan.sh runs before snow dcm plan to deploy the Python UDFs the DCM plan references. The bank-platform CI template detects this script and calls it (Part B of issue #22) instead of the generic pnpm snowflake:apply fallback. Chain:

pip install -r scripts/requirements-train.txt — numpy + xgboost + scikit-learn.
pnpm snowflake:apply --phase=pre-model — creates the bootstrap stage.
pnpm model:train — deterministic XGBoost training, writes dist/model-<v>.pkl + .model_card.json.
pnpm model:upload — PUTs both artefacts, MERGEs MODEL_VERSIONS row.
pnpm snowflake:apply --phase=post-model — deploys the two Python UDFs with MODEL_VERSION read from the model card.

Step 3+4 are idempotent on the model_version hash (same code → same artefact bytes → no-op re-deploy). The chain is also the local-dev path for forcing a fresh model: STAGE=dev bash scripts/pre-dcm-plan.sh.

SSM outputs¶

Path	Value	Consumer
`/bank/{env}/risk-platform/categorisation/categorised-current-view`	`RISK_CUSTOMER.V_CATEGORISED_CURRENT`	SD08 dashboard insight feed (FR-240); SD01 enrichment write-back (FR-238); future MOD-077 correction flow (FR-239).
`/bank/{env}/risk-platform/categorisation/category-summary-view`	`RISK_CUSTOMER.V_CATEGORY_SUMMARY_DAILY`	SD08 dashboard insight feed spend-by-category (FR-240).
`/bank/{env}/risk-platform/categorisation/publisher-lambda-arn`	Lambda ARN	Ops dashboards + CloudWatch alarms.
`/bank/{env}/risk-platform/merchant-assets/cdn-base-url`	`https://<cf-id>.cloudfront.net`	Publisher Lambda (prefixes `logo_url` / `merchant_location_map_url` paths); mobile app dashboard; back office. k-5.
`/bank/{env}/risk-platform/merchant-assets/s3-bucket`	`bank-merchant-assets-{env}`	`scripts/generate-merchant-maps.ts` data-ops tool; manual logo upload. Bucket provisioned by MOD-104 bootstrap post-2026-05-13 (GOV-005 SCP blocks `s3:CreateBucket` for `bank-platform-cicd`).

SSM consumed¶

Path	Origin
`/bank/{env}/snowflake/databases/risk`	MOD-102
`/bank/{env}/snowflake/account-locator`	MOD-102
`/bank/{env}/snowflake/warehouses/etl`	MOD-102
`/bank/{env}/snowflake/roles/domain-{nonprod\\|prod}`	MOD-102
`/bank/{env}/iam/lambda/bank-risk-platform/arn`	MOD-104
`/bank/{env}/observability/adot-layer-arn`	MOD-104
`/bank/{env}/eventbridge/bank-risk-platform/arn`	MOD-104
`/bank/{env}/eventbridge/bank-risk-platform/dlq-arn`	MOD-104
`/bank/{env}/sns/alarm-intake/arn`	MOD-076 — referenced by `alert_categorisation_model_drift` (k-7)
`/bank/{env}/kms/operational/arn`	MOD-104 — for SSE-KMS on the merchant-assets bucket

Events¶

Published¶

bank.transactions.categorised (Source: bank.risk-platform, DetailType: transactions_categorised)

Per scope rulings k-2 + k-8. JSON Schema authoritative source: docs/event-schemas/bank.transactions.categorised.v1.schema.json.

{
  // Standard envelope (ADR-051)
  "event_id": "uuid",
  "event_time": "2026-05-08T01:00:01.234Z",
  "schema_version": "1.0",
  "trace_id": "uuid",
  "idempotency_key": "sha256(posting_id|model_version)",
  // Categorisation payload
  "posting_id": "p_001",
  "customer_id": "c_001",
  "jurisdiction": "NZ",
  "merchant_name": "Caltex – Parnell",            // nullable (k-6)
  "logo_url":      "https://<cdn>/logos/caltex.png",  // nullable (k-6); CDN-prefixed
  "merchant_location_map_url": "https://<cdn>/maps/caltex-parnell.jpg",  // nullable (k-6)
  "mcc":           "5541",                        // nullable
  "category_l1":   "TRANSPORT",
  "category_l2":   "FUEL",                        // nullable (k-6)
  "categorisation_confidence": "0.9700",          // decimal-string (NUMBER(5,4))
  "model_version": "v2026-05-08-abc123"
}

Subscribers (defined in their own modules' Pulumi projects): - SD08 dashboard insight feed — FR-240 spend-by-category refresh - SD01 enrichment write-back — FR-238 populate merchant_canonical, merchant_logo_id, category on the source posting

Consumed¶

None. FR-239 customer corrections come via MOD-042 CDC, NOT EventBridge.

Operational¶

Warehouse sizing¶

The 1-minute target_lag (k-1) means the DT refreshes every minute on NONPROD_WH (dev/uat) or PROD_DBT_WH (prod). Per dbt-snowflake Dynamic Table semantics, an INCREMENTAL refresh that finds zero upstream changes is sub-second; a refresh with new rows scales with the change volume.

Rough capacity estimate at 50 transactions/minute steady-state: - Per-refresh featurise + categorise: ~50 UDF calls × ~10ms = ~500ms - Per-refresh DT MERGE: ~100ms write - Total compute time: ~30s per hour ≈ 0.04% of one X-Small warehouse-hour

Even at 10× that (peak/burst), the DT cost is negligible against the existing dbt build cost.

Drift alert response (per k-7)¶

The alert_categorisation_model_drift alert fires hourly when yesterday's auto_rate_pct < 85.00 (NFR-022 floor). When it fires:

Inspect the v_category_metrics_daily row that fired the alert. Single-day blip vs multi-day trend?
Check correction_rate_pct in the same view — drift in auto-rate often correlates with a spike in customer corrections (the leading indicator).
Look at categorisation_training for recent label distribution shift (e.g. unusual concentration in new merchant patterns or a new MCC code seen for the first time).
Trigger a manual retrain via STAGE=<env> bash scripts/pre-dcm-plan.sh (or equivalently pnpm pre-dcm-plan with STAGE exported) if the trend is sustained — don't wait for the weekly schedule. The script runs the full train + upload + post-model chain.

If the drift is caused by upstream data quality (e.g. CDC paused, MOD-042 stream behind) the alert is misleading; check v_category_metrics_daily.total_transactions first.

Curating the merchant directory¶

Add a new location (per scope ruling k-4):

Edit MOD-041-categorisation-model/seeds/merchant_directory.csv — add the new location row with lat/lng populated and map_image_url blank.
Run MAPBOX_ACCESS_TOKEN=… pnpm merchant:generate-maps --stage=dev --provider=mapbox (operator's local environment).
Review seeds/merchant_directory.patch.csv — copy the new map_image_url values back into seeds/merchant_directory.csv.
Commit. Next CI dispatch re-seeds the directory and the new location is live.

Logo uploads are manual — the merchant-curation team uploads logos/<chain>.png directly to the S3 bucket (/bank/{env}/risk-platform/merchant-assets/s3-bucket SSM points at bank-merchant-assets-{env}). Logos rarely change so no CI automation.

Why the drift alert routes to MOD-076 specifically¶

Per scope ruling k-7: same pattern as MOD-030 / MOD-065 DLQ depth alarms. MOD-076 owns the SNS alarm-intake topic that fans out to the on-call PagerDuty + Slack channels. Per-module SNS topics would fragment the on-call experience.