Skip to content

MOD-041 — Categorisation & merchant enrichment model

Module: MOD-041 (categorisation-model) System: SD06 (risk-platform) Repo: bank-risk-platform Phase: 1 ADR: ADR-017 — In-house ML categorisation model


Purpose

Snowflake-native ML pipeline that takes every customer transaction (CDC-replicated from SD01 accounts.postings and SD04 payments.payments) and produces an enriched record with normalised merchant name, location-level merchant identity, MCC, spend category L1/L2, and per-location static map image URL. Two-model architecture per ADR-017:

  • Merchant normaliser — rule engine first (regex on top-30 NZ/AU v1 location patterns; expanding to top 100–150 over time), then embedding fallback against canonical chain names for unknown descriptors.
  • Category classifier — XGBoost on tabular features (MCC, amount, hour-of-day, day-of-week, channel, country, chain token).

Confidence-routed: ≥0.85 silent auto, 0.60–0.84 nudge, <0.60 "Other". Retrained weekly via Snowflake Cortex ML on customer corrections (CDC'd back from SD01). Champion/challenger governance per DT-005.

The merchant directory is location-level (per scope ruling k-4): "Caltex Parnell" and "Caltex Takapuna" are different rows, each with their own lat/lng and pre-generated static map image. The map image gives the customer a sense-of-place on the payment-detail screen of the mobile app.


Functional + non-functional scope

Code Requirement
FR-237 ≥85% transactions categorised within 60s of CDC arrival in the Snowflake raw layer (clock starts post-CDC, NOT from payment_created in Postgres — per scope ruling k-1).
FR-238 Enrich with normalised merchant_name + MCC + logo URL (and per-location merchant_location_map_url where available — k-4); store linked to posting_id.
FR-239 Customer re-categorisation captured as labelled training signal. SD01-side write path is forward-looking; this module's stg_customer_corrections is bootstrap-resilient until that ships (k-3).
FR-240 Aggregated category spending visible to insight feed within 60s of categorisation, via bank.transactions.categorised event published by the EB-publisher Lambda (k-2).
NFR-022 Auto-rate ≥85% at 0.85 confidence, monitored via v_category_metrics_daily.auto_rate_pct and the alert_categorisation_model_drift SNS alarm.
NFR-004 Insight card refresh ≤60s — covered by FR-240 EB event timing.
NFR-013 Postgres read p99 ≤5ms — not directly binding here (no Postgres write-back; SD01 owns enrichment write-back).
Policy Mode How satisfied
CON-005 AUTO Categorised transactions surface a normalised merchant_name (not raw acquirer string) and a meaningful category_l1 — verified by tests/policy/CON-005-auto.test.ts (structural) + tests/integration/fr-237-fr-238-enrichment.test.ts (runtime).
DT-005 LOG Model versioned in MOD_041_MODEL_VERSIONS (champion/challenger lineage); retrained weekly via Cortex; performance monitored via v_category_metrics_daily and the drift alert. Append-only — tests/policy/DT-005-log.test.ts (structural) + tests/integration/snowflake-immutability.test.ts (runtime grant verification).

Architecture

Module type

Hybrid: DCM v2 declarative + dbt models + Python UDFs (out-of-band per ADR-054 §"DCM v2 limitations") + Lambda for EventBridge publication + S3/CloudFront for merchant media assets.

ADR-056 readiness (greenfield from day 1)

  • has_dbt_project: true from initial deploy — snow dbt deploy/execute runs in CI.
  • profiles.yml hybrid env_var('NAME', 'placeholder-default') form works for both Snowflake-native + legacy dbt-core paths.
  • All seed data via dbt seed CSVs (COPY-INTO lineage) — no MERGE...VALUES; the 091905 saga from MOD-035 is the precedent.
  • Per-table grants only; no GRANT ON ALL (RISK_CUSTOMER co-residency with MOD-039 + MOD-040).
  • target.warehouse on the DT (no hardcoded PROD_DBT_WH).
  • Both confidence columns (categorisation_confidence, normalisation_confidence) typed as NUMBER(5,4) not FLOAT — pre-emptive per the FLOAT-vs-NUMBER ruling on this module's scope.

Data flow

SD01 accounts.postings ──┐
                         ├─ CDC (MOD-042) → BANK_{ENV}_CORE.RAW / BANK_{ENV}_PAYMENTS.RAW
SD04 payments.payments ──┘                              │
                                  stg_postings + stg_payments (bootstrap-resilient via adapter.get_relation)
            merchant_directory  ◀──  stg_merchants   ◀──┘
            (dbt seed CSV)            (CTAS table)
                                  int_normalised_merchant   (calls NORMALISE_MERCHANT UDF)
                                  int_features              (joins directory for chain_name + logo_url + map_image_url)
                          ┌── categorised_transactions ───┐  (DT — INCREMENTAL, target_lag = 1 minute, calls CATEGORISE_TRANSACTION UDF)
                          │       (Snowflake DT)          │
                          ▼                                ▼
              v_categorised_current             v_category_summary_daily
              v_category_metrics_daily          (per-customer × category × day spend)
                          │   ↑    drift alert reads metrics view (MOD-076 SNS — k-7)
              MOD_041_EB_PUBLISH_CURSOR ◀── 1-min schedule ── EB Publisher Lambda
                                                            bank-risk-platform EventBridge bus
                                                            (bank.transactions.categorised events)
                          ┌─────────────────────────────────────────┴──┐
                          ▼                                            ▼
                    SD08 dashboard insight feed             SD01 enrichment write-back
                    (FR-240)                                (FR-238 — populates merchant_canonical, etc.)

Refresh cadence rationale (k-1)

The DT's target_lag = '1 minute' directly contradicts the SD06 wiki data model spec's 30 minutes. FR-237 requires ≥85% of transactions categorised within 60 seconds of CDC arrival; a 30-minute lag fails this by definition. This module's design doc takes precedence over the SD06 wiki text; see the MOD-041-data-model-gap.handoff.md outbound handoff for the wiki amendment request.

The cost implication of 1-minute vs 30-minute refresh is real: ~30× more frequent DT refresh on NONPROD_WH (dev) and PROD_DBT_WH (prod). Sized into the warehouse capacity plan. CTAS-style daily spend aggregation views (v_category_summary_daily) are evaluated on-demand by consumers — not materialised — so the refresh cost is the DT only.


Snowflake objects owned

Object Type Owner Notes
RISK_CUSTOMER schema (consumed) MOD-039 Co-resident; define schema NOT redeclared (DCM v2 single-project ownership rule, MOD-035 lesson). MOD-041's dcm/pre-dbt.sql only does CREATE SCHEMA IF NOT EXISTS defensively.
CATEGORISATION_TRAINING TABLE (DCM) this module Append-only training labels (FR-239 + bootstrap labels). DT-005 LOG — no UPDATE/DELETE/TRUNCATE grants.
MOD_041_MODEL_VERSIONS TABLE (DCM) this module Champion/challenger lineage. Append-only. BANK_DBT_ROLE: SELECT + INSERT (the retrain task writes promotion records).
MOD_041_EB_PUBLISH_CURSOR TABLE (DCM) this module EB publisher cursor — operational, not audit. Owner role has UPDATE; BANK_DBT_ROLE has SELECT only.
MOD_041_MODEL_ARTEFACTS STAGE (DCM) this module Internal stage for pickled XGBoost + merchant directory snapshot. SSE-KMS. BANK_DBT_ROLE: READ.
CATEGORISE_TRANSACTION(VARCHAR, NUMBER, NUMBER, NUMBER, VARCHAR, VARCHAR, VARCHAR) FUNCTION (Python — out-of-band) this module XGBoost UDF. CREATE OR REPLACE on each pnpm snowflake:apply --phase=post-model.
NORMALISE_MERCHANT(VARCHAR, VARCHAR, VARCHAR) FUNCTION (Python — out-of-band) this module Rule + embedding UDF. Reads merchant directory snapshot pinned via IMPORTS.
TASK_CATEGORISATION_EVENT_DRIVEN TASK (DCM) this module 1-minute schedule; bumps cursor's last_run_at.
ALERT_CATEGORISATION_MODEL_DRIFT ALERT (DCM, post-dbt) this module Routes to MOD-076 SNS alarm-intake (k-7) when daily auto_rate_pct < 85.
CATEGORISED_TRANSACTIONS DYNAMIC TABLE (dbt) this module INCREMENTAL, target_lag = 1 minute, cluster = (customer_id, transaction_date).
V_CATEGORISED_CURRENT VIEW (dbt) this module Published contract per ADR-046 §3 — k-8 column set.
V_CATEGORY_SUMMARY_DAILY VIEW (dbt) this module Per-customer × category × day spend (auto-confidence rows only).
V_CATEGORY_METRICS_DAILY VIEW (dbt) this module DT-005 monitoring source for the drift alert.
MERCHANT_DIRECTORY TABLE (dbt seed) this module Location-level (~30 rows v1, expanding to 100–150). COPY-INTO lineage breaks the 091905 trap.
MCC_CATEGORY_MAPPING TABLE (dbt seed) this module ISO 18245 → bank category default mapping.
STG_MERCHANTS TABLE (dbt CTAS) this module Materialised TABLE not VIEW so the DT-feeding chain doesn't trip Snowflake-native dbt's view-over-DT constraint (MOD-035 lesson).

Out-of-band Snowflake objects (legacy infra/snowflake/ apply)

Object Reason
Python UDFs (categorise_transaction, normalise_merchant) DCM v2 doesn't manage non-SQL UDFs (002041). Same MOD-039 score_customer pattern.
MOD_041_MODEL_ARTEFACTS stage (bootstrap copy) DCM owns the canonical declaration in dcm/sources/definitions/stage_model_artefacts.sql but DCM deploy is gated by DCM plan, and plan fails on grants.sql referencing the Python UDFs (which need the stage to PUT artefacts to). infra/snowflake/stages/mod_041_model_artefacts.sql runs CREATE STAGE IF NOT EXISTS ahead of DCM plan via the pre-DCM-plan hook. Idempotent against subsequent DCM plans. Per bank-wiki issue #22.
(No streams) The MOD-039 pattern uses streams; MOD-041 uses a 1-minute scheduled task instead — stream not required.

Pre-DCM-plan deploy chain (per bank-wiki issue #22)

scripts/pre-dcm-plan.sh runs before snow dcm plan to deploy the Python UDFs the DCM plan references. The bank-platform CI template detects this script and calls it (Part B of issue #22) instead of the generic pnpm snowflake:apply fallback. Chain:

  1. pip install -r scripts/requirements-train.txt — numpy + xgboost + scikit-learn.
  2. pnpm snowflake:apply --phase=pre-model — creates the bootstrap stage.
  3. pnpm model:train — deterministic XGBoost training, writes dist/model-<v>.pkl + .model_card.json.
  4. pnpm model:upload — PUTs both artefacts, MERGEs MODEL_VERSIONS row.
  5. pnpm snowflake:apply --phase=post-model — deploys the two Python UDFs with MODEL_VERSION read from the model card.

Step 3+4 are idempotent on the model_version hash (same code → same artefact bytes → no-op re-deploy). The chain is also the local-dev path for forcing a fresh model: STAGE=dev bash scripts/pre-dcm-plan.sh.


SSM outputs

Path Value Consumer
/bank/{env}/risk-platform/categorisation/categorised-current-view RISK_CUSTOMER.V_CATEGORISED_CURRENT SD08 dashboard insight feed (FR-240); SD01 enrichment write-back (FR-238); future MOD-077 correction flow (FR-239).
/bank/{env}/risk-platform/categorisation/category-summary-view RISK_CUSTOMER.V_CATEGORY_SUMMARY_DAILY SD08 dashboard insight feed spend-by-category (FR-240).
/bank/{env}/risk-platform/categorisation/publisher-lambda-arn Lambda ARN Ops dashboards + CloudWatch alarms.
/bank/{env}/risk-platform/merchant-assets/cdn-base-url https://<cf-id>.cloudfront.net Publisher Lambda (prefixes logo_url / merchant_location_map_url paths); mobile app dashboard; back office. k-5.
/bank/{env}/risk-platform/merchant-assets/s3-bucket bank-merchant-assets-{env} scripts/generate-merchant-maps.ts data-ops tool; manual logo upload. Bucket provisioned by MOD-104 bootstrap post-2026-05-13 (GOV-005 SCP blocks s3:CreateBucket for bank-platform-cicd).

SSM consumed

Path Origin
/bank/{env}/snowflake/databases/risk MOD-102
/bank/{env}/snowflake/account-locator MOD-102
/bank/{env}/snowflake/warehouses/etl MOD-102
/bank/{env}/snowflake/roles/domain-{nonprod\|prod} MOD-102
/bank/{env}/iam/lambda/bank-risk-platform/arn MOD-104
/bank/{env}/observability/adot-layer-arn MOD-104
/bank/{env}/eventbridge/bank-risk-platform/arn MOD-104
/bank/{env}/eventbridge/bank-risk-platform/dlq-arn MOD-104
/bank/{env}/sns/alarm-intake/arn MOD-076 — referenced by alert_categorisation_model_drift (k-7)
/bank/{env}/kms/operational/arn MOD-104 — for SSE-KMS on the merchant-assets bucket

Events

Published

bank.transactions.categorised (Source: bank.risk-platform, DetailType: transactions_categorised)

Per scope rulings k-2 + k-8. JSON Schema authoritative source: docs/event-schemas/bank.transactions.categorised.v1.schema.json.

{
  // Standard envelope (ADR-051)
  "event_id": "uuid",
  "event_time": "2026-05-08T01:00:01.234Z",
  "schema_version": "1.0",
  "trace_id": "uuid",
  "idempotency_key": "sha256(posting_id|model_version)",
  // Categorisation payload
  "posting_id": "p_001",
  "customer_id": "c_001",
  "jurisdiction": "NZ",
  "merchant_name": "Caltex – Parnell",            // nullable (k-6)
  "logo_url":      "https://<cdn>/logos/caltex.png",  // nullable (k-6); CDN-prefixed
  "merchant_location_map_url": "https://<cdn>/maps/caltex-parnell.jpg",  // nullable (k-6)
  "mcc":           "5541",                        // nullable
  "category_l1":   "TRANSPORT",
  "category_l2":   "FUEL",                        // nullable (k-6)
  "categorisation_confidence": "0.9700",          // decimal-string (NUMBER(5,4))
  "model_version": "v2026-05-08-abc123"
}

Subscribers (defined in their own modules' Pulumi projects): - SD08 dashboard insight feed — FR-240 spend-by-category refresh - SD01 enrichment write-back — FR-238 populate merchant_canonical, merchant_logo_id, category on the source posting

Consumed

None. FR-239 customer corrections come via MOD-042 CDC, NOT EventBridge.


Operational

Warehouse sizing

The 1-minute target_lag (k-1) means the DT refreshes every minute on NONPROD_WH (dev/uat) or PROD_DBT_WH (prod). Per dbt-snowflake Dynamic Table semantics, an INCREMENTAL refresh that finds zero upstream changes is sub-second; a refresh with new rows scales with the change volume.

Rough capacity estimate at 50 transactions/minute steady-state: - Per-refresh featurise + categorise: ~50 UDF calls × ~10ms = ~500ms - Per-refresh DT MERGE: ~100ms write - Total compute time: ~30s per hour ≈ 0.04% of one X-Small warehouse-hour

Even at 10× that (peak/burst), the DT cost is negligible against the existing dbt build cost.

Drift alert response (per k-7)

The alert_categorisation_model_drift alert fires hourly when yesterday's auto_rate_pct < 85.00 (NFR-022 floor). When it fires:

  1. Inspect the v_category_metrics_daily row that fired the alert. Single-day blip vs multi-day trend?
  2. Check correction_rate_pct in the same view — drift in auto-rate often correlates with a spike in customer corrections (the leading indicator).
  3. Look at categorisation_training for recent label distribution shift (e.g. unusual concentration in new merchant patterns or a new MCC code seen for the first time).
  4. Trigger a manual retrain via STAGE=<env> bash scripts/pre-dcm-plan.sh (or equivalently pnpm pre-dcm-plan with STAGE exported) if the trend is sustained — don't wait for the weekly schedule. The script runs the full train + upload + post-model chain.

If the drift is caused by upstream data quality (e.g. CDC paused, MOD-042 stream behind) the alert is misleading; check v_category_metrics_daily.total_transactions first.

Curating the merchant directory

Add a new location (per scope ruling k-4):

  1. Edit MOD-041-categorisation-model/seeds/merchant_directory.csv — add the new location row with lat/lng populated and map_image_url blank.
  2. Run MAPBOX_ACCESS_TOKEN=… pnpm merchant:generate-maps --stage=dev --provider=mapbox (operator's local environment).
  3. Review seeds/merchant_directory.patch.csv — copy the new map_image_url values back into seeds/merchant_directory.csv.
  4. Commit. Next CI dispatch re-seeds the directory and the new location is live.

Logo uploads are manual — the merchant-curation team uploads logos/<chain>.png directly to the S3 bucket (/bank/{env}/risk-platform/merchant-assets/s3-bucket SSM points at bank-merchant-assets-{env}). Logos rarely change so no CI automation.

Why the drift alert routes to MOD-076 specifically

Per scope ruling k-7: same pattern as MOD-030 / MOD-065 DLQ depth alarms. MOD-076 owns the SNS alarm-intake topic that fans out to the on-call PagerDuty + Slack channels. Per-module SNS topics would fragment the on-call experience.