
ADR-017: Transaction categorisation and merchant enrichment — in-house ML vs external API

Status: Accepted
Date: 2026-04-10
Deciders: CTO, Head of Data, Head of Product
Affects repos: bank-risk-platform, bank-core

Status

Accepted — 2026-04-10

Context

Every transaction needs a clean merchant name and a spend category to be useful to customers. The raw acquirer data ("WLWRTHS ST 0442 AUCKLAND NZ", "MKDNLDS #0841 AU VISA") is meaningless without enrichment. The question is whether to use an external enrichment API (Ntropy, Mastercard Small Business Insights, Plaid enrichment) or to build an in-house ML model in Snowflake.

This is a genuine use case for a model that learns from use. The categorisation model improves every time a customer corrects a category, because each correction is a labelled training example. Over time, the bank accumulates proprietary transaction data that external providers cannot replicate. The model becomes a data asset.


Options evaluated

Option A — External enrichment API

Call a third-party API on each transaction to get a clean merchant name, logo, and category. Providers include Ntropy, Mastercard Small Business Insights, Plaid Transactions (enrichment), or equivalent.

Strengths: Pre-trained on broad transaction data; fast to implement; handles NZ/AU merchant names from day one without cold start.

Weaknesses: Per-transaction API cost that scales with volume; data shared with a third party; no proprietary advantage; categories defined by the vendor, not the bank; model does not learn from customer behaviour.

Option B — In-house ML model in Snowflake

Two models: a merchant normaliser that maps raw acquirer strings to canonical merchant names, and a category classifier that assigns spend categories. Both trained in Snowflake Cortex, retrained weekly on customer correction signals.

Strengths: Proprietary data advantage that compounds over time; no per-transaction API cost at scale; model learns from NZ/AU customer behaviour; full control over category taxonomy; customer corrections feed the model; data stays in the bank.

Weaknesses: Cold start at launch — no NZ/AU transaction history. Must be seeded. Weekly retraining cycle (not hourly).

Option C — External API first, in-house model later

Start with an external API for the initial period while transaction data accumulates. Switch to the in-house model when the training set is large enough to outperform the external provider.


The two-model architecture (Option B / C)

The categorisation system is built as two distinct models with different problems to solve:

Model 1: Merchant normaliser

Problem type: String → canonical entity lookup.

Approach: Rule engine first (regex patterns for known major merchants), then embedding similarity search against a merchant lookup table for unknowns. The rule engine handles 80% of transactions instantly. The embedding model handles the long tail.

Examples:

  • "WLWRTHS ST 0442 AUCKLAND NZ" → Rule match → "Woolworths NZ" (confidence: 1.0)
  • "MKDNLDS #0841 AU VISA" → Fuzzy match → "McDonald's Australia" (confidence: 0.94)
  • "BRWN BREAD CAFE PTY MELB" → Embedding match → "Brown Bread Cafe" (confidence: 0.81)
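
The rule-first, similarity-fallback flow can be sketched as below. This is a minimal illustration, not the production normaliser: the rule pattern, the lookup table entries, and the function names are hypothetical, and character-trigram cosine similarity stands in for a real embedding model. The similarity scores it produces are not comparable to the confidences quoted in the examples.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical rule: compiled pattern -> canonical merchant name.
RULES = [
    (re.compile(r"^WLWRTHS\b.*\bNZ$"), "Woolworths NZ"),
]

# Hypothetical merchant lookup table for the long tail.
MERCHANT_TABLE = ["Brown Bread Cafe", "McDonald's Australia", "Blue Bottle Bakery"]


def trigrams(text):
    """Count character trigrams over the letters of the text, lowercased."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    return Counter(letters[i:i + 3] for i in range(len(letters) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def normalise(raw):
    """Return (canonical_name, confidence, method) for a raw acquirer string."""
    # Step 1: rule engine. A rule hit is treated as certain.
    for pattern, name in RULES:
        if pattern.search(raw):
            return name, 1.0, "rule"
    # Step 2: similarity search against the lookup table (embedding stand-in).
    query = trigrams(raw)
    score, best = max((cosine(query, trigrams(m)), m) for m in MERCHANT_TABLE)
    return best, round(score, 2), "similarity"
```

Calling `normalise("WLWRTHS ST 0442 AUCKLAND NZ")` takes the rule path, while `normalise("BRWN BREAD CAFE PTY MELB")` falls through to the similarity search and resolves to "Brown Bread Cafe" in this toy table.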

Model 2: Category classifier

Problem type: Multi-class classification into ~25 spend categories.

Model type: XGBoost gradient-boosted classifier on tabular features. Fast, explainable, works well on tabular transaction data. Interpretable — a given category assignment can be traced to which features drove it.

Feature vector per transaction:

  • MCC (most predictive single feature)
  • Transaction amount
  • Hour of day
  • Day of week
  • Channel (contactless, chip, online)
  • Country
  • Merchant token (from Model 1 output)

Examples:

  • MCC 5411, NZD 47.30, Saturday morning, contactless, NZ, "woolworths" → Groceries (confidence: 0.97)
  • MCC 5812, AUD 142.00, Friday evening, chip, AU, "restaurant" → Dining out (confidence: 0.91)
  • MCC 7399, NZD 890.00, Tuesday afternoon, online, NZ, unknown → Business services? (confidence: 0.61; best guess applied with a "Does this look right?" nudge)
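
As a rough sketch of how the feature vector might be assembled before it reaches the classifier, the snippet below flattens one transaction into a feature record. The `Transaction` type, its field names, and the `"unknown"` placeholder are assumptions made for illustration; the ADR does not define the actual schema or encoding.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """Hypothetical input record; field names are illustrative."""
    mcc: str
    amount: float
    timestamp: datetime            # assumed to be in the customer's local time
    channel: str                   # "contactless", "chip" or "online"
    country: str
    merchant_token: Optional[str]  # Model 1 output, None when unresolved


def to_features(txn: Transaction) -> dict:
    """Flatten a transaction into the tabular features listed above."""
    return {
        "mcc": txn.mcc,
        "amount": txn.amount,
        "hour_of_day": txn.timestamp.hour,
        "day_of_week": txn.timestamp.weekday(),  # Monday = 0 ... Sunday = 6
        "channel": txn.channel,
        "country": txn.country,
        "merchant_token": txn.merchant_token or "unknown",
    }
```

Categorical fields such as `mcc` and `channel` would still need encoding before training; that step is left out here because the ADR does not specify it.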


Confidence routing

Rather than apply every categorisation automatically, the model confidence score drives the customer experience:

Confidence | Handling
≥ 0.85 | Category applied silently. No customer prompt.
0.60 to < 0.85 | Best guess applied. "Does this look right?" nudge shown.
< 0.60 | Transaction shown as "Other". Customer prompted to categorise. The customer's choice becomes a training label.

This design serves two purposes: it gives customers accurate categories without friction, and it generates high-quality training labels for the cases where the model is uncertain. Low-confidence transactions that customers categorise become the most valuable training data — they are the edge cases where the model needs to improve.
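
The routing rule reduces to two threshold checks. The sketch below uses the thresholds from the table; the `Routing` type, the `wants_label` flag, and the wording of the low-confidence prompt are hypothetical.

```python
from typing import NamedTuple, Optional

AUTO_APPLY_THRESHOLD = 0.85  # at or above: apply silently
NUDGE_THRESHOLD = 0.60       # at or above: apply best guess and ask


class Routing(NamedTuple):
    display_category: str
    prompt: Optional[str]  # text shown to the customer, if any
    wants_label: bool      # True when the customer's answer should feed training


def route(category: str, confidence: float) -> Routing:
    """Map a prediction and its confidence to the customer-facing handling."""
    if confidence >= AUTO_APPLY_THRESHOLD:
        return Routing(category, None, False)
    if confidence >= NUDGE_THRESHOLD:
        return Routing(category, "Does this look right?", True)
    # Below the nudge threshold the guess is hidden and the customer is asked.
    return Routing("Other", "Please choose a category", True)
```

Keeping the thresholds as named constants means they can be tuned against the auto-categorisation and correction-rate targets without touching the routing logic.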


Training pipeline

Bootstrap (pre-launch)

The cold start problem is solvable before the first customer transaction is processed. The preferred approach is to arrive at go-live with a trained model, not a blank one. Seed the training set from the following sources, in priority order:

  1. Pre-launch imported transaction data: Historical NZ/AU transaction data imported and labelled prior to go-live. This is the highest-value bootstrap source — real transactions from real NZ/AU merchants, labelled with categories, loaded into the Snowflake training table before the production model is trained. Sources may include: anonymised synthetic data derived from public transaction datasets, historical data from founding team members' accounts (with consent), or a structured manual import exercise run during UAT. The volume needed to train a usable day-one model is achievable without live customers — 50,000–100,000 labelled NZD/AUD transactions is a realistic pre-launch target.

  2. MCC code → category mapping: ~300 MCC codes map directly to categories. MCC 5411 is always groceries. This gives deterministic coverage for the most common merchants regardless of training data volume.

  3. Manual curation of top-500 NZ/AU merchants: Hand-label the most common merchants before launch. Woolworths, Countdown, Pak'nSave, Coles, McDonald's, BP, ANZ ATM etc. This is fast to produce and covers the majority of day-one transaction volume.

  4. Open banking import labels (AU, CDR): Customers who connect via CDR bring their existing transaction history with categories. This continues to seed the model with real labelled examples post-launch.
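
The deterministic MCC mapping in source 2 can be applied to any unlabelled transaction set to produce seed labels. The sketch below assumes a dictionary-based mapping with two illustrative entries; the full mapping of roughly 300 codes is not part of this ADR, and the function names are hypothetical.

```python
from typing import Optional

# Illustrative entries only; the full mapping (~300 codes) is not shown in this ADR.
MCC_TO_CATEGORY = {
    "5411": "Groceries",   # grocery stores and supermarkets
    "5812": "Dining out",  # eating places and restaurants
}


def seed_label(mcc: str) -> Optional[str]:
    """Return a deterministic category for a mapped MCC, else None."""
    return MCC_TO_CATEGORY.get(mcc)


def seed_training_rows(transactions):
    """Yield (transaction, category) pairs for rows the MCC map can label."""
    for txn in transactions:
        category = seed_label(txn["mcc"])
        if category is not None:
            yield txn, category
```

Transactions with an unmapped MCC (for example 7399) are skipped rather than guessed, so they remain available for manual curation or customer labelling.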

Continuous (ongoing)

Every customer correction writes a labelled record to a training table in Snowflake: raw transaction features + the corrected category as ground truth. Agent overrides in back office are weighted more heavily (higher label quality). Low-confidence predictions that the customer confirms become positive labels.
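
One way to represent these labelled records is a small typed structure carrying a per-source weight. The weight values below are assumptions chosen only to show the idea; the ADR states that agent overrides count for more but does not say by how much. The source names and the `TrainingRecord` type are likewise hypothetical.

```python
from dataclasses import dataclass

# Assumed relative weights; only the ordering (agent > customer) comes from the ADR.
LABEL_WEIGHTS = {
    "agent_override": 2.0,
    "customer_correction": 1.0,
    "customer_confirmation": 1.0,
}


@dataclass(frozen=True)
class TrainingRecord:
    features: dict  # raw transaction features at prediction time
    label: str      # ground-truth category
    source: str     # where the label came from
    weight: float   # sample weight passed to the training job


def make_training_record(features: dict, label: str, source: str) -> TrainingRecord:
    """Build a weighted training row from a correction, override or confirmation."""
    if source not in LABEL_WEIGHTS:
        raise ValueError(f"unknown label source: {source}")
    return TrainingRecord(features, label, source, LABEL_WEIGHTS[source])
```

Gradient-boosted classifiers generally accept per-row sample weights at training time, which is where the `weight` field would be consumed.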

After 30 days at reasonable customer volumes, there will be tens of thousands of labelled examples across NZD and AUD transactions.

Weekly retrain

Snowflake Cortex ML handles the retrain as a scheduled task — no separate MLOps infrastructure. The training job pulls the full labelled dataset, trains a new XGBoost classifier, evaluates it against a held-out validation set, and registers the new model version.

Champion / challenger governance

New model versions are not automatically promoted. The challenger is evaluated against the current champion on:

  • Accuracy on held-out test set
  • Auto-categorisation rate (proportion above 0.85 confidence)
  • Customer correction rate on a shadow-scored sample

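A challenger is promoted only if it beats the champion on all three metrics: higher accuracy, higher auto-categorisation rate, and lower customer correction rate. Failed challengers are logged and the current champion continues to serve. This is the model governance standard required by policy DT-005.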
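
The promotion gate can be expressed as a single comparison over the three metrics. In this sketch, "beats" is assumed to mean strictly better, with a lower correction rate counting as better; the `ModelMetrics` type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelMetrics:
    accuracy: float                  # on the held-out test set
    auto_categorisation_rate: float  # share of predictions at or above 0.85
    correction_rate: float           # on the shadow-scored sample


def should_promote(champion: ModelMetrics, challenger: ModelMetrics) -> bool:
    """Promote only when the challenger is strictly better on all three metrics."""
    return (
        challenger.accuracy > champion.accuracy
        and challenger.auto_categorisation_rate > champion.auto_categorisation_rate
        and challenger.correction_rate < champion.correction_rate  # lower is better
    )
```

A challenger that improves accuracy but raises the correction rate is rejected under this rule, which matches the all-three requirement.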


Model performance targets (steady state)

Metric | Target
Accuracy (correct category) | ≥ 92%
Auto-categorisation rate | ≥ 87% of transactions above 0.85 threshold
Customer correction rate | ≤ 4% of categorised transactions corrected
Enrichment latency (p99) | ≤ 5 seconds from transaction event to enriched record
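
The first two targets can be computed offline from a held-out set of predictions. The sketch below assumes each prediction is a `(predicted, actual, confidence)` tuple; the function names are hypothetical. Correction rate and latency depend on production data, so they are left out.

```python
AUTO_APPLY_THRESHOLD = 0.85

# Offline-checkable targets taken from the table above.
TARGETS = {"accuracy": 0.92, "auto_categorisation_rate": 0.87}


def evaluate(predictions):
    """Compute accuracy and auto-categorisation rate.

    `predictions` is a list of (predicted_category, actual_category, confidence).
    """
    if not predictions:
        raise ValueError("no predictions to evaluate")
    total = len(predictions)
    correct = sum(1 for predicted, actual, _ in predictions if predicted == actual)
    auto = sum(1 for _, _, conf in predictions if conf >= AUTO_APPLY_THRESHOLD)
    return {
        "accuracy": correct / total,
        "auto_categorisation_rate": auto / total,
    }


def meets_targets(metrics):
    """True when every offline metric reaches its target."""
    return all(metrics[name] >= target for name, target in TARGETS.items())
```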

Recommendation

In-house model from day one, with an external API as a conditional fallback only if the pre-launch bootstrap is insufficient.

The preferred path is to arrive at go-live with a trained in-house model, having completed the pre-launch import and labelling exercise during the UAT phase. If the bootstrap dataset achieves the minimum training volume (50,000–100,000 labelled transactions) and the day-one model meets the accuracy targets in the table above on the held-out validation set, no external API is needed at launch.

If the pre-launch bootstrap does not produce a model that meets the accuracy threshold — or if the auto-categorisation rate is materially below target on day one — an external enrichment API (Ntropy or equivalent) may be activated as a time-limited fallback for unknown merchants the in-house model cannot classify with confidence ≥ 0.60. This is an operational decision made at go/no-go review, not an architectural assumption.

Scenario | Approach
Bootstrap achieves accuracy target | In-house model only from launch
Bootstrap falls short | External API fallback for low-confidence transactions; in-house model promoted as soon as accuracy target is met
Long term (all scenarios) | In-house model only, built on the proprietary NZ/AU data asset

The enrichment pipeline is designed as a pluggable consumer so the data source can be switched without changing the transaction schema or the downstream write-back. Activating or deactivating the external API fallback is a configuration change, not a deployment.
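
A pluggable consumer of this kind might look like the sketch below, where the fallback is controlled by a flag that would be read from configuration. The function names, the `(category, confidence)` classifier signature, and the `source` tag on the result are assumptions for illustration; the 0.60 cut-off is the one stated for the fallback above.

```python
from typing import Callable, Optional, Tuple

Classifier = Callable[[str], Tuple[str, float]]  # raw descriptor -> (category, confidence)
Result = Tuple[str, float, str]                  # (category, confidence, source)


def build_enricher(
    in_house: Classifier,
    external: Optional[Classifier] = None,
    fallback_enabled: bool = False,  # would come from configuration, not code
    min_confidence: float = 0.60,
) -> Callable[[str], Result]:
    """Return an enrich() function with the fallback wired in or out."""

    def enrich(raw_descriptor: str) -> Result:
        category, confidence = in_house(raw_descriptor)
        # Only consult the external provider when the in-house result is too weak.
        if fallback_enabled and external is not None and confidence < min_confidence:
            ext_category, ext_confidence = external(raw_descriptor)
            return ext_category, ext_confidence, "external"
        return category, confidence, "in_house"

    return enrich
```

Because both paths return the same result shape, the downstream write-back does not need to know which source produced the category.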


Principles alignment

Principle Assessment Notes
AP-001 KISS Two models, clear separation of concerns, Snowflake Cortex removes MLOps overhead
AP-006 Cost effective Per-transaction API cost eliminated at scale; Snowflake compute is the only cost
AP-007 Evolution Model improves continuously; data advantage compounds over time

Perspectives

Perspective Assessment Notes
Strategy Proprietary NZ/AU model is a long-term data moat
Evolution Improves with every customer correction
Cost Removes per-transaction API cost at scale
Capability Confidence routing gives better customer experience than binary categories
Resource ~ Requires ML engineering capability to build and maintain

See perspectives.md for how to use these evaluation lenses.


Relevant viewpoints

  • System viewpoint — Enrichment pipeline: EventBridge event (bank.transactions.authorised) → enrichment Lambda → Model 1 + Model 2 → write-back to Postgres
  • Information viewpoint — Training table schema; model version registry; merchant lookup table structure
  • Operational viewpoint — Model performance monitoring; retraining cadence; champion/challenger evaluation dashboard

See viewpoints.md for guidance on producing these viewpoints.



Signoff record

Date | Name | Role | Status
2026-04-10 | Ross Millen | CTO | Approved
2026-04-10 | Ross Millen | Head of Architecture | Approved
2026-04-10 | Ross Millen | Head of Data | Approved

Capabilities

Capability | Description | Relationship
CAP-012 | Merchant name enrichment & logo | enabled: in-house merchant normaliser maps raw acquirer strings to canonical names
CAP-013 | Spend categorisation (auto + manual) | enabled: XGBoost category classifier with confidence routing and correction loop
CAP-019 | Monthly spending summary & trends | enabled: category accuracy is a prerequisite for meaningful spend analytics

ADR | Title | Relationship
ADR-002 | Snowflake as the analytics and risk compute platform | Snowflake Cortex is the ML training and serving environment
ADR-016 | Transaction history design: data model, enrichment, and UX | Enrichment pipeline architecture defined in ADR-016; ML decision is this ADR

Compiled 2026-05-22 from source/entities/adrs/ADR-017.yaml