
Alert thresholds

Resolves: GAP-D09 — No alert threshold specification.

This document specifies all alert thresholds for the platform observability layer (MOD-076). MOD-076 is not yet built, but thresholds are specified here so the module can be configured correctly at build time and so on-call engineers understand what each alert means before it fires.

All alerts route via SNS to the on-call notification channel. P1 and P2 alerts also page the on-call engineer via PagerDuty, and P1 alerts additionally page the Head of Technology. P1 alerts with regulatory implications also notify the Chief Compliance Officer.

Related: post-deployment checklist · module activation matrix


Severity and escalation

Severity Response time Who Required action
P1 15 minutes On-call engineer (paged) + Head of Technology Immediate investigation. Incident record opened immediately. If the alert has regulatory implications and is unresolved after 1 hour, regulator contact procedures begin (see note below).
P2 1 hour On-call engineer (paged) Investigation and resolution plan within 1 hour. Incident record opened.
P3 Next business day Engineering team Ticket created and triaged. No paging.

Regulatory notifications: P1 alerts marked [REG] below also trigger a compliance notification to the Chief Compliance Officer at the time of alert. If the incident is not resolved within 1 hour, the compliance team initiates regulator contact procedures per the incident response policy.

Auto-mitigation: Where Lambda auto-retry or other automated responses are configured, the alert still fires at the threshold — mitigation is supplementary, not a substitute for investigation.
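
The routing rules above can be sketched as a small lookup. This is a minimal illustration, not the MOD-076 implementation: the target strings (such as `pagerduty:on-call-engineer`) and the `Routing` type are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Response times from the severity table, in minutes. P3 is "next business
# day", so it has no minute-level deadline and maps to None.
RESPONSE_MINUTES = {"P1": 15, "P2": 60, "P3": None}


@dataclass(frozen=True)
class Routing:
    targets: List[str]                              # notified at alert time
    response_minutes: Optional[int]                 # target response time
    regulator_contact_after_minutes: Optional[int]  # set only for P1 [REG]


def route(severity: str, regulatory: bool = False) -> Routing:
    """Map a severity and [REG] flag to notification targets (hypothetical names)."""
    if severity not in RESPONSE_MINUTES:
        raise ValueError(f"unknown severity: {severity}")
    targets = ["sns:on-call-channel"]  # every alert routes via SNS
    if severity == "P1":
        targets += ["pagerduty:on-call-engineer", "pagerduty:head-of-technology"]
    elif severity == "P2":
        targets.append("pagerduty:on-call-engineer")
    else:
        targets.append("ticket:engineering-team")  # P3: ticket only, no paging
    regulator_deadline = None
    if severity == "P1" and regulatory:
        targets.append("notify:chief-compliance-officer")
        regulator_deadline = 60  # regulator contact if unresolved after 1 hour
    return Routing(targets, RESPONSE_MINUTES[severity], regulator_deadline)
```

For example, `route("P1", regulatory=True)` includes the Chief Compliance Officer target and a 60-minute regulator deadline, while `route("P3")` creates a ticket only.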


Latency alerts

ALERT-LATENCY-001 — Customer-facing API p99 latency

Field Value
Metric source AWS/ApiGateway namespace, metric IntegrationLatency, dimension ApiName=bank-customer-api
Threshold p99 > 1000ms
Evaluation period p99 statistic over a 5-minute period
Severity P2
Auto-mitigation None
Escalation P2: on-call engineer paged within 1 hour
Notes NFR target is ≤ 1000ms p99 for customer-facing APIs. Sustained breach indicates Lambda cold-start accumulation, database connection pool exhaustion, or upstream provider degradation.
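
One way to express this alert is as a parameter set for the CloudWatch `PutMetricAlarm` API. The sketch below only builds the parameters. The SNS topic ARN argument, the single evaluation period, and the `TreatMissingData` choice are assumptions for illustration, not decisions made in this specification.

```python
from typing import Any, Dict


def latency_alarm_params(sns_topic_arn: str) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters for ALERT-LATENCY-001."""
    return {
        "AlarmName": "ALERT-LATENCY-001",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "IntegrationLatency",
        "Dimensions": [{"Name": "ApiName", "Value": "bank-customer-api"}],
        # Percentiles use ExtendedStatistic; the plain Statistic field only
        # accepts SampleCount, Average, Sum, Minimum, and Maximum.
        "ExtendedStatistic": "p99",
        "Period": 300,            # 5 minutes, in seconds
        "EvaluationPeriods": 1,   # assumption: one breaching period fires
        "Threshold": 1000.0,      # milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no traffic is not a breach
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.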

ALERT-LATENCY-002 — Internal service call p99 latency

Field Value
Metric source bank/platform/latency custom namespace, metric InternalApiP99, dimension per system domain
Threshold p99 > 200ms
Evaluation period Average over 5 minutes
Severity P3
Auto-mitigation None
Escalation P3: ticket created next business day
Notes NFR target is ≤ 200ms p99 for internal service calls. P3 rather than P2 because customers are not directly impacted; the breach will typically surface as a P2 latency alert downstream before causing customer impact.

ALERT-LATENCY-003 — Payment processing p99 latency

Field Value
Metric source bank/payments/latency custom namespace, metric PaymentProcessingP99
Threshold p99 > 500ms
Evaluation period Average over 5 minutes
Severity P2
Auto-mitigation None
Escalation P2: on-call engineer paged within 1 hour
Notes NFR target is ≤ 500ms p99 for payment processing. Payments are a critical customer-facing path; a sustained p99 breach indicates a systemic issue.

ALERT-LATENCY-004 — Core ledger posting p99 latency

Field Value
Metric source bank/core/latency custom namespace, metric LedgerPostingP99
Threshold p99 > 200ms
Evaluation period Average over 5 minutes
Severity P2
Auto-mitigation None
Escalation P2: on-call engineer paged within 1 hour
Notes The ledger is the system of record for all financial movements. Posting latency above 200ms propagates to every downstream system. Investigate Neon connection pool and Lambda concurrency first.

Error rate alerts

ALERT-ERROR-001 — Lambda error rate

Field Value
Metric source AWS/Lambda namespace, metric Errors and Invocations, per function
Threshold Error rate > 1% (Errors / Invocations)
Evaluation period Sustained over 5 minutes
Severity P2
Auto-mitigation Lambda retries on async invocations (2 retries with backoff). Dead-letter queue captures events that exhaust retries.
Escalation P2: on-call engineer paged within 1 hour
Notes This alert fires per Lambda function: a single function exceeding a 1% error rate triggers it. Identify the function from the CloudWatch dimension, then investigate its logs in CloudWatch Logs and its traces in X-Ray.
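
Because the threshold is a ratio of two metrics, the alarm needs a metric math expression rather than a single metric. A minimal sketch of the `PutMetricAlarm` parameters follows. The alarm naming scheme and the `TreatMissingData` setting are assumptions.

```python
from typing import Any, Dict


def lambda_error_rate_alarm(function_name: str) -> Dict[str, Any]:
    """Build PutMetricAlarm parameters for ALERT-ERROR-001 on one function."""

    def lambda_sum(metric_id: str, metric_name: str) -> Dict[str, Any]:
        # Input series for the expression; not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"ALERT-ERROR-001-{function_name}",  # hypothetical naming scheme
        "Metrics": [
            lambda_sum("errors", "Errors"),
            lambda_sum("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,  # the alarm evaluates this series
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": 1.0,  # percent
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: idle functions do not alarm
    }
```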

ALERT-ERROR-002 — Payment failure rate [REG]

Field Value
Metric source bank/payments/errors custom namespace, metric PaymentFailureRate
Threshold Failure rate > 0.5%
Evaluation period Over 15 minutes
Severity P1
Auto-mitigation None
Escalation P1: on-call engineer and Head of Technology paged within 15 minutes. Chief Compliance Officer notified at alert time.
Notes Payments are a critical path. A failure rate above 0.5% over 15 minutes indicates a systemic issue, not isolated retries. If customers cannot make payments, this may trigger obligations under the PSPA (NZ) or PSA (AU).

ALERT-ERROR-003 — KYC verification service error rate

Field Value
Metric source bank/kyc/errors custom namespace, metric VerificationErrorRate
Threshold Error rate > 5%
Evaluation period Over 5 minutes
Severity P2
Auto-mitigation None
Escalation P2: on-call engineer paged within 1 hour
Notes A high eIDV error rate typically indicates provider API degradation rather than a platform bug. Check eIDV provider status page and Secrets Manager API key validity. At > 5% error rate, customer onboarding is effectively impaired.

ALERT-ERROR-004 — Authentication error rate

Field Value
Metric source AWS/Cognito namespace, metric SignInSuccesses per user pool (each sign-in attempt is a datapoint of 1 for success or 0 for failure, so failures are SampleCount minus Sum)
Threshold Authentication error rate > 2%
Evaluation period Over 5 minutes
Severity P2
Auto-mitigation Cognito account lockout after 5 consecutive failures per user.
Escalation P2: on-call engineer paged within 1 hour. If the pattern suggests a credential stuffing or brute-force attack, escalate to P1 and notify security team.
Notes A 2% authentication failure rate may indicate either legitimate service degradation or an authentication attack. Correlate with ALERT-SEC-001 (failed auth rate per IP) to distinguish the two.

Infrastructure alerts

ALERT-INFRA-001 — EventBridge DLQ depth

Field Value
Metric source AWS/SQS namespace, metric ApproximateNumberOfMessagesVisible, per DLQ
Threshold Any depth > 0
Evaluation period Any single datapoint
Severity P2
Auto-mitigation None. Events in DLQ require manual investigation and replay.
Escalation P2: on-call engineer paged within 1 hour
Notes Any event reaching a DLQ represents a delivery failure. Inspect the DLQ message to identify which Lambda target failed and why. Do not delete DLQ messages without understanding the root cause — they may be required for audit.

ALERT-INFRA-002 — Neon connection pool utilisation

Field Value
Metric source bank/platform/database custom namespace, metric PgBouncerPoolUtilisation, dimension per database
Threshold Pool utilisation > 80%
Evaluation period Sustained over 10 minutes
Severity P2
Auto-mitigation None
Escalation P2: on-call engineer paged within 1 hour
Notes At > 80% pool utilisation, new connections will queue and latency will degrade. Investigate Lambda concurrency growth and whether connection acquisition timeouts are occurring. May require scaling the PgBouncer pool configuration.

ALERT-INFRA-003 — S3 Glacier retrieval failure

Field Value
Metric source AWS/S3 namespace, metric GetRequests 4xx/5xx errors for Glacier storage class
Threshold Any retrieval failure
Evaluation period Any single datapoint
Severity P3
Auto-mitigation None
Escalation P3: ticket created next business day
Notes Glacier failures are typically for archival data retrieval (audit logs, historical statements). Not time-critical unless related to a regulatory request, in which case escalate to P2.

ALERT-INFRA-004 — Lambda concurrency approaching limit

Field Value
Metric source AWS/Lambda namespace, metric ConcurrentExecutions vs reserved concurrency per function group
Threshold Concurrent executions > 80% of reserved concurrency
Evaluation period Average over 5 minutes
Severity P2
Auto-mitigation None. Lambda will throttle beyond the reserved concurrency limit.
Escalation P2: on-call engineer paged within 1 hour
Notes Approaching the concurrency limit means throttling is imminent, which will cause 429 errors for customers. Investigate the traffic pattern driving the spike and consider whether reserved concurrency needs to be increased for the affected function group.
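
The threshold is relative to each function group's reserved concurrency, so the absolute trigger point differs per group. A minimal sketch of the check, with a hypothetical function name:

```python
def concurrency_breach(concurrent: float, reserved: int, ratio: float = 0.80) -> bool:
    """Return True when concurrent executions exceed `ratio` of reserved concurrency."""
    if reserved <= 0:
        raise ValueError("reserved concurrency must be positive")
    return concurrent / reserved > ratio
```

With 50 reserved executions, the alert would fire at an average of 41 concurrent executions but not at 40.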

Financial integrity alerts

ALERT-FIN-001 — Balance reconciliation discrepancy [REG]

Field Value
Metric source bank/core/reconciliation custom namespace, metric DiscrepancyCount
Threshold Any discrepancy (> 0)
Evaluation period Any single datapoint from the reconciliation engine
Severity P1
Auto-mitigation None. Any discrepancy must be investigated by a human.
Escalation P1 immediate: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified at alert time. Incident record opened.
Notes Zero-tolerance NFR. A balance discrepancy means the double-entry invariant has been violated. This is a regulatory event and may require notification to the RBNZ or APRA depending on magnitude and cause. Do not clear the alert without a root-cause explanation signed off by engineering and compliance.

ALERT-FIN-002 — AML engine not receiving posting events [REG]

Field Value
Metric source bank/aml/event-lag custom namespace, metric PostingToAmlEventLagSeconds
Threshold Any posting not received by AML engine within 5 minutes of ledger commit
Evaluation period Calculated per posting; alert fires on first miss
Severity P1
Auto-mitigation None
Escalation P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified.
Notes AML monitoring must screen all transactions. A gap in event delivery means transactions are clearing without AML review — a direct AML/CFT compliance breach. The gap must be closed and missed events replayed before the alert can be cleared.
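
The per-posting check can be sketched as follows. This assumes the ledger commit times and AML receipt times are available as two mappings keyed by posting ID. The function name and data shapes are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

MAX_LAG = timedelta(minutes=5)


def missed_postings(
    committed: Dict[str, datetime],
    received: Dict[str, datetime],
    now: datetime,
) -> List[str]:
    """Return IDs of postings the AML engine did not receive within MAX_LAG."""
    missed = []
    for posting_id, commit_time in committed.items():
        deadline = commit_time + MAX_LAG
        received_at = received.get(posting_id)
        if received_at is None:
            # Not yet received: a miss only once the deadline has passed.
            if now > deadline:
                missed.append(posting_id)
        elif received_at > deadline:
            # Received, but later than the allowed lag.
            missed.append(posting_id)
    return sorted(missed)
```

Any non-empty result corresponds to the "first miss" condition. The returned IDs are also the set that would need replaying before the alert is cleared.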

ALERT-FIN-003 — CDC pipeline lag

Field Value
Metric source AWS/Firehose namespace, metric DeliveryToS3.DataFreshness for the CDC delivery stream
Threshold Lag > 15 minutes
Evaluation period Average over 5 minutes
Severity P2
Auto-mitigation Kinesis Firehose retries delivery automatically.
Escalation P2: on-call engineer paged within 1 hour
Notes CDC lag above 15 minutes means regulatory reporting and Snowflake data are stale. If the lag persists and a regulatory report is due, escalate to P1.

Regulatory and compliance alerts

ALERT-COMP-001 — Sanctions list stale [REG]

Field Value
Metric source bank/aml/sanctions custom namespace, metric SanctionsListAgeHours
Threshold > 25 hours since last successful refresh
Evaluation period Any single datapoint
Severity P1
Auto-mitigation Sanctions list refresh runs daily. If the scheduled refresh fails, the module will retry up to 3 times before the alert fires.
Escalation P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified at alert time.
Notes The 25-hour threshold gives 1 hour of slack over the 24-hour refresh schedule. A stale sanctions list means customer screening may be operating against an outdated dataset — a direct AML/CFT compliance risk. Manual refresh must be triggered immediately.
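
The age metric and the staleness check reduce to simple arithmetic on the last successful refresh time. A minimal sketch, with hypothetical function names:

```python
from datetime import datetime, timezone


def sanctions_list_age_hours(last_refresh: datetime, now: datetime) -> float:
    """Hours elapsed since the last successful sanctions list refresh."""
    return (now - last_refresh).total_seconds() / 3600


def is_stale(age_hours: float, threshold_hours: float = 25.0) -> bool:
    """True when the list is older than the 25-hour threshold."""
    return age_hours > threshold_hours
```

A list refreshed at 00:00 UTC is still within tolerance at 00:30 the next day (24.5 hours) and stale at 01:30 (25.5 hours).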

ALERT-COMP-002 — AML alert queue backlog

Field Value
Metric source bank/aml/cases custom namespace, metric AlertQueueDepth
Threshold Queue depth > 100
Evaluation period Average over 5 minutes
Severity P2
Auto-mitigation None
Escalation P2: on-call engineer paged within 1 hour. Compliance team also notified (not just engineering).
Notes A queue of > 100 unreviewed AML alerts indicates either a spike in suspicious activity or a processing backlog. Either way, the compliance team needs to be aware. The on-call engineer investigates whether the queue depth is a processing failure; the compliance team reviews the alert content.

ALERT-COMP-003 — Regulatory report not submitted within SLA [REG]

Field Value
Metric source bank/reporting/schedule custom namespace, metric ReportOverdueSLA per report
Threshold Any report exceeds its SLA deadline
Evaluation period Any single datapoint
Severity P1
Auto-mitigation None
Escalation P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified immediately — they own the regulator relationship.
Notes SLA deadlines are set per report in the reporting module configuration. Late submission to a regulator (RBNZ, APRA, AUSTRAC, FIU) is a notifiable breach under the applicable legislation.

ALERT-COMP-004 — KYC gate bypass attempt [REG]

Field Value
Metric source bank/kyc/security custom namespace, metric GateBypassAttempts
Threshold Any detected bypass attempt (> 0)
Evaluation period Any single datapoint
Severity P1
Auto-mitigation The bypass attempt is rejected by the gate. The alert is raised for investigation regardless.
Escalation P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer and security team notified.
Notes A bypass attempt means something or someone attempted to open an account or perform a KYC-gated action without completing identity verification. This could be a misconfigured client, a code regression, or a deliberate circumvention attempt. Treat as a security incident until proven otherwise.

Security alerts

ALERT-SEC-001 — Brute-force authentication indicator

Field Value
Metric source AWS/WAF namespace, metric BlockedRequests per rule RateLimit-AuthEndpoint, or custom bank/auth/security metric FailedAuthPerIP
Threshold > 10 failed authentication attempts per minute from a single IP
Evaluation period Over 1 minute
Severity P1
Auto-mitigation WAF rate-limit rule blocks the IP after threshold breach. Cognito account lockout applies per-user.
Escalation P1: on-call engineer + Head of Technology paged within 15 minutes. Security team notified.
Notes 10 failures per minute per IP is a conservative threshold that filters out legitimate retry loops. Investigate whether the IP is a known proxy or Tor exit node. Consider blocking at the WAF level if the pattern persists.
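
If the custom FailedAuthPerIP metric is used, the per-IP count can be kept with a sliding window. The sketch below is one possible approach, not the WAF rule's behaviour. The class name and the use of epoch-second timestamps are assumptions.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FailedAuthTracker:
    """Counts failed authentication attempts per IP over a sliding window."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window = window_seconds
        self._failures: Dict[str, Deque[float]] = defaultdict(deque)

    def record_failure(self, ip: str, timestamp: float) -> bool:
        """Record one failure; return True if the IP now exceeds the limit."""
        failures = self._failures[ip]
        failures.append(timestamp)
        # Drop failures that have aged out of the window.
        while failures and failures[0] <= timestamp - self.window:
            failures.popleft()
        return len(failures) > self.limit
```

Timestamps are assumed to arrive in non-decreasing order per IP. The eleventh failure inside one minute is the first to return True, matching the "> 10" threshold.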

ALERT-SEC-002 — Cognito admin API call outside change window

Field Value
Metric source CloudTrail cognito-idp.amazonaws.com events, eventName in [AdminCreateUser, AdminDeleteUser, AdminSetUserPassword, AdminUpdateUserPool, UpdateUserPool], evaluated against approved change window schedule
Threshold Any qualifying API call outside the approved change window
Evaluation period Any single event
Severity P2
Auto-mitigation None
Escalation P2: on-call engineer paged within 1 hour. Security team notified.
Notes Cognito admin changes outside a change window may indicate an unauthorised modification to the authentication configuration. Changes during a change window should be correlated against the approved change ticket.
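
The evaluation against the change window schedule might look like the sketch below. It reads the standard CloudTrail record fields `eventSource`, `eventName`, and `eventTime`. Representing the approved schedule as a list of (start, end) datetime pairs is an assumption.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Tuple

COGNITO_ADMIN_EVENTS = frozenset({
    "AdminCreateUser",
    "AdminDeleteUser",
    "AdminSetUserPassword",
    "AdminUpdateUserPool",
    "UpdateUserPool",
})


def outside_change_window(
    event: Dict[str, Any],
    windows: Iterable[Tuple[datetime, datetime]],
) -> bool:
    """True if a qualifying Cognito admin call falls outside every window."""
    if event.get("eventSource") != "cognito-idp.amazonaws.com":
        return False
    if event.get("eventName") not in COGNITO_ADMIN_EVENTS:
        return False
    # CloudTrail timestamps are UTC with a trailing "Z"; normalise the suffix
    # so datetime.fromisoformat accepts it on older Python versions.
    event_time = datetime.fromisoformat(event["eventTime"].replace("Z", "+00:00"))
    return not any(start <= event_time < end for start, end in windows)
```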

ALERT-SEC-003 — IAM role assumption from unexpected principal

Field Value
Metric source CloudTrail sts.amazonaws.com events, eventName=AssumeRole, cross-referenced against approved principals list
Threshold Any AssumeRole event where the calling principal is not in the approved list
Evaluation period Any single event
Severity P1
Auto-mitigation None. STS does not block the call — this alert is detective, not preventive.
Escalation P1: on-call engineer + Head of Technology paged within 15 minutes. Security team notified. Treat as a potential credential compromise until proven otherwise.
Notes Maintain the approved principals list in the MOD-076 configuration. Review the list during every deployment to ensure decommissioned services are removed.
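
A minimal sketch of the cross-reference follows, assuming the approved list is a set of principal ARNs and service names. The function name is hypothetical. It reads the caller from the CloudTrail `userIdentity` block and treats a missing identity as unexpected, so that malformed events are flagged rather than silently passed.

```python
from typing import Any, Dict, Set


def unexpected_assume_role(event: Dict[str, Any], approved_principals: Set[str]) -> bool:
    """True if an AssumeRole call comes from a principal not on the approved list."""
    if event.get("eventSource") != "sts.amazonaws.com":
        return False
    if event.get("eventName") != "AssumeRole":
        return False
    identity = event.get("userIdentity", {})
    # IAM callers carry an ARN; AWS service callers carry invokedBy instead.
    principal = identity.get("arn") or identity.get("invokedBy")
    return principal not in approved_principals
```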

ALERT-SEC-004 — CloudTrail logging gap [REG]

Field Value
Metric source CloudTrail metric filter on trail delivery, or AWS/CloudTrail metric EventsDeliveredToS3
Threshold No CloudTrail events delivered for > 15 minutes
Evaluation period Over 15 minutes
Severity P1
Auto-mitigation None
Escalation P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified.
Notes A CloudTrail gap means API activity during that window is not auditable. This is a compliance and forensic integrity issue. RBNZ and APRA both require continuous audit trail availability. Investigate whether the trail has been disabled or logging stopped, whether the S3 bucket has been deleted or its policy no longer allows CloudTrail delivery, or whether a KMS key issue is preventing log encryption.

Alert code reference

Code Name Severity
ALERT-LATENCY-001 Customer-facing API p99 latency P2
ALERT-LATENCY-002 Internal service call p99 latency P3
ALERT-LATENCY-003 Payment processing p99 latency P2
ALERT-LATENCY-004 Core ledger posting p99 latency P2
ALERT-ERROR-001 Lambda error rate P2
ALERT-ERROR-002 Payment failure rate [REG] P1
ALERT-ERROR-003 KYC verification service error rate P2
ALERT-ERROR-004 Authentication error rate P2
ALERT-INFRA-001 EventBridge DLQ depth P2
ALERT-INFRA-002 Neon connection pool utilisation P2
ALERT-INFRA-003 S3 Glacier retrieval failure P3
ALERT-INFRA-004 Lambda concurrency approaching limit P2
ALERT-FIN-001 Balance reconciliation discrepancy [REG] P1
ALERT-FIN-002 AML engine not receiving posting events [REG] P1
ALERT-FIN-003 CDC pipeline lag P2
ALERT-COMP-001 Sanctions list stale [REG] P1
ALERT-COMP-002 AML alert queue backlog P2
ALERT-COMP-003 Regulatory report not submitted within SLA [REG] P1
ALERT-COMP-004 KYC gate bypass attempt [REG] P1
ALERT-SEC-001 Brute-force authentication indicator P1
ALERT-SEC-002 Cognito admin API call outside change window P2
ALERT-SEC-003 IAM role assumption from unexpected principal P1
ALERT-SEC-004 CloudTrail logging gap [REG] P1

[REG] = Regulatory implication. Chief Compliance Officer notified at alert time. Regulator contact initiated if unresolved after 1 hour.