Alert thresholds
Resolves: GAP-D09 — No alert threshold specification.
This document specifies all alert thresholds for the platform observability layer (MOD-076). MOD-076 is not yet built, but thresholds are specified here so the module can be configured correctly at build time and so on-call engineers understand what each alert means before it fires.
All alerts route via SNS to the on-call notification channel. P1 and P2 alerts also page the on-call engineer via PagerDuty, and P1 alerts additionally page the Head of Technology. P1 alerts with regulatory implications also notify the Chief Compliance Officer.
Related: post-deployment checklist · module activation matrix
Severity and escalation
| Severity | Response time | Who | Required action |
| --- | --- | --- | --- |
| P1 | 15 minutes | On-call engineer (paged) + Head of Technology | Immediate investigation; incident record opened immediately. If unresolved after 1 hour and the alert has regulatory implications, regulator contact is initiated (see the note below). |
| P2 | 1 hour | On-call engineer (paged) | Investigation and resolution plan within 1 hour. Incident record opened. |
| P3 | Next business day | Engineering team | Ticket created and triaged. No paging. |
Regulatory notifications: P1 alerts marked [REG] below also trigger a compliance notification to the Chief Compliance Officer at the time of alert. If the incident is not resolved within 1 hour, the compliance team initiates regulator contact procedures per the incident response policy.
Auto-mitigation: Where Lambda auto-retry or other automated responses are configured, the alert still fires at the threshold — mitigation is supplementary, not a substitute for investigation.
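
The routing rules above can be sketched as a small pure function. This is a minimal illustration, not the MOD-076 implementation: the channel names and the `Alert` type are hypothetical, and the sketch follows the severity table in treating P2 as a paging severity.

```python
from __future__ import annotations

from dataclasses import dataclass

# Channel names are illustrative placeholders, not MOD-076 identifiers.
SNS = "sns-on-call-channel"
PAGE_ON_CALL = "pagerduty-on-call"
PAGE_HEAD_OF_TECH = "page-head-of-technology"
NOTIFY_CCO = "notify-chief-compliance-officer"


@dataclass(frozen=True)
class Alert:
    code: str
    severity: str     # "P1", "P2" or "P3"
    regulatory: bool  # True for alerts marked [REG]


def route(alert: Alert) -> list[str]:
    """Return the notification targets for an alert, per the severity table."""
    targets = [SNS]  # every alert reaches the on-call notification channel
    if alert.severity in ("P1", "P2"):
        targets.append(PAGE_ON_CALL)
    if alert.severity == "P1":
        targets.append(PAGE_HEAD_OF_TECH)
        if alert.regulatory:
            targets.append(NOTIFY_CCO)  # compliance notified at alert time
    return targets
```

For example, `route(Alert("ALERT-LATENCY-002", "P3", False))` returns only the SNS channel, because P3 alerts do not page anyone.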
Latency alerts
ALERT-LATENCY-001 — Customer-facing API p99 latency
| Field | Value |
| --- | --- |
| Metric source | AWS/ApiGateway namespace, metric IntegrationLatency, dimension ApiName=bank-customer-api |
| Threshold | p99 > 1000ms |
| Evaluation period | Average over 5 minutes |
| Severity | P2 |
| Auto-mitigation | None |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | NFR target is ≤ 1000ms p99 for customer-facing APIs. Sustained breach indicates Lambda cold-start accumulation, database connection pool exhaustion, or upstream provider degradation. |
ALERT-LATENCY-002 — Internal service call p99 latency
| Field | Value |
| --- | --- |
| Metric source | bank/platform/latency custom namespace, metric InternalApiP99, dimension per system domain |
| Threshold | p99 > 200ms |
| Evaluation period | Average over 5 minutes |
| Severity | P3 |
| Auto-mitigation | None |
| Escalation | P3: ticket created next business day |
| Notes | NFR target is ≤ 200ms p99 for internal service calls. P3 rather than P2 because customers are not directly impacted; the breach will typically surface as a P2 latency alert downstream before causing customer impact. |
ALERT-LATENCY-003 — Payment processing p99 latency
| Field | Value |
| --- | --- |
| Metric source | bank/payments/latency custom namespace, metric PaymentProcessingP99 |
| Threshold | p99 > 500ms |
| Evaluation period | Average over 5 minutes |
| Severity | P2 |
| Auto-mitigation | None |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | NFR target is ≤ 500ms p99 for payment processing. Payments are a critical customer-facing path; a sustained p99 breach indicates a systemic issue. |
ALERT-LATENCY-004 — Core ledger posting p99 latency
| Field | Value |
| --- | --- |
| Metric source | bank/core/latency custom namespace, metric LedgerPostingP99 |
| Threshold | p99 > 200ms |
| Evaluation period | Average over 5 minutes |
| Severity | P2 |
| Auto-mitigation | None |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | The ledger is the system of record for all financial movements. Posting latency above 200ms propagates to every downstream system. Investigate Neon connection pool and Lambda concurrency first. |
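
As an illustration of how the latency thresholds above might translate into CloudWatch alarm definitions, the sketch below builds the keyword arguments for a `PutMetricAlarm` call. It rests on assumptions not stated in this document: the API Gateway alarm uses the p99 percentile statistic over one 5-minute period, the custom `...P99` metrics are already published as p99 values and are therefore averaged, and missing data is treated as not breaching. The helper name `latency_alarm` is hypothetical.

```python
from __future__ import annotations


def latency_alarm(name: str, namespace: str, metric: str, threshold_ms: float,
                  stat: str, dimensions: dict[str, str] | None = None) -> dict:
    """Build keyword arguments for a CloudWatch PutMetricAlarm call."""
    params = {
        "AlarmName": name,
        "Namespace": namespace,
        "MetricName": metric,
        "Dimensions": [{"Name": k, "Value": v} for k, v in (dimensions or {}).items()],
        "Period": 300,            # one 5-minute period
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption: no traffic is not a breach
    }
    # CloudWatch takes percentiles (for example "p99") via ExtendedStatistic
    # and the standard statistics (Average, Sum, ...) via Statistic.
    if stat.startswith("p"):
        params["ExtendedStatistic"] = stat
    else:
        params["Statistic"] = stat
    return params


customer_api = latency_alarm(
    "ALERT-LATENCY-001", "AWS/ApiGateway", "IntegrationLatency", 1000, "p99",
    {"ApiName": "bank-customer-api"},
)
ledger = latency_alarm(
    "ALERT-LATENCY-004", "bank/core/latency", "LedgerPostingP99", 200, "Average",
)
```

With boto3, a dictionary like `customer_api` could be passed as `cloudwatch.put_metric_alarm(**customer_api)`. Alarm actions (the SNS topic) are omitted here.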
Error rate alerts
ALERT-ERROR-001 — Lambda error rate
| Field | Value |
| --- | --- |
| Metric source | AWS/Lambda namespace, metrics Errors and Invocations, per function |
| Threshold | Error rate > 1% (Errors / Invocations) |
| Evaluation period | Sustained over 5 minutes |
| Severity | P2 |
| Auto-mitigation | Lambda retries on async invocations (2 retries with backoff). Dead-letter queue captures events that exhaust retries. |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | This alert fires per Lambda function. A single function exceeding 1% error rate triggers the alert. Identify the function from the CloudWatch dimension, then investigate its logs in CloudWatch Logs and its traces in X-Ray. |
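
The per-function error-rate check can be sketched as follows. This is a minimal illustration under assumptions: the inputs are the summed Errors and Invocations for each function over the 5-minute window, a function with no invocations is treated as healthy, and the function names in the example are made up.

```python
from __future__ import annotations


def error_rate_percent(errors: int, invocations: int) -> float:
    """Errors as a percentage of invocations; 0.0 when nothing was invoked."""
    if invocations == 0:
        return 0.0  # assumption: an idle function is not in breach
    return 100.0 * errors / invocations


def breaching_functions(samples: dict[str, tuple[int, int]],
                        threshold_percent: float = 1.0) -> list[str]:
    """Return the functions whose error rate is strictly above the threshold.

    samples maps a function name to (errors, invocations) for the window.
    """
    return sorted(
        name for name, (errors, invocations) in samples.items()
        if error_rate_percent(errors, invocations) > threshold_percent
    )


window = {
    "ledger-post": (3, 200),   # 1.5%
    "kyc-check": (1, 500),     # 0.2%
    "idle-fn": (0, 0),         # no traffic
}
print(breaching_functions(window))
```

Because the comparison is strict, a function at exactly 1% does not trigger the alert, which matches the "> 1%" threshold.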
ALERT-ERROR-002 — Payment failure rate [REG]
| Field | Value |
| --- | --- |
| Metric source | bank/payments/errors custom namespace, metric PaymentFailureRate |
| Threshold | Failure rate > 0.5% |
| Evaluation period | Over 15 minutes |
| Severity | P1 |
| Auto-mitigation | None |
| Escalation | P1: on-call engineer and Head of Technology paged within 15 minutes. Compliance team notified. |
| Notes | Payments are a critical path. A failure rate above 0.5% over 15 minutes indicates a systemic issue, not isolated retries. This may trigger obligations under the PSPA (NZ) or PSA (AU) if customers cannot make payments. |
ALERT-ERROR-003 — KYC verification service error rate
| Field | Value |
| --- | --- |
| Metric source | bank/kyc/errors custom namespace, metric VerificationErrorRate |
| Threshold | Error rate > 5% |
| Evaluation period | Over 5 minutes |
| Severity | P2 |
| Auto-mitigation | None |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | A high eIDV error rate typically indicates provider API degradation rather than a platform bug. Check eIDV provider status page and Secrets Manager API key validity. At > 5% error rate, customer onboarding is effectively impaired. |
ALERT-ERROR-004 — Authentication error rate
| Field | Value |
| --- | --- |
| Metric source | AWS/Cognito namespace, metric SignInSuccesses per user pool (Sum gives successful sign-ins, SampleCount gives total attempts; failures are the difference) |
| Threshold | Authentication error rate > 2% |
| Evaluation period | Over 5 minutes |
| Severity | P2 |
| Auto-mitigation | Cognito account lockout after 5 consecutive failures per user. |
| Escalation | P2: on-call engineer paged within 1 hour. If the pattern suggests a credential stuffing or brute-force attack, escalate to P1 and notify security team. |
| Notes | A 2% authentication failure rate may indicate either legitimate service degradation or an authentication attack. Correlate with ALERT-SEC-001 (failed auth rate per IP) to distinguish. |
Infrastructure alerts
ALERT-INFRA-001 — EventBridge DLQ depth
| Field | Value |
| --- | --- |
| Metric source | AWS/SQS namespace, metric ApproximateNumberOfMessagesVisible, per DLQ queue |
| Threshold | Any depth > 0 |
| Evaluation period | Any single datapoint |
| Severity | P2 |
| Auto-mitigation | None. Events in DLQ require manual investigation and replay. |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | Any event reaching a DLQ represents a delivery failure. Inspect the DLQ message to identify which Lambda target failed and why. Do not delete DLQ messages without understanding the root cause; they may be required for audit. |
ALERT-INFRA-002 — Neon connection pool utilisation
| Field | Value |
| --- | --- |
| Metric source | bank/platform/database custom namespace, metric PgBouncerPoolUtilisation, dimension per database |
| Threshold | Pool utilisation > 80% |
| Evaluation period | Sustained over 10 minutes |
| Severity | P2 |
| Auto-mitigation | None |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | At > 80% pool utilisation, new connections will queue and latency will degrade. Investigate Lambda concurrency growth and whether connection acquisition timeouts are occurring. May require scaling the PgBouncer pool configuration. |
ALERT-INFRA-003 — S3 Glacier retrieval failure
| Field | Value |
| --- | --- |
| Metric source | AWS/S3 namespace, metric GetRequests 4xx/5xx errors for Glacier storage class |
| Threshold | Any retrieval failure |
| Evaluation period | Any single datapoint |
| Severity | P3 |
| Auto-mitigation | None |
| Escalation | P3: ticket created next business day |
| Notes | Glacier failures are typically for archival data retrieval (audit logs, historical statements). Not time-critical unless related to a regulatory request, in which case escalate to P2. |
ALERT-INFRA-004 — Lambda concurrency approaching limit
| Field | Value |
| --- | --- |
| Metric source | AWS/Lambda namespace, metric ConcurrentExecutions vs reserved concurrency per function group |
| Threshold | Concurrent executions > 80% of reserved concurrency |
| Evaluation period | Average over 5 minutes |
| Severity | P2 |
| Auto-mitigation | None. Lambda will throttle beyond the reserved concurrency limit. |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | Approaching the concurrency limit means throttling is imminent, which will cause 429 errors for customers. Investigate the traffic pattern driving the spike and consider whether reserved concurrency needs to be increased for the affected function group. |
Financial integrity alerts
ALERT-FIN-001 — Balance reconciliation discrepancy [REG]
| Field | Value |
| --- | --- |
| Metric source | bank/core/reconciliation custom namespace, metric DiscrepancyCount |
| Threshold | Any discrepancy (> 0) |
| Evaluation period | Any single datapoint from the reconciliation engine |
| Severity | P1 |
| Auto-mitigation | None. Any discrepancy must be investigated by a human. |
| Escalation | P1 immediate: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified at alert time. Incident record opened. |
| Notes | Zero-tolerance NFR. A balance discrepancy means the double-entry invariant has been violated. This is a regulatory event and may require notification to the RBNZ or APRA depending on magnitude and cause. Do not clear the alert without a root-cause explanation signed off by engineering and compliance. |
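
To make the double-entry invariant concrete, the sketch below shows one way a discrepancy count could be derived: for every transaction, debits must equal credits. It is an illustration only. The entry layout `(transaction_id, side, amount)` and the function name are hypothetical, and the real reconciliation engine may work differently.

```python
from collections import defaultdict
from decimal import Decimal


def discrepancy_count(entries) -> int:
    """Count transactions whose debits and credits do not net to zero.

    entries: iterable of (transaction_id, side, amount), where side is
    "debit" or "credit" and amount is a Decimal (never a float, so that
    sums are exact).
    """
    net = defaultdict(Decimal)  # Decimal() is zero
    for txn_id, side, amount in entries:
        net[txn_id] += amount if side == "debit" else -amount
    return sum(1 for total in net.values() if total != 0)


postings = [
    ("txn-1", "debit", Decimal("100.00")),
    ("txn-1", "credit", Decimal("100.00")),
    ("txn-2", "debit", Decimal("50.00")),
    ("txn-2", "credit", Decimal("49.99")),  # one cent missing
]
print(discrepancy_count(postings))
```

Any non-zero result corresponds to the "> 0" threshold and would raise the P1 alert.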
ALERT-FIN-002 — AML engine not receiving posting events [REG]
| Field | Value |
| --- | --- |
| Metric source | bank/aml/event-lag custom namespace, metric PostingToAmlEventLagSeconds |
| Threshold | Any posting not received by AML engine within 5 minutes of ledger commit |
| Evaluation period | Calculated per posting; alert fires on first miss |
| Severity | P1 |
| Auto-mitigation | None |
| Escalation | P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified. |
| Notes | AML monitoring must screen all transactions. A gap in event delivery means transactions are clearing without AML review, which is a direct AML/CFT compliance breach. The gap must be closed and missed events replayed before the alert can be cleared. |
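
The per-posting check can be sketched as below. This is a minimal illustration, assuming the ledger commit times and AML receipt times are available as dictionaries keyed by posting ID; those inputs and the function name are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta

MAX_LAG = timedelta(minutes=5)


def missed_postings(committed: dict[str, datetime],
                    received: dict[str, datetime],
                    now: datetime) -> list[str]:
    """Return posting IDs that missed the 5-minute delivery window.

    A posting is missed if the AML engine received it more than 5 minutes
    after commit, or has not received it and 5 minutes have already passed.
    """
    missed = []
    for posting_id, commit_time in committed.items():
        seen = received.get(posting_id)
        deadline = commit_time + MAX_LAG
        if seen is None:
            if now > deadline:
                missed.append(posting_id)
        elif seen > deadline:
            missed.append(posting_id)
    return sorted(missed)
```

The returned IDs also identify the events to replay before the alert can be cleared.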
ALERT-FIN-003 — CDC pipeline lag
| Field | Value |
| --- | --- |
| Metric source | AWS/Firehose namespace, metric DeliveryToS3.DataFreshness for the CDC delivery stream |
| Threshold | Lag > 15 minutes |
| Evaluation period | Average over 5 minutes |
| Severity | P2 |
| Auto-mitigation | Kinesis Firehose retries delivery automatically. |
| Escalation | P2: on-call engineer paged within 1 hour |
| Notes | CDC lag above 15 minutes means regulatory reporting and Snowflake data are stale. If the lag persists and a regulatory report is due, escalate to P1. |
Regulatory and compliance alerts
ALERT-COMP-001 — Sanctions list stale [REG]
| Field | Value |
| --- | --- |
| Metric source | bank/aml/sanctions custom namespace, metric SanctionsListAgeHours |
| Threshold | > 25 hours since last successful refresh |
| Evaluation period | Any single datapoint |
| Severity | P1 |
| Auto-mitigation | Sanctions list refresh runs daily. If the scheduled refresh fails, the module will retry up to 3 times before the alert fires. |
| Escalation | P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified at alert time. |
| Notes | The 25-hour threshold gives 1 hour of slack over the 24-hour refresh schedule. A stale sanctions list means customer screening may be operating against an outdated dataset, which is a direct AML/CFT compliance risk. Manual refresh must be triggered immediately. |
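
The staleness arithmetic can be illustrated with a short sketch. The function names and the example timestamps are assumptions; only the 25-hour threshold comes from the table above.

```python
from datetime import datetime, timezone

STALE_AFTER_HOURS = 25.0  # 24-hour schedule plus 1 hour of slack


def sanctions_list_age_hours(last_success: datetime, now: datetime) -> float:
    """Hours elapsed since the last successful sanctions list refresh."""
    return (now - last_success).total_seconds() / 3600


def is_stale(last_success: datetime, now: datetime) -> bool:
    """True once the list is strictly older than the 25-hour threshold."""
    return sanctions_list_age_hours(last_success, now) > STALE_AFTER_HOURS


# Hypothetical example: the last good refresh ran at 02:00 UTC.
last_good = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
```

In this example, a refresh that fails at 02:00 the next day leaves the list 24 hours old, and the alert would fire shortly after 03:00 if no retry has succeeded by then.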
ALERT-COMP-002 — AML alert queue backlog
| Field | Value |
| --- | --- |
| Metric source | bank/aml/cases custom namespace, metric AlertQueueDepth |
| Threshold | Queue depth > 100 |
| Evaluation period | Average over 5 minutes |
| Severity | P2 |
| Auto-mitigation | None |
| Escalation | P2: on-call engineer paged within 1 hour. Compliance team also notified (not just engineering). |
| Notes | A queue of > 100 unreviewed AML alerts indicates either a spike in suspicious activity or a processing backlog. Either way, the compliance team needs to be aware. The on-call engineer investigates whether the queue depth is a processing failure; the compliance team reviews the alert content. |
ALERT-COMP-003 — Regulatory report not submitted within SLA [REG]
| Field | Value |
| --- | --- |
| Metric source | bank/reporting/schedule custom namespace, metric ReportOverdueSLA per report |
| Threshold | Any report exceeds its SLA deadline |
| Evaluation period | Any single datapoint |
| Severity | P1 |
| Auto-mitigation | None |
| Escalation | P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified immediately, as they own the regulator relationship. |
| Notes | SLA deadlines are defined per report in the reporting module configuration. Late submission to a regulator (RBNZ, APRA, AUSTRAC, FIU) is a notifiable breach under the applicable legislation. |
ALERT-COMP-004 — KYC gate bypass attempt [REG]
| Field | Value |
| --- | --- |
| Metric source | bank/kyc/security custom namespace, metric GateBypassAttempts |
| Threshold | Any detected bypass attempt (> 0) |
| Evaluation period | Any single datapoint |
| Severity | P1 |
| Auto-mitigation | The bypass attempt is rejected by the gate. The alert is raised for investigation regardless. |
| Escalation | P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer and security team notified. |
| Notes | A bypass attempt means something or someone attempted to open an account or perform a KYC-gated action without completing identity verification. This could be a misconfigured client, a code regression, or a deliberate circumvention attempt. Treat as a security incident until proven otherwise. |
Security alerts
ALERT-SEC-001 — Brute-force authentication indicator
| Field | Value |
| --- | --- |
| Metric source | AWS/WAF namespace, metric BlockedRequests per rule RateLimit-AuthEndpoint, or custom bank/auth/security metric FailedAuthPerIP |
| Threshold | > 10 failed authentication attempts per minute from a single IP |
| Evaluation period | Over 1 minute |
| Severity | P1 |
| Auto-mitigation | WAF rate-limit rule blocks the IP after threshold breach. Cognito account lockout applies per-user. |
| Escalation | P1: on-call engineer + Head of Technology paged within 15 minutes. Security team notified. |
| Notes | 10 failures per minute per IP is a conservative threshold that filters out legitimate retry loops. Investigate whether the IP is a known proxy or Tor exit node. Consider blocking at the WAF level if the pattern persists. |
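
A per-IP sliding window is one way the FailedAuthPerIP threshold could be evaluated. The sketch below is an illustration under assumptions: the class name is hypothetical, timestamps are seconds as floats, and events for a given IP arrive in time order.

```python
from collections import defaultdict, deque


class FailedAuthMonitor:
    """Tracks failed authentication attempts per source IP in a sliding window."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0):
        self.limit = limit
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # ip -> timestamps inside the window

    def record_failure(self, ip: str, ts: float) -> bool:
        """Record one failure and return True if the IP now exceeds the limit."""
        window = self._failures[ip]
        window.append(ts)
        # Drop failures that are a full window (or more) older than this one.
        while window and window[0] <= ts - self.window_seconds:
            window.popleft()
        return len(window) > self.limit


monitor = FailedAuthMonitor()
# Eleven failures in eleven seconds from one documentation-range IP.
results = [monitor.record_failure("203.0.113.7", float(t)) for t in range(11)]
```

The first ten failures stay at or below the limit; the eleventh crosses it, matching the "> 10 per minute" threshold.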
ALERT-SEC-002 — Cognito admin API call outside change window
| Field | Value |
| --- | --- |
| Metric source | CloudTrail cognito-idp.amazonaws.com events, eventName in [AdminCreateUser, AdminDeleteUser, AdminSetUserPassword, AdminUpdateUserPool, UpdateUserPool], evaluated against approved change window schedule |
| Threshold | Any qualifying API call outside the approved change window |
| Evaluation period | Any single event |
| Severity | P2 |
| Auto-mitigation | None |
| Escalation | P2: on-call engineer paged within 1 hour. Security team notified. |
| Notes | Cognito admin changes outside a change window may indicate an unauthorised modification to the authentication configuration. Changes during a change window should be correlated against the approved change ticket. |
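
The change-window evaluation could look roughly like the sketch below. It assumes a CloudTrail event record is available as a dictionary with `eventSource`, `eventName`, and a UTC `eventTime` such as `2024-05-01T03:10:00Z`, and that approved windows are supplied as `(start, end)` pairs. The function name and the window source are hypothetical; the watched event names are copied from the table above.

```python
from __future__ import annotations

from datetime import datetime, timezone

WATCHED_EVENTS = frozenset({
    "AdminCreateUser", "AdminDeleteUser", "AdminSetUserPassword",
    "AdminUpdateUserPool", "UpdateUserPool",
})


def outside_change_window(event: dict,
                          windows: list[tuple[datetime, datetime]]) -> bool:
    """True if a watched Cognito admin call happened outside every window."""
    if event.get("eventSource") != "cognito-idp.amazonaws.com":
        return False
    if event.get("eventName") not in WATCHED_EVENTS:
        return False
    when = datetime.strptime(event["eventTime"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    # Windows are half-open: start is inside, end is outside.
    return not any(start <= when < end for start, end in windows)
```

An empty window list means every watched call is flagged, which is the conservative default when no change is scheduled.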
ALERT-SEC-003 — IAM role assumption from unexpected principal
| Field | Value |
| --- | --- |
| Metric source | CloudTrail sts.amazonaws.com events, eventName=AssumeRole, cross-referenced against approved principals list |
| Threshold | Any AssumeRole event where the calling principal is not in the approved list |
| Evaluation period | Any single event |
| Severity | P1 |
| Auto-mitigation | None. STS does not block the call; this alert is detective, not preventive. |
| Escalation | P1: on-call engineer + Head of Technology paged within 15 minutes. Security team notified. Treat as a potential credential compromise until proven otherwise. |
| Notes | Maintain the approved principals list in the MOD-076 configuration. Review the list during every deployment to ensure decommissioned services are removed. |
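
The cross-reference against the approved principals list can be sketched as below. This is an illustration that assumes the caller is identified by `userIdentity.arn` in the CloudTrail record and that the approved list is a set of ARNs; the function name and the example ARN are hypothetical, and the actual MOD-076 list format is not defined in this document.

```python
from __future__ import annotations


def unexpected_assume_role(event: dict, approved_arns: set[str]) -> bool:
    """True if an AssumeRole call came from a principal not on the approved list."""
    if event.get("eventSource") != "sts.amazonaws.com":
        return False
    if event.get("eventName") != "AssumeRole":
        return False
    caller = event.get("userIdentity", {}).get("arn")
    # A record with no caller ARN is flagged too: unknown is not approved.
    return caller not in approved_arns


approved = {"arn:aws:iam::111122223333:role/deploy-pipeline"}  # example value only
```

Flagging records that carry no caller ARN is a deliberate choice in this sketch; it keeps the check fail-closed, at the cost of possible noise that the approved list would need to account for.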
ALERT-SEC-004 — CloudTrail logging gap [REG]
| Field | Value |
| --- | --- |
| Metric source | CloudTrail metric filter on trail delivery, or AWS/CloudTrail metric EventsDeliveredToS3 |
| Threshold | No CloudTrail events delivered for > 15 minutes |
| Evaluation period | Over 15 minutes |
| Severity | P1 |
| Auto-mitigation | None |
| Escalation | P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified. |
| Notes | A CloudTrail gap means API activity during that window is not auditable. This is a compliance and forensic integrity issue. RBNZ and APRA both require continuous audit trail availability. Investigate whether the trail is disabled, the S3 bucket policy is blocking delivery, or there is a KMS key issue preventing log encryption. |
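
Gap detection over delivery timestamps can be sketched as follows. This is a minimal illustration, assuming a list of log delivery times is available from whichever metric source is chosen; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def logging_gaps(delivery_times, now, max_gap=timedelta(minutes=15)):
    """Return (start, end) pairs where no delivery was seen for longer than max_gap.

    The current time is appended as a final point so that an ongoing gap is
    reported too. With no deliveries at all, nothing is returned; the caller
    should treat an empty history as a gap in its own right.
    """
    points = sorted(delivery_times) + [now]
    return [
        (earlier, later)
        for earlier, later in zip(points, points[1:])
        if later - earlier > max_gap
    ]


deliveries = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 30),  # 25 minutes after the previous delivery
]
gaps = logging_gaps(deliveries, now=datetime(2024, 1, 1, 10, 40))
```

Each returned pair bounds a window in which API activity may not be auditable, which is the period to record in the incident.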
Alert code reference
| Code | Name | Severity |
| --- | --- | --- |
| ALERT-LATENCY-001 | Customer-facing API p99 latency | P2 |
| ALERT-LATENCY-002 | Internal service call p99 latency | P3 |
| ALERT-LATENCY-003 | Payment processing p99 latency | P2 |
| ALERT-LATENCY-004 | Core ledger posting p99 latency | P2 |
| ALERT-ERROR-001 | Lambda error rate | P2 |
| ALERT-ERROR-002 | Payment failure rate [REG] | P1 |
| ALERT-ERROR-003 | KYC verification service error rate | P2 |
| ALERT-ERROR-004 | Authentication error rate | P2 |
| ALERT-INFRA-001 | EventBridge DLQ depth | P2 |
| ALERT-INFRA-002 | Neon connection pool utilisation | P2 |
| ALERT-INFRA-003 | S3 Glacier retrieval failure | P3 |
| ALERT-INFRA-004 | Lambda concurrency approaching limit | P2 |
| ALERT-FIN-001 | Balance reconciliation discrepancy [REG] | P1 |
| ALERT-FIN-002 | AML engine not receiving posting events [REG] | P1 |
| ALERT-FIN-003 | CDC pipeline lag | P2 |
| ALERT-COMP-001 | Sanctions list stale [REG] | P1 |
| ALERT-COMP-002 | AML alert queue backlog | P2 |
| ALERT-COMP-003 | Regulatory report not submitted within SLA [REG] | P1 |
| ALERT-COMP-004 | KYC gate bypass attempt [REG] | P1 |
| ALERT-SEC-001 | Brute-force authentication indicator | P1 |
| ALERT-SEC-002 | Cognito admin API call outside change window | P2 |
| ALERT-SEC-003 | IAM role assumption from unexpected principal | P1 |
| ALERT-SEC-004 | CloudTrail logging gap [REG] | P1 |
[REG] = Regulatory implication. Chief Compliance Officer notified at alert time. Regulator contact initiated if unresolved after 1 hour.