Alert thresholds

Resolves: GAP-D09 — No alert threshold specification.

This document specifies all alert thresholds for the platform observability layer (MOD-076). MOD-076 is not yet built, but thresholds are specified here so the module can be configured correctly at build time and so on-call engineers understand what each alert means before it fires.

All alerts route via SNS to the on-call notification channel. P1 and P2 alerts also page the on-call engineer via PagerDuty. P1 alerts with regulatory implications additionally notify the Chief Compliance Officer.

Severity and escalation

Severity	Response time	Who	Required action
P1	15 minutes	On-call engineer (paged) + Head of Technology	Immediate investigation. Incident record opened immediately. If the alert has regulatory implications and is unresolved after 1 hour, regulator contact procedures are initiated (see note below).
P2	1 hour	On-call engineer (paged)	Investigation and resolution plan within 1 hour. Incident record opened.
P3	Next business day	Engineering team	Ticket created and triaged. No paging.

Regulatory notifications: P1 alerts marked [REG] below also trigger a compliance notification to the Chief Compliance Officer at the time of alert. If the incident is not resolved within 1 hour, the compliance team initiates regulator contact procedures per the incident response policy.

Auto-mitigation: Where Lambda auto-retry or other automated responses are configured, the alert still fires at the threshold — mitigation is supplementary, not a substitute for investigation.
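
The routing rules in this section can be sketched as a small lookup. This is a minimal illustration, not the MOD-076 implementation: the target strings are hypothetical placeholders (the specification names roles, not identifiers), and the sketch follows the severity table in treating both P1 and P2 as paging the on-call engineer.

```python
# Hypothetical target names; the specification names roles, not identifiers.
SNS_ON_CALL = "sns:on-call-channel"
PAGE_ON_CALL = "pagerduty:on-call-engineer"
PAGE_HEAD_OF_TECH = "pagerduty:head-of-technology"
NOTIFY_CCO = "notify:chief-compliance-officer"

# Response times from the severity table, in minutes (None = next business day).
RESPONSE_MINUTES = {"P1": 15, "P2": 60, "P3": None}


def notification_targets(severity: str, regulatory: bool = False) -> list[str]:
    """Return who is notified when an alert of this severity fires."""
    if severity not in RESPONSE_MINUTES:
        raise ValueError(f"unknown severity: {severity}")
    targets = [SNS_ON_CALL]  # every alert routes via SNS
    if severity in ("P1", "P2"):
        targets.append(PAGE_ON_CALL)  # P1 and P2 page the on-call engineer
    if severity == "P1":
        targets.append(PAGE_HEAD_OF_TECH)
        if regulatory:
            # [REG] alerts notify the Chief Compliance Officer at alert time.
            targets.append(NOTIFY_CCO)
    return targets
```

Under these assumptions, `notification_targets("P1", regulatory=True)` returns all four targets, while a P3 alert only reaches the SNS channel.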

Latency alerts

ALERT-LATENCY-001 — Customer-facing API p99 latency

Field	Value
Metric source	`AWS/ApiGateway` namespace, metric `IntegrationLatency`, dimension `ApiName=bank-customer-api`
Threshold	p99 > 1000ms
Evaluation period	Average over 5 minutes
Severity	P2
Auto-mitigation	None
Escalation	P2: on-call engineer paged within 1 hour
Notes	NFR target is ≤ 1000ms p99 for customer-facing APIs. Sustained breach indicates Lambda cold-start accumulation, database connection pool exhaustion, or upstream provider degradation.
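
As a sketch of how this row might translate into a CloudWatch alarm, the function below builds the keyword arguments for `put_metric_alarm`. It assumes the p99 is evaluated as a percentile statistic over a single 300-second period; the alarm name, the `TreatMissingData` choice, and the SNS topic ARN parameter are illustrative assumptions rather than part of the specification.

```python
def latency_alarm_params(sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm (ALERT-LATENCY-001)."""
    return {
        "AlarmName": "ALERT-LATENCY-001",  # assumed naming convention
        "Namespace": "AWS/ApiGateway",
        "MetricName": "IntegrationLatency",
        "Dimensions": [{"Name": "ApiName", "Value": "bank-customer-api"}],
        "ExtendedStatistic": "p99",  # percentile statistics use ExtendedStatistic
        "Period": 300,  # one 5-minute period, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 1000.0,  # milliseconds
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",  # assumption, not from the spec
        "AlarmActions": [sns_topic_arn],
    }
```

With boto3 installed, the result could be passed as `boto3.client("cloudwatch").put_metric_alarm(**latency_alarm_params(topic_arn))`.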

ALERT-LATENCY-002 — Internal service call p99 latency

Field	Value
Metric source	`bank/platform/latency` custom namespace, metric `InternalApiP99`, dimension per system domain
Threshold	p99 > 200ms
Evaluation period	Average over 5 minutes
Severity	P3
Auto-mitigation	None
Escalation	P3: ticket created next business day
Notes	NFR target is ≤ 200ms p99 for internal service calls. P3 rather than P2 because customers are not directly impacted; the breach will typically surface as a P2 latency alert downstream before causing customer impact.

ALERT-LATENCY-003 — Payment processing p99 latency

Field	Value
Metric source	`bank/payments/latency` custom namespace, metric `PaymentProcessingP99`
Threshold	p99 > 500ms
Evaluation period	Average over 5 minutes
Severity	P2
Auto-mitigation	None
Escalation	P2: on-call engineer paged within 1 hour
Notes	NFR target is ≤ 500ms p99 for payment processing. Payments are a critical customer-facing path; a sustained p99 breach indicates a systemic issue.

ALERT-LATENCY-004 — Core ledger posting p99 latency

Field	Value
Metric source	`bank/core/latency` custom namespace, metric `LedgerPostingP99`
Threshold	p99 > 200ms
Evaluation period	Average over 5 minutes
Severity	P2
Auto-mitigation	None
Escalation	P2: on-call engineer paged within 1 hour
Notes	The ledger is the system of record for all financial movements. Posting latency above 200ms propagates to every downstream system. Investigate Neon connection pool and Lambda concurrency first.

Error rate alerts

ALERT-ERROR-001 — Lambda error rate

Field	Value
Metric source	`AWS/Lambda` namespace, metric `Errors` and `Invocations`, per function
Threshold	Error rate > 1% (Errors / Invocations)
Evaluation period	Sustained over 5 minutes
Severity	P2
Auto-mitigation	Lambda retries on async invocations (2 retries with backoff). Dead-letter queue captures events that exhaust retries.
Escalation	P2: on-call engineer paged within 1 hour
Notes	This alert fires per Lambda function: a single function exceeding a 1% error rate triggers it. Identify the function from the CloudWatch dimension, then investigate its CloudWatch Logs and X-Ray traces.
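
Because the error rate is a ratio of two metrics, one way to express it is a CloudWatch metric math alarm. The sketch below builds `put_metric_alarm` arguments for one function; the alarm name pattern and the single 5-minute evaluation period are assumptions made for illustration.

```python
def lambda_error_rate_alarm_params(function_name: str, sns_topic_arn: str) -> dict:
    """Build put_metric_alarm arguments for a per-function error-rate alarm."""

    def summed(metric_id: str, metric_name: str) -> dict:
        # Input metric: 5-minute sum, used only inside the expression.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"ALERT-ERROR-001-{function_name}",  # assumed naming
        "Metrics": [
            summed("errors", "Errors"),
            summed("invocations", "Invocations"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "Error rate (%)",
                "ReturnData": True,  # the alarm evaluates this expression
            },
        ],
        "EvaluationPeriods": 1,
        "Threshold": 1.0,  # percent
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

Only the expression returns data, so the alarm compares the computed percentage, not the raw counts, against the 1% threshold.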

ALERT-ERROR-002 — Payment failure rate [REG]

Field	Value
Metric source	`bank/payments/errors` custom namespace, metric `PaymentFailureRate`
Threshold	Failure rate > 0.5%
Evaluation period	Over 15 minutes
Severity	P1
Auto-mitigation	None
Escalation	P1: on-call engineer and Head of Technology paged within 15 minutes. Compliance team notified.
Notes	Payments are a critical path. A failure rate above 0.5% over 15 minutes indicates a systemic issue, not isolated retries. If customers cannot make payments, this may trigger obligations under the PSPA (NZ) or PSA (AU).

ALERT-ERROR-003 — KYC verification service error rate

Field	Value
Metric source	`bank/kyc/errors` custom namespace, metric `VerificationErrorRate`
Threshold	Error rate > 5%
Evaluation period	Over 5 minutes
Severity	P2
Auto-mitigation	None
Escalation	P2: on-call engineer paged within 1 hour
Notes	A high eIDV error rate typically indicates provider API degradation rather than a platform bug. Check eIDV provider status page and Secrets Manager API key validity. At > 5% error rate, customer onboarding is effectively impaired.

ALERT-ERROR-004 — Authentication error rate

Field	Value
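Metric source	`AWS/Cognito` namespace, metric `SignInSuccesses` per user pool (`Sum` counts successful sign-ins and `SampleCount` counts all attempts, so failures = `SampleCount` - `Sum`; Cognito does not publish a separate `SignInFailures` metric)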
Threshold	Authentication error rate > 2%
Evaluation period	Over 5 minutes
Severity	P2
Auto-mitigation	Cognito account lockout after 5 consecutive failures per user.
Escalation	P2: on-call engineer paged within 1 hour. If the pattern suggests a credential stuffing or brute-force attack, escalate to P1 and notify security team.
Notes	A 2% authentication failure rate may indicate either legitimate service degradation or an authentication attack. Correlate with ALERT-SEC-001 (failed auth rate per IP) to distinguish the two.

Infrastructure alerts

ALERT-INFRA-001 — EventBridge DLQ depth

Field	Value
Metric source	`AWS/SQS` namespace, metric `ApproximateNumberOfMessagesVisible`, per DLQ queue
Threshold	Any depth > 0
Evaluation period	Any single datapoint
Severity	P2
Auto-mitigation	None. Events in DLQ require manual investigation and replay.
Escalation	P2: on-call engineer paged within 1 hour
Notes	Any event reaching a DLQ represents a delivery failure. Inspect the DLQ message to identify which Lambda target failed and why. Do not delete DLQ messages without understanding the root cause — they may be required for audit.

ALERT-INFRA-002 — Neon connection pool utilisation

Field	Value
Metric source	`bank/platform/database` custom namespace, metric `PgBouncerPoolUtilisation`, dimension per database
Threshold	Pool utilisation > 80%
Evaluation period	Sustained over 10 minutes
Severity	P2
Auto-mitigation	None
Escalation	P2: on-call engineer paged within 1 hour
Notes	At > 80% pool utilisation, new connections will queue and latency will degrade. Investigate Lambda concurrency growth and whether connection acquisition timeouts are occurring. May require scaling the PgBouncer pool configuration.
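
"Sustained over 10 minutes" differs from an average: every sample in the window must breach, so a single spike does not fire the alert. A minimal sketch of that check, assuming one utilisation sample per minute (the sampling interval is not stated in this specification):

```python
def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # not enough data to call the breach sustained
    return all(value > threshold for value in samples[-window:])
```

In CloudWatch terms this would roughly correspond to a 60-second period with `EvaluationPeriods` and `DatapointsToAlarm` both set to 10, though the exact alarm settings are left to the MOD-076 build.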

ALERT-INFRA-003 — S3 Glacier retrieval failure

Field	Value
Metric source	`AWS/S3` namespace, metric `GetRequests` 4xx/5xx errors for Glacier storage class
Threshold	Any retrieval failure
Evaluation period	Any single datapoint
Severity	P3
Auto-mitigation	None
Escalation	P3: ticket created next business day
Notes	Glacier retrievals typically serve archival data (audit logs, historical statements), so a failure is not time-critical unless it relates to a regulatory request, in which case escalate to P2.

ALERT-INFRA-004 — Lambda concurrency approaching limit

Field	Value
Metric source	`AWS/Lambda` namespace, metric `ConcurrentExecutions` vs reserved concurrency per function group
Threshold	Concurrent executions > 80% of reserved concurrency
Evaluation period	Average over 5 minutes
Severity	P2
Auto-mitigation	None. Lambda will throttle beyond the reserved concurrency limit.
Escalation	P2: on-call engineer paged within 1 hour
Notes	Approaching the concurrency limit means throttling is imminent, which will cause 429 errors for customers. Investigate the traffic pattern driving the spike and consider whether reserved concurrency needs to be increased for the affected function group.

Financial integrity alerts

ALERT-FIN-001 — Balance reconciliation discrepancy [REG]

Field	Value
Metric source	`bank/core/reconciliation` custom namespace, metric `DiscrepancyCount`
Threshold	Any discrepancy (> 0)
Evaluation period	Any single datapoint from the reconciliation engine
Severity	P1
Auto-mitigation	None. Any discrepancy must be investigated by a human.
Escalation	P1 immediate: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified at alert time. Incident record opened.
Notes	Zero-tolerance NFR. A balance discrepancy means the double-entry invariant has been violated. This is a regulatory event and may require notification to the RBNZ or APRA depending on magnitude and cause. Do not clear the alert without a root-cause explanation signed off by engineering and compliance.
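
To make the double-entry invariant concrete: in one common formulation, the signed postings of every transaction must sum to exactly zero. The sketch below counts transactions that violate this. The posting shape (transaction ID plus signed amount, debits positive and credits negative) is an assumption for illustration; this specification does not describe how the reconciliation engine derives `DiscrepancyCount`.

```python
from collections import defaultdict
from decimal import Decimal


def discrepancy_count(postings: list[tuple[str, Decimal]]) -> int:
    """Count transactions whose signed postings do not sum to zero.

    Each posting is (transaction_id, signed_amount), with debits positive
    and credits negative (an assumed convention).
    """
    totals: dict[str, Decimal] = defaultdict(Decimal)
    for transaction_id, amount in postings:
        totals[transaction_id] += amount
    return sum(1 for total in totals.values() if total != 0)
```

`Decimal` is used so that amounts such as 0.10 are compared exactly; with binary floats, rounding error alone could produce a false discrepancy against a zero-tolerance threshold.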

ALERT-FIN-002 — AML engine not receiving posting events [REG]

Field	Value
Metric source	`bank/aml/event-lag` custom namespace, metric `PostingToAmlEventLagSeconds`
Threshold	Any posting not received by AML engine within 5 minutes of ledger commit
Evaluation period	Calculated per posting; alert fires on first miss
Severity	P1
Auto-mitigation	None
Escalation	P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified.
Notes	AML monitoring must screen all transactions. A gap in event delivery means transactions are clearing without AML review — a direct AML/CFT compliance breach. The gap must be closed and missed events replayed before the alert can be cleared.
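
The per-posting check described above can be sketched as a comparison of ledger commit times against AML receipt times. The two dictionaries keyed by posting ID are hypothetical inputs; the specification defines only the 5-minute limit and the fire-on-first-miss rule.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(minutes=5)


def missed_postings(
    commits: dict[str, datetime],
    received: dict[str, datetime],
    now: datetime,
) -> list[str]:
    """Return IDs of postings not received by the AML engine within MAX_LAG."""
    missed = []
    for posting_id, committed_at in commits.items():
        received_at = received.get(posting_id)
        if received_at is None:
            # Not yet received: a miss only once the 5-minute window has elapsed.
            if now - committed_at > MAX_LAG:
                missed.append(posting_id)
        elif received_at - committed_at > MAX_LAG:
            missed.append(posting_id)  # received, but too late
    return missed
```

Any non-empty result would fire the alert. The returned IDs also give a starting list for the replay that must complete before the alert is cleared.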

ALERT-FIN-003 — CDC pipeline lag

Field	Value
Metric source	`AWS/Firehose` namespace, metric `DeliveryToS3.DataFreshness` for the CDC delivery stream
Threshold	Lag > 15 minutes (900 seconds; the metric is reported in seconds)
Evaluation period	Average over 5 minutes
Severity	P2
Auto-mitigation	Kinesis Firehose retries delivery automatically.
Escalation	P2: on-call engineer paged within 1 hour
Notes	CDC lag above 15 minutes means regulatory reporting and Snowflake data are stale. If the lag persists and a regulatory report is due, escalate to P1.

Regulatory and compliance alerts

ALERT-COMP-001 — Sanctions list stale [REG]

Field	Value
Metric source	`bank/aml/sanctions` custom namespace, metric `SanctionsListAgeHours`
Threshold	> 25 hours since last successful refresh
Evaluation period	Any single datapoint
Severity	P1
Auto-mitigation	Sanctions list refresh runs daily. If the scheduled refresh fails, the module will retry up to 3 times before the alert fires.
Escalation	P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified at alert time.
Notes	The 25-hour threshold gives 1 hour of slack over the 24-hour refresh schedule. A stale sanctions list means customer screening may be operating against an outdated dataset — a direct AML/CFT compliance risk. Manual refresh must be triggered immediately.
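
The `SanctionsListAgeHours` value and its threshold reduce to simple arithmetic on the last successful refresh time. A minimal sketch, with function names chosen for illustration:

```python
from datetime import datetime, timezone

STALE_AFTER_HOURS = 25.0  # 24-hour schedule plus 1 hour of slack


def sanctions_list_age_hours(last_refresh: datetime, now: datetime) -> float:
    """Hours elapsed since the last successful refresh."""
    return (now - last_refresh).total_seconds() / 3600


def is_stale(last_refresh: datetime, now: datetime) -> bool:
    """True once the list is strictly older than the 25-hour threshold."""
    return sanctions_list_age_hours(last_refresh, now) > STALE_AFTER_HOURS
```

For example, a list refreshed at 02:00 UTC is not stale at 03:00 UTC the next day (exactly 25 hours) but is stale one minute later.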

ALERT-COMP-002 — AML alert queue backlog

Field	Value
Metric source	`bank/aml/cases` custom namespace, metric `AlertQueueDepth`
Threshold	Queue depth > 100
Evaluation period	Average over 5 minutes
Severity	P2
Auto-mitigation	None
Escalation	P2: on-call engineer paged within 1 hour. Compliance team also notified (not just engineering).
Notes	A queue of > 100 unreviewed AML alerts indicates either a spike in suspicious activity or a processing backlog. Either way, the compliance team needs to be aware. The on-call engineer investigates whether the queue depth is a processing failure; the compliance team reviews the alert content.

ALERT-COMP-003 — Regulatory report not submitted within SLA [REG]

Field	Value
Metric source	`bank/reporting/schedule` custom namespace, metric `ReportOverdueSLA` per report
Threshold	Any report exceeds its SLA deadline
Evaluation period	Any single datapoint
Severity	P1
Auto-mitigation	None
Escalation	P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified immediately — they own the regulator relationship.
Notes	SLA deadlines are set per report in the reporting module configuration. Late submission to a regulator (RBNZ, APRA, AUSTRAC, FIU) is a notifiable breach under the applicable legislation.

ALERT-COMP-004 — KYC gate bypass attempt [REG]

Field	Value
Metric source	`bank/kyc/security` custom namespace, metric `GateBypassAttempts`
Threshold	Any detected bypass attempt (> 0)
Evaluation period	Any single datapoint
Severity	P1
Auto-mitigation	The bypass attempt is rejected by the gate. The alert is raised for investigation regardless.
Escalation	P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer and security team notified.
Notes	A bypass attempt means something or someone attempted to open an account or perform a KYC-gated action without completing identity verification. This could be a misconfigured client, a code regression, or a deliberate circumvention attempt. Treat as a security incident until proven otherwise.

Security alerts

ALERT-SEC-001 — Brute-force authentication indicator

Field	Value
Metric source	`AWS/WAF` namespace, metric `BlockedRequests` per rule `RateLimit-AuthEndpoint`, or custom `bank/auth/security` metric `FailedAuthPerIP`
Threshold	> 10 failed authentication attempts per minute from a single IP
Evaluation period	Over 1 minute
Severity	P1
Auto-mitigation	WAF rate-limit rule blocks the IP after threshold breach. Cognito account lockout applies per-user.
Escalation	P1: on-call engineer + Head of Technology paged within 15 minutes. Security team notified.
Notes	10 failures per minute per IP is a conservative threshold that filters out legitimate retry loops. Investigate whether the IP is a known proxy or Tor exit node. Consider blocking at the WAF level if the pattern persists.
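
One way the custom `FailedAuthPerIP` check could work is a sliding one-minute window per source IP. The class below is a hypothetical in-memory sketch, assuming failures arrive in non-decreasing timestamp order; it is not a description of the WAF rate-limit rule, which AWS evaluates itself.

```python
from collections import defaultdict, deque


class FailedAuthTracker:
    """Sliding-window counter of failed authentication attempts per IP."""

    def __init__(self, limit: int = 10, window_seconds: float = 60.0) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self._failures: dict[str, deque[float]] = defaultdict(deque)

    def record_failure(self, ip: str, timestamp: float) -> bool:
        """Record one failure; return True if the IP now exceeds the limit."""
        events = self._failures[ip]
        events.append(timestamp)
        # Drop attempts that have fallen out of the sliding window.
        while events and timestamp - events[0] >= self.window_seconds:
            events.popleft()
        return len(events) > self.limit
```

With the default limit of 10, the eleventh failure from the same IP inside 60 seconds returns `True`; failures from other IPs are counted separately.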

ALERT-SEC-002 — Cognito admin API call outside change window

Field	Value
Metric source	CloudTrail `cognito-idp.amazonaws.com` events, `eventName` in [AdminCreateUser, AdminDeleteUser, AdminSetUserPassword, AdminUpdateUserPool, UpdateUserPool], evaluated against approved change window schedule
Threshold	Any qualifying API call outside the approved change window
Evaluation period	Any single event
Severity	P2
Auto-mitigation	None
Escalation	P2: on-call engineer paged within 1 hour. Security team notified.
Notes	Cognito admin changes outside a change window may indicate an unauthorised modification to the authentication configuration. Changes during a change window should be correlated against the approved change ticket.
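
The evaluation against the change window schedule can be sketched as a filter over CloudTrail records. The event names are those listed in the metric source row; representing the approved schedule as a list of (start, end) UTC pairs is an assumption, since this specification does not define where or how the schedule is stored.

```python
from datetime import datetime, timezone

WATCHED_EVENTS = {
    "AdminCreateUser",
    "AdminDeleteUser",
    "AdminSetUserPassword",
    "AdminUpdateUserPool",
    "UpdateUserPool",
}


def is_outside_change_window(
    event: dict, windows: list[tuple[datetime, datetime]]
) -> bool:
    """True if a watched Cognito admin call happened outside every window."""
    if event.get("eventSource") != "cognito-idp.amazonaws.com":
        return False
    if event.get("eventName") not in WATCHED_EVENTS:
        return False
    # CloudTrail eventTime is ISO 8601 in UTC, e.g. "2024-03-01T02:15:00Z".
    event_time = datetime.strptime(
        event["eventTime"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return not any(start <= event_time < end for start, end in windows)
```

An empty window list means every watched call is flagged, which matches the "any qualifying API call" threshold when no change window is approved.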

ALERT-SEC-003 — IAM role assumption from unexpected principal

Field	Value
Metric source	CloudTrail `sts.amazonaws.com` events, `eventName=AssumeRole`, cross-referenced against approved principals list
Threshold	Any `AssumeRole` event where the calling principal is not in the approved list
Evaluation period	Any single event
Severity	P1
Auto-mitigation	None. STS does not block the call — this alert is detective, not preventive.
Escalation	P1: on-call engineer + Head of Technology paged within 15 minutes. Security team notified. Treat as a potential credential compromise until proven otherwise.
Notes	Maintain the approved principals list in the MOD-076 configuration. Review the list during every deployment to ensure decommissioned services are removed.
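
The cross-reference against the approved principals list might look like the following. The sketch assumes the list holds caller ARNs and that the caller is read from the CloudTrail record's `userIdentity.arn` field; records without an ARN (for example, calls made by an AWS service) are flagged rather than skipped, which is a deliberately cautious assumption.

```python
def is_unexpected_assume_role(event: dict, approved_principals: set[str]) -> bool:
    """True if an AssumeRole call came from a principal not on the approved list."""
    if event.get("eventSource") != "sts.amazonaws.com":
        return False
    if event.get("eventName") != "AssumeRole":
        return False
    caller_arn = event.get("userIdentity", {}).get("arn")
    # A missing ARN is treated as unexpected, so unknown callers are not ignored.
    return caller_arn not in approved_principals
```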

ALERT-SEC-004 — CloudTrail logging gap [REG]

Field	Value
Metric source	CloudTrail metric filter on trail delivery, or `AWS/CloudTrail` metric `EventsDeliveredToS3`
Threshold	No CloudTrail events delivered for > 15 minutes
Evaluation period	Over 15 minutes
Severity	P1
Auto-mitigation	None
Escalation	P1: on-call engineer + Head of Technology paged within 15 minutes. Chief Compliance Officer notified.
Notes	A CloudTrail gap means API activity during that window is not auditable. This is a compliance and forensic integrity issue. RBNZ and APRA both require continuous audit trail availability. Investigate whether the trail is disabled, the S3 bucket is full, or there is a KMS key issue preventing log encryption.
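
Gap detection amounts to finding any stretch longer than 15 minutes with no delivery, including the stretch between the last delivery and the current time. A minimal sketch, assuming a list of delivery timestamps is available (how MOD-076 obtains them is not specified here):

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=15)


def delivery_gaps(
    delivery_times: list[datetime], now: datetime
) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs where no delivery occurred for more than MAX_GAP."""
    # Appending `now` lets an ongoing gap after the last delivery be detected.
    points = sorted(delivery_times) + [now]
    return [(a, b) for a, b in zip(points, points[1:]) if b - a > MAX_GAP]
```

Each returned pair marks a window in which API activity may not be auditable. With no delivery timestamps at all the function returns an empty list, so a trail that has never delivered would need a separate check.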

Alert code reference

Code	Name	Severity
ALERT-LATENCY-001	Customer-facing API p99 latency	P2
ALERT-LATENCY-002	Internal service call p99 latency	P3
ALERT-LATENCY-003	Payment processing p99 latency	P2
ALERT-LATENCY-004	Core ledger posting p99 latency	P2
ALERT-ERROR-001	Lambda error rate	P2
ALERT-ERROR-002	Payment failure rate [REG]	P1
ALERT-ERROR-003	KYC verification service error rate	P2
ALERT-ERROR-004	Authentication error rate	P2
ALERT-INFRA-001	EventBridge DLQ depth	P2
ALERT-INFRA-002	Neon connection pool utilisation	P2
ALERT-INFRA-003	S3 Glacier retrieval failure	P3
ALERT-INFRA-004	Lambda concurrency approaching limit	P2
ALERT-FIN-001	Balance reconciliation discrepancy [REG]	P1
ALERT-FIN-002	AML engine not receiving posting events [REG]	P1
ALERT-FIN-003	CDC pipeline lag	P2
ALERT-COMP-001	Sanctions list stale [REG]	P1
ALERT-COMP-002	AML alert queue backlog	P2
ALERT-COMP-003	Regulatory report not submitted within SLA [REG]	P1
ALERT-COMP-004	KYC gate bypass attempt [REG]	P1
ALERT-SEC-001	Brute-force authentication indicator	P1
ALERT-SEC-002	Cognito admin API call outside change window	P2
ALERT-SEC-003	IAM role assumption from unexpected principal	P1
ALERT-SEC-004	CloudTrail logging gap [REG]	P1

[REG] = Regulatory implication. Chief Compliance Officer notified at alert time. Regulator contact initiated if unresolved after 1 hour.