DR runbook
Disaster recovery procedures for the Totara Bank platform. This runbook covers five failure scenarios ordered by frequency and impact. Each scenario includes indicators, severity, step-by-step recovery actions, and expected recovery times.
Resolves gap: GAP-D05 — No DR runbook. Complementary policy: OPS-002 (Disaster Recovery Policy).
Infrastructure baseline: All workloads run in AWS ap-southeast-2 (Sydney). Compute is stateless Lambda functions (all system domains). Stateful data lives in Neon Postgres (six databases), Snowflake (SD06/SD07), and Cognito (two user pools). Secrets are in AWS Secrets Manager. See backup and recovery for backup schedules and PITR procedures.
RTO and RPO targets
| Service tier | Examples | RTO | RPO |
|---|---|---|---|
| Critical banking services | Core ledger (SD01), payments processing (SD04), KYC gate (SD02) | ≤ 4 hours | ≤ 1 hour |
| Regulatory reporting pipelines | AUSTRAC reporting (SD03), APRA/RBNZ reporting (SD07) | ≤ 24 hours | ≤ 4 hours |
| Analytics and risk platform | Snowflake (SD06, SD07) | ≤ 48 hours | ≤ 24 hours |
| Internal tooling and non-customer-facing services | Staff portals, batch reconciliation | ≤ 24 hours | ≤ 24 hours |
These targets are derived from NFR thresholds. The RTO and RPO values above are the maximum acceptable figures — aim to beat them.
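As an illustration, the targets in the table can be encoded as data so that the figures recorded at incident close are checked mechanically rather than by eye. The sketch below is an assumption about how that might look: the tier keys and the `breaches` helper are hypothetical names, not part of the platform.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable data-loss window


# Values copied from the table above; the dictionary keys are hypothetical labels.
TARGETS = {
    "critical_banking": RecoveryTarget(timedelta(hours=4), timedelta(hours=1)),
    "regulatory_reporting": RecoveryTarget(timedelta(hours=24), timedelta(hours=4)),
    "analytics_risk": RecoveryTarget(timedelta(hours=48), timedelta(hours=24)),
    "internal_tooling": RecoveryTarget(timedelta(hours=24), timedelta(hours=24)),
}


def breaches(tier: str, rto_achieved: timedelta, rpo_achieved: timedelta) -> list:
    """Return which targets ("RTO", "RPO") were exceeded for the given tier."""
    target = TARGETS[tier]
    exceeded = []
    if rto_achieved > target.rto:
        exceeded.append("RTO")
    if rpo_achieved > target.rpo:
        exceeded.append("RPO")
    return exceeded
```

Meeting a target exactly does not count as a breach, which matches the "≤" in the table.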
Pre-execution checklist
Before beginning any recovery procedure:
- Declare the incident. Open an incident in the incident management system. Record the trigger time.
- Assign roles: incident commander (IC), technical lead (TL), and communications lead. The IC has the final call on decisions.
- Open the war room. Use the designated incident Slack channel and video bridge.
- Do not guess — read the indicators. Each scenario below lists what to check first.
- Log every action. Timestamp every step in the incident log. This is required for regulator notification.
Scenario 1 — Lambda function failure (partial service degradation)
Severity: Medium. Affects one module or system domain. Other services continue normally.
Indicators:
- CloudWatch alarm on Lambda error rate (threshold: > 5% error rate over 5 minutes for any function).
- Customer-facing errors on a specific feature (e.g. payments failing, KYC checks timing out).
- X-Ray traces showing consistent failure on a specific invocation path.
- No Neon or Cognito health-check failures.
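The alarm threshold in the first indicator can be read as the ratio of the Lambda `Errors` and `Invocations` metrics summed over the window. The following is a minimal sketch of that arithmetic, assuming per-minute sums are already to hand. The function names are hypothetical, and the authoritative alarm definition is the one on the alert thresholds page.

```python
def error_rate(errors, invocations):
    """Error rate over a window, given per-minute Errors and Invocations sums."""
    total_invocations = sum(invocations)
    if total_invocations == 0:
        return 0.0  # no traffic in the window, so there is nothing to alarm on
    return sum(errors) / total_invocations


def alarm_should_fire(errors, invocations, threshold=0.05):
    """True when the windowed error rate is strictly above the threshold."""
    return error_rate(errors, invocations) > threshold
```

Summing before dividing weights each minute by its traffic, so a single quiet minute with one failed call does not dominate the result.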
Resolution path:
Lambda functions are stateless and deployed across multiple AZs by default. A single AZ failure does not affect service. Sustained errors are almost always a code or configuration issue.
1. Identify the failing function. Open CloudWatch Logs Insights, select the log group for the function named in the alarm, and query for recent errors.
2. Check the error pattern. Common root causes:
   - `SecretNotFound` / `AccessDeniedException` → secret missing or IAM permission changed. See Scenario 4.
   - `Connection refused` / `Connection timed out` → database connectivity issue. Check Neon status.
   - `Task timed out` → Lambda timeout too short for current load, or downstream dependency slow.
   - Application exception with stack trace → code bug in latest deployment.
3. For a code bug: roll back to the last known-good artefact via the pipeline. Do not attempt to fix and redeploy under pressure; roll back first, then investigate.
   - In GitHub Actions: re-run the last successful production deployment workflow.
   - Alternatively:

     ```bash
     aws lambda update-function-code --function-name {name} --s3-bucket {artefact-bucket} --s3-key {last-good-artefact-key}
     ```
4. For a configuration issue: check SSM Parameter Store and Secrets Manager for recent changes. If a parameter was changed, revert to the previous version.
5. For a transient dependency failure: Lambda retries asynchronous invocations automatically (twice by default). For synchronous invocations (API Gateway → Lambda), the client receives an error and should retry. If the dependency (Neon, third-party API) recovers within minutes, no action is needed.
6. Verify recovery. Confirm the CloudWatch alarm clears. Run the smoke test for the affected module.
7. Close the incident. Record RTO achieved, root cause, and any follow-up action items.
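The error-pattern list in the steps above amounts to a lookup from log text to likely root cause. A rough sketch of that triage is shown here, assuming log messages have already been exported as plain strings. `ERROR_PATTERNS`, `classify`, and `summarise` are hypothetical names, and substring matching is a simplification.

```python
from collections import Counter

# Substring -> likely root cause, taken from the error-pattern list above.
ERROR_PATTERNS = [
    ("SecretNotFound", "secret missing or IAM permission changed (see Scenario 4)"),
    ("AccessDeniedException", "secret missing or IAM permission changed (see Scenario 4)"),
    ("Connection refused", "database connectivity issue (check Neon status)"),
    ("Connection timed out", "database connectivity issue (check Neon status)"),
    ("Task timed out", "Lambda timeout too short, or slow downstream dependency"),
]

UNCLASSIFIED = "unclassified: inspect the stack trace for an application exception"


def classify(message):
    """Return the likely root cause for one log message."""
    for needle, cause in ERROR_PATTERNS:
        if needle in message:
            return cause
    return UNCLASSIFIED


def summarise(messages):
    """Count messages per root cause, most frequent first."""
    return Counter(classify(m) for m in messages).most_common()
```

A summary dominated by a single cause points at the matching step; a spread across causes suggests looking for a shared upstream problem first.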
Scenario 2 — Neon database failure
Severity: High. Affects all modules in the system domain whose database is unavailable.
Indicators:
- Lambda functions returning Connection pool exhausted or Error connecting to database.
- Health check endpoint for the affected system domain returning 503.
- Neon dashboard at console.neon.tech showing an incident or degraded status.
- PgBouncer connection timeouts visible in CloudWatch Logs.
Neon's built-in resilience:
Neon automatically promotes a replica to primary within approximately 30 seconds of a primary failure. For most transient failures, no manual intervention is needed — the connection pool will reconnect automatically after the brief failover window.
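Riding out the failover window depends on callers retrying for slightly longer than the roughly 30 seconds quoted above. The sketch below shows one way to do that with exponential backoff. It is an assumption, not the platform's actual pooling code: `connect` is a hypothetical callable, and real database drivers raise their own exception types rather than Python's built-in `ConnectionError`.

```python
import time


def connect_with_retry(connect, attempts=6, base_delay=1.0, max_delay=16.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait between attempts."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

With the defaults the waits are 1, 2, 4, 8, and 16 seconds, 31 seconds in total, which just covers the quoted failover time. Inside a Lambda handler the function timeout must also allow for this.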
Recovery steps for sustained or data-related failures:
1. Check Neon status first. Open the Neon console and check the project status for the affected database, then check the Neon status page at neonstatus.com. If Neon is reporting an infrastructure incident, the recovery is Neon's responsibility: follow their incident updates and do not attempt to bypass.
2. If data corruption is suspected (wrong records, constraint violations, unexpected data loss):

   a. Identify the last clean point in time. Connect to the database with a read-only admin connection and query:

      ```sql
      SELECT installed_rank, version, description, installed_on
      FROM flyway_schema_history
      ORDER BY installed_on DESC
      LIMIT 10;
      ```

   b. Create a recovery branch in Neon. Via the Neon API:

      ```bash
      curl -X POST "https://console.neon.tech/api/v2/projects/{project_id}/branches" \
        -H "Authorization: Bearer {neon_api_key}" \
        -H "Content-Type: application/json" \
        -d '{
          "endpoints": [{"type": "read_write"}],
          "branch": {
            "parent_id": "{prod_branch_id}",
            "parent_timestamp": "2026-04-18T14:00:00Z"
          }
        }'
      ```

      Set `parent_id` to the ID of the production branch (Neon branch IDs take the form `br-...`, not the branch's display name) and `parent_timestamp` to the last clean point identified in step a. The timestamp shown is an example.

   c. Verify the recovery branch. Connect to the new branch and verify that record counts and key data points match the expected state. Do not proceed until this is confirmed.

   d. Update Secrets Manager with the new connection string. The recovery branch has a new connection string (different hostname). Update the relevant secret:

      ```bash
      aws secretsmanager put-secret-value \
        --secret-id /bank/prod/{domain}/neon-db-url \
        --secret-string "{new connection string}"
      ```

   e. Force Lambda cold starts. Lambda caches the secret value. Force a refresh by deploying a trivial config update (e.g. update an environment variable) to all Lambda functions in the affected domain. This triggers a cold start and picks up the new connection string.

   f. Run the reconciliation report. Execute the daily reconciliation job manually for the affected domain. Verify the output matches expected balances and record counts. Do not declare recovery complete until reconciliation passes.
3. For a full Neon project failure (project unreachable, not just primary failover):
   - Follow the backup and recovery PITR procedure.
   - Raise a Neon support ticket in parallel with any self-service recovery attempt.
4. Verify and close. Confirm the system domain health check returns 200. Run smoke tests. Record RTO and RPO achieved in the incident log.
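The "Force Lambda cold starts" step can be scripted. One detail matters when doing so: updating a function's environment replaces the whole variable map, so the existing variables have to be read first and written back along with the change. The sketch below assumes a boto3 Lambda client is passed in. The marker variable `SECRETS_REFRESHED_AT` and the helper name are hypothetical.

```python
from datetime import datetime, timezone


def force_cold_start(lambda_client, function_name, marker_key="SECRETS_REFRESHED_AT"):
    """Bump a marker env var so Lambda replaces the function's execution environments."""
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    # The update replaces the whole variable map, so start from the current values.
    variables = dict(config.get("Environment", {}).get("Variables", {}))
    variables[marker_key] = datetime.now(timezone.utc).isoformat()
    lambda_client.update_function_configuration(
        FunctionName=function_name,
        Environment={"Variables": variables},
    )
    return variables
```

In use, the client would come from something like `boto3.client("lambda", region_name="ap-southeast-2")`, and the helper would be called once per function in the affected domain.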
Scenario 3 — AWS region failure (full region outage)
Severity: Critical. Complete service unavailability for all customers and all domains.
Indicators:
- All health check endpoints return errors or time out.
- AWS Service Health Dashboard shows an ap-southeast-2 (Sydney) incident.
- CloudWatch metrics stop updating entirely.
- Neither Scenario 1 nor Scenario 2 explains the scope, and their procedures bring no partial recovery.
Current architecture limitation:
The platform is single-region (ap-southeast-2 only). A full AWS Sydney region outage means complete unavailability. There is no automatic failover to another region. This is a known and accepted risk — multi-region active-active is on the architecture roadmap but not yet implemented.
RTO for a full region outage: up to 24 hours, which exceeds the ≤ 4 hour target for critical banking services in the table above. This reflects the manual effort required to rebuild the stack in an alternate region.
Recovery path:
1. Confirm it is a region failure, not a multi-system configuration issue. Check:
   - AWS Service Health Dashboard: health.aws.amazon.com
   - Neon status page
   - Internal monitoring in a different region (if any canaries exist in ap-southeast-4)
   - Do not begin region migration until a region failure is confirmed; a premature migration splits traffic and worsens recovery.
2. Activate the alternate region (ap-southeast-4 / Melbourne).
   - The AWS account must already exist with base networking (VPC, subnets, security groups) provisioned via IaC. If it does not, provision it now from the IaC repository baseline.
   - Ensure the alternate-region AWS account has all required service quotas (Lambda, Secrets Manager, Cognito).
3. Restore Neon databases. Neon's primary infrastructure is in AWS regions but operates as a managed service. Check whether Neon is also affected:
   - If Neon is available (Neon may use redundant infrastructure): create new endpoints in the existing Neon project pointing to the recovery region's VPC.
   - If Neon is also unavailable: restore from the Iceberg/S3 cross-region snapshot. See backup and recovery → Neon cross-region restore procedure.
4. Re-deploy IaC (SST stacks) to ap-southeast-4.
   - Pull the latest release artefact from the artefact registry (artefacts are stored in S3, which has cross-region replication configured).
   - Run the SST deployment targeting ap-southeast-4.
   - Deploy in the order defined in the deployment sequence: infrastructure → data layer → platform services → domain services → app layer.
5. Restore Secrets Manager values in ap-southeast-4.
   - Secrets Manager is regional. Secrets do not automatically replicate to ap-southeast-4.
   - Provision all secrets from the secrets manifest in the new region.
   - For Neon connection strings: new endpoints in the alternate region will have new hostnames, so provision the updated URLs.
   - For third-party API keys: these are region-agnostic, so use the same key values as prod ap-southeast-2.
6. Update DNS records.
   - Update Route 53 (or the equivalent DNS provider) to point the platform's API domains to the new region's API Gateway endpoints.
   - TTL should have been set to a low value (60s) during normal operation to enable fast failover. Confirm this is in place; if TTL is high, DNS propagation will extend the RTO.
7. Run the smoke test suite against ap-southeast-4. All critical paths must pass before directing customer traffic.
8. Notify regulators. See the regulatory notification section below.
9. Declare partial recovery. Update the incident log with recovery time. Continue monitoring.
10. Plan the return to ap-southeast-2. Once ap-southeast-2 is restored, plan a controlled migration back. Do not rush: running stably in ap-southeast-4 is preferable to a rushed, untested migration back.
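The deployment sequence in the re-deploy step is strictly ordered, and a failed layer should halt everything after it. The sketch below shows that control flow only. `deploy_layer` is a hypothetical callable standing in for whatever actually runs the SST deployment for one layer, and the layer names are copied from the sequence above.

```python
DEPLOYMENT_SEQUENCE = [
    "infrastructure",
    "data layer",
    "platform services",
    "domain services",
    "app layer",
]


def deploy_in_order(deploy_layer, sequence=DEPLOYMENT_SEQUENCE):
    """Deploy each layer in turn and stop at the first failure.

    deploy_layer(layer) is assumed to return True on success.
    Returns (completed_layers, failed_layer), where failed_layer is None
    if every layer deployed.
    """
    completed = []
    for layer in sequence:
        if not deploy_layer(layer):
            return completed, layer
        completed.append(layer)
    return completed, None
```

Returning the completed layers alongside the failure gives the technical lead an exact resume point for the incident log.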
Scenario 4 — Secrets Manager secrets corrupted or deleted
Severity: High. All Lambda functions in the affected domain fail to start.
Indicators:
- Widespread SecretNotFound or ResourceNotFoundException errors across multiple Lambda functions.
- CloudWatch Logs show: Secrets Manager can't find the specified secret.
- All functions in one or more system domains returning 500 errors simultaneously.
- No Neon or network issues visible.
Recovery steps:
1. Identify the scope. Which secrets are missing? Run:

   ```bash
   aws secretsmanager list-secrets \
     --filters Key=name,Values=/bank/prod/ \
     --query 'SecretList[].Name' \
     --output text
   ```

   Compare the output against the secrets manifest. Note any missing paths.
2. Attempt restore from deletion (within the 7-day recovery window). AWS Secrets Manager keeps a deleted secret for the recovery window set at deletion (7 to 30 days; this runbook assumes the 7-day minimum) unless the deletion was forced without recovery. Restore each missing secret with the Secrets Manager `restore-secret` operation. Do this for all missing secrets, then verify that each one is restored.
3. If beyond the 7-day recovery window (permanent deletion), secrets must be re-provisioned from scratch:
   - For Neon connection strings: retrieve from the Neon console.
   - For third-party API keys (eIDV, sanctions feed, payment rails, etc.): contact each third-party provider to re-issue keys. This may take hours to days depending on the provider.
   - For government-issued credentials (AUSTRAC, FIU NZ, APRA, RBNZ): contact the relevant regulator. Lead times can be 2–4 weeks, so prioritise these immediately and consider temporary suspension of affected reporting obligations, notifying regulators of the delay.
   - Provision all re-issued secrets per the secrets manifest procedures.
4. Force Lambda cold starts. Once secrets are restored, force cold starts on all affected Lambda functions to pick up the restored values (see Scenario 2, step 2e).
5. Investigate root cause. Accidental deletion of Secrets Manager secrets should trigger an IAM investigation. Check CloudTrail for `DeleteSecret` events.
6. Remediation. Review and tighten the IAM policy for `secretsmanager:DeleteSecret`. Consider adding an SCP (Service Control Policy) at the AWS organisation level that denies deletion of secrets matching `/bank/prod/*` except by a designated break-glass role.
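The first two recovery steps, comparing against the manifest and restoring what is still within its recovery window, can be sketched as follows. This assumes the manifest is available as a plain list of secret names and that a boto3 Secrets Manager client is passed in. Both helper names are hypothetical.

```python
def missing_secrets(manifest, present):
    """Names listed in the manifest that Secrets Manager did not return."""
    return sorted(set(manifest) - set(present))


def restore_pending_deletion(sm_client, names):
    """Cancel scheduled deletion for each name and report what could not be restored."""
    restored, failed = [], {}
    for name in names:
        try:
            sm_client.restore_secret(SecretId=name)
            restored.append(name)
        except Exception as exc:  # boto3 raises ClientError, e.g. once permanently deleted
            failed[name] = str(exc)
    return restored, failed
```

Anything left in `failed` is a candidate for the re-provisioning path, and the government-issued credentials in that set should be escalated first because of their lead times.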
Scenario 5 — Cognito user pool corruption or failure
Severity: Critical. All customers unable to authenticate.
Indicators:
- All customer login attempts failing with UserNotFoundException, UserPoolDoesNotExistException, or similar Cognito errors.
- Staff unable to log in to internal tools.
- No Neon or Lambda issues visible — the authentication layer is the failure point.
- CloudWatch metrics for the Cognito user pool show zero successful authentications.
Architecture note:
Cognito has no built-in point-in-time recovery or backup. The planned mitigation is a nightly Lambda export of user attributes to S3 (not passwords, because Cognito does not allow password export). This export Lambda is still a pending action item. Until it is implemented, the only Cognito recovery path is re-creating the user pool from IaC and requiring all customers to reset their passwords.
Recovery steps:
1. Confirm the failure scope. Determine whether:
   - The user pool exists but is degraded (check the AWS Cognito console and Service Health Dashboard).
   - The user pool has been deleted (check CloudTrail for `DeleteUserPool` events).
   - A configuration change has broken authentication (check recent AppConfig or IaC changes).
2. For a configuration change that broke authentication:
   - Revert the IaC change and redeploy.
   - Check AppConfig for any changes to the Cognito configuration profile and roll back if needed.
3. For user pool corruption or deletion:

   a. Check for the S3 user attribute export. Look in s3://bank-exports-prod/cognito/{pool-id}/ for the most recent export file. If the export exists, note the export timestamp (this is the RPO).

   b. Re-create the user pool from IaC. The IaC stack will create a new user pool with the same configuration (password policy, MFA settings, app clients, triggers). The new pool will have a different pool ID.

   c. Update SSM Parameter Store with the new pool ID:

      ```bash
      aws ssm put-parameter \
        --name /bank/prod/cognito/customer-pool-id \
        --value {new-pool-id} \
        --overwrite
      ```

   d. Re-import user attributes from the S3 export (if the export exists). Use the Cognito admin import job feature:

      ```bash
      aws cognito-idp create-user-import-job \
        --user-pool-id {new-pool-id} \
        --job-name cognito-recovery-{date} \
        --cloud-watch-logs-role-arn {role-arn}
      ```

      This command only creates the job. Upload the CSV built from the export to the pre-signed URL in the response, then start the job with `aws cognito-idp start-user-import-job`.

   e. Require password reset for all users. User passwords cannot be exported or imported. Every customer must set a new password:
      - Send password-reset emails to all users immediately.
      - Update the customer communications plan; this is an all-customer impact event.
      - Temporarily enable the "Allow users to sign in using their email and a temporary password" flow.
4. Notify customers. Draft a customer notification for the all-customer authentication outage. This requires the communications lead and legal review. Do not send without approval.
5. Post-incident action. Implement the Cognito nightly export Lambda immediately after recovery. This is the highest-severity single point of failure in the current architecture with no automated mitigation. Assign to the next sprint.
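The re-import sub-step is a three-part flow in the Cognito API: create the job, upload the CSV to the pre-signed URL that comes back, then start the job. The sketch below assumes a boto3 `cognito-idp` client and takes the upload as an injected callable. `run_user_import` and `upload` are hypothetical names. AWS's documented example performs the upload as an HTTP PUT with an `x-amz-server-side-encryption: aws:kms` header, which should be confirmed against the current Cognito documentation before use.

```python
def run_user_import(cognito, pool_id, job_name, logs_role_arn, csv_bytes, upload):
    """Create an import job, upload the CSV, then start the job. Returns the job ID."""
    job = cognito.create_user_import_job(
        JobName=job_name,
        UserPoolId=pool_id,
        CloudWatchLogsRoleArn=logs_role_arn,
    )["UserImportJob"]
    # Creating the job imports nothing: the CSV must reach the pre-signed URL first.
    upload(job["PreSignedUrl"], csv_bytes)
    cognito.start_user_import_job(UserPoolId=pool_id, JobId=job["JobId"])
    return job["JobId"]
```

The CSV itself has to use the column header that Cognito expects for the new pool, so the attribute export needs converting to that layout before this runs.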
Regulatory notification requirements
Any outage affecting customer access to accounts or payment services that lasts more than 1 hour is a "material outage" requiring regulatory notification.
| Regulator | Jurisdiction | Notification deadline | Contact |
|---|---|---|---|
| Reserve Bank of New Zealand (RBNZ) | NZ | Within 24 hours of identifying a material outage | RBNZ notification portal / dedicated relationship manager |
| Australian Prudential Regulation Authority (APRA) | AU | Within 24 hours | APRA Connect portal — CPS 234 notification |
| Australian Securities and Investments Commission (ASIC) | AU | If a financial services obligation is breached | ASIC breach reporting portal |
The incident commander is responsible for triggering regulator notification. Legal and compliance must be involved in drafting any regulatory notification. Do not file a notification without sign-off from the Chief Compliance Officer.
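The two time rules in this section (an outage becomes material after more than 1 hour, and the 24-hour regulators must be notified within 24 hours) reduce to simple date arithmetic. The sketch below assumes the 24-hour clock starts when the material outage is identified, which the table states for RBNZ and which is assumed here for APRA. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MATERIAL_OUTAGE_THRESHOLD = timedelta(hours=1)
NOTIFICATION_WINDOW = timedelta(hours=24)


def is_material_outage(outage_start, now):
    """True once the outage has lasted more than the 1-hour threshold."""
    return now - outage_start > MATERIAL_OUTAGE_THRESHOLD


def notification_deadline(identified_at):
    """Latest filing time for the regulators with a 24-hour deadline."""
    return identified_at + NOTIFICATION_WINDOW
```

Using timezone-aware timestamps throughout avoids ambiguity between Sydney and New Zealand local times in the incident log.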
Runbook execution log
Every DR event must produce an incident record with the following fields. Create this in the incident management system immediately on declaration and complete it at close.
| Field | Description |
|---|---|
| Incident ID | Auto-assigned by incident management system |
| Scenario | Which scenario (1–5) or "other" |
| Trigger time | When the failure was first detected (not when the alarm fired) |
| Declaration time | When IC declared a DR event |
| Recovery start | When recovery actions began |
| Recovery end | When service was restored to normal operation |
| RTO achieved | Actual elapsed time from trigger to recovery |
| RPO achieved | Data loss window (if any) |
| Root cause | One paragraph: what failed and why |
| Contributing factors | What made it worse or prolonged it |
| Action items | Numbered list; each item has an owner and due date |
| Regulatory notification | Whether notified, which regulator, when |
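For teams that mirror the record in code, the fields above map naturally onto a small data class, with "RTO achieved" derived rather than typed in. This is a sketch under assumptions: it models only a subset of the fields (contributing factors, action items, and regulatory notification are omitted for brevity), and the class and method names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class IncidentRecord:
    incident_id: str
    scenario: str                 # "1" to "5", or "other"
    trigger_time: datetime        # when the failure was first detected
    declaration_time: datetime    # when the IC declared a DR event
    recovery_start: Optional[datetime] = None
    recovery_end: Optional[datetime] = None
    rpo_achieved: Optional[timedelta] = None
    root_cause: str = ""

    @property
    def rto_achieved(self) -> Optional[timedelta]:
        """Elapsed time from trigger to recovery, or None while still open."""
        if self.recovery_end is None:
            return None
        return self.recovery_end - self.trigger_time

    def missing_at_close(self) -> list:
        """Fields that still need a value before the incident can be closed."""
        required = ("recovery_start", "recovery_end", "rpo_achieved", "root_cause")
        return [name for name in required if getattr(self, name) in (None, "")]
```

An RPO of zero (`timedelta(0)`) counts as a recorded value, so an incident with no data loss still closes cleanly.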
Related pages
- Backup and recovery — backup schedules, PITR procedures, Neon branch management
- Secrets manifest — all secrets required in Secrets Manager
- Provisioning playbook — full environment build procedures
- Alert thresholds — CloudWatch alarm definitions referenced in scenario indicators
- OPS-002 — Disaster Recovery Policy (governance and ownership)
- ADR-031 — Observability tooling (OpenTelemetry, CloudWatch, X-Ray)