
Backup and recovery

This page specifies backup schedules, retention policies, and point-in-time recovery (PITR) procedures for every stateful data store in the platform. It is the reference document for the DR runbook data-recovery steps.

Resolves gap: GAP-D06 — No database backup and recovery specification.

Stateful data stores in scope:

| Store | System domains | Backup mechanism |
| --- | --- | --- |
| Neon Postgres (6 databases) | SD01, SD02, SD03, SD04, SD05, SD08 | Neon managed: WAL archiving + snapshots |
| S3 (Iceberg, documents, reports) | SD06, SD07 + all domains | S3 versioning + lifecycle |
| Snowflake | SD06, SD07 | Snowflake Time Travel + Fail-safe |
| AWS Secrets Manager | All | 7-day deletion recovery window |
| AWS AppConfig | All | Version history |
| Cognito user pools | All | Manual export (action item; not yet automated) |

Neon Postgres backups

How Neon backs up data

Neon combines continuous WAL (write-ahead log) archiving with periodic base snapshots. Neon manages this entirely, so no configuration or scheduling is required on the platform side. The result is continuous point-in-time recovery (PITR) across a configurable retention window.

Retention: The platform uses a 30-day PITR retention window on Neon's paid plan. This means any point in the last 30 days is recoverable. Verify the retention setting in the Neon console under project settings for each of the six projects.
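The retention check can also be scripted instead of being done by hand in the console. The sketch below is illustrative only. It assumes that the Neon API's project response exposes the retention window as `history_retention_seconds`, that `jq` is installed, and that `NEON_API_KEY` is exported. None of these come from this page, so confirm the field name against the current Neon API reference before relying on it.

```shell
#!/bin/sh
# Sketch: confirm a Neon project's PITR retention covers the 30-day window.
# Assumptions (not from this page): NEON_API_KEY is exported, jq is installed,
# and the project response carries history_retention_seconds.

EXPECTED_RETENTION_SECONDS=$((30 * 24 * 60 * 60))   # 30 days = 2592000 s

# Succeeds when the reported retention (seconds) covers the expected window.
retention_ok() {
  [ "$1" -ge "$EXPECTED_RETENTION_SECONDS" ]
}

# Prints the retention, in seconds, reported for one project ID.
fetch_retention() {
  curl -sS "https://console.neon.tech/api/v2/projects/$1" \
    -H "Authorization: Bearer $NEON_API_KEY" \
    | jq -r '.project.history_retention_seconds'
}

# Checks one project and reports the result; returns non-zero on a shortfall.
check_project() {
  seconds=$(fetch_retention "$1") || return 1
  if retention_ok "$seconds"; then
    echo "$1: OK ($seconds s)"
  else
    echo "$1: retention too short ($seconds s)" >&2
    return 1
  fi
}
```

Run `check_project {project_id}` once for each of the six projects. A non-zero exit for any project means its retention setting needs attention in the console.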

| Database | Neon project name | System domain |
| --- | --- | --- |
| bank_core | bank-core-prod | SD01 |
| bank_kyc | bank-kyc-prod | SD02 |
| bank_aml | bank-aml-prod | SD03 |
| bank_payments | bank-payments-prod | SD04 |
| bank_credit | bank-credit-prod | SD05 |
| bank_app | bank-app-prod | SD08 |

Branch-based protection

The prod branch is the live branch. It is never used as a migration source or test target. The promotion path for schema migrations is:

feature branch → dev branch → uat branch → prod branch

Any schema migration that causes data issues on uat is caught before it reaches prod. Neon's branching model gives instant zero-copy clones: creating a uat branch from prod takes seconds and consumes no additional storage until the branch diverges.

Cross-region snapshot (Neon limitation and mitigation)

Neon does not natively support cross-region replication as of 2026. All of the platform's Neon projects run in AWS ap-southeast-2 (Sydney). If the Sydney region fails, Neon is also unavailable.

Mitigation: A nightly CDC (change data capture) pipeline via MOD-042 exports all financial records from the six Neon databases to S3 bucket bank-iceberg-prod in Apache Iceberg format. This provides a recoverable snapshot for all financial records, lagging by at most 24 hours, that is independent of Neon's availability.

In a cross-region recovery scenario, Iceberg data can be re-imported into a new Neon project in an alternate region via a migration API job. See the cross-region restore procedure below.

Neon PITR procedure

Use this procedure when data corruption is detected and the target recovery point is within the last 30 days.

Step 1: Identify the target recovery timestamp.

Determine the latest "clean" point before the corruption occurred. Use flyway_schema_history, application logs, and CloudWatch traces to narrow down the time.

Step 2: Create a new branch from the target point in time.

Via the Neon console:

- Navigate to the project → Branches → New branch.
- Set parent: prod.
- Enable "Create branch from a point in time" and set the timestamp.
- Create a read-write endpoint on the new branch.

Via the Neon API:

curl -X POST "https://console.neon.tech/api/v2/projects/{project_id}/branches" \
  -H "Authorization: Bearer {neon_api_key}" \
  -H "Content-Type: application/json" \
  -d '{
    "endpoints": [{"type": "read_write"}],
    "branch": {
      "parent_id": "{prod-branch-id}",
      "parent_timestamp": "2026-04-17T23:00:00Z"
    }
  }'

The API returns a new branch ID and endpoint hostname.
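A malformed timestamp is an easy mistake to make during an incident. The following sketch builds the request body from arguments and rejects timestamps that are not in the `YYYY-MM-DDTHH:MM:SSZ` shape used in the example above. The function names are hypothetical, and the check validates shape only, not calendar correctness.

```shell
#!/bin/sh
# Sketch: build the branch-creation payload for a given recovery timestamp.
# Assumption: timestamps are written in UTC as YYYY-MM-DDTHH:MM:SSZ.

# Succeeds only for strings shaped like 2026-04-17T23:00:00Z.
# This is a shape check; it does not reject impossible dates such as month 19.
is_utc_timestamp() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[01][0-9]-[0-3][0-9]T[0-2][0-9]:[0-5][0-9]:[0-5][0-9]Z) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the JSON body for POST /projects/{project_id}/branches.
# $1 = parent (prod) branch ID, $2 = recovery timestamp.
branch_payload() {
  is_utc_timestamp "$2" || { echo "bad timestamp: $2" >&2; return 1; }
  printf '{"endpoints":[{"type":"read_write"}],"branch":{"parent_id":"%s","parent_timestamp":"%s"}}\n' "$1" "$2"
}
```

The output can be piped into the curl call shown above by replacing the inline `-d '{...}'` body with `-d @-`, which makes curl read the body from standard input.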

Step 3: Verify the recovery branch.

Connect to the new branch with a read-only client. Verify:

- Row counts in critical tables match expected values.
- The most recent transaction timestamps look correct.
- Any specific records known to be corrupted are in their correct pre-corruption state.

Do not proceed to production cutover without completing this verification.

Step 4: Promote the recovery branch to production.

Option A (rename; preferred for minimal downtime):

1. Rename the current prod branch to prod-corrupt-{date} (preserve it temporarily).
2. Rename the recovery branch to prod.
3. The existing connection strings (PgBouncer endpoints) should now resolve to the branch named prod. Verify which branch the endpoint hostname actually serves, and update Secrets Manager if the hostname changed.

Option B (update connection strings):

1. The new branch has a new endpoint hostname.
2. Update /bank/prod/{domain}/neon-db-url in Secrets Manager with the new connection string.
3. Force Lambda cold starts (see DR runbook Scenario 2).
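Step 2 of Option B could be scripted along the following lines. This is a sketch under stated assumptions: the `{domain}` value `core` used in the usage note is hypothetical, the secret is assumed to hold the connection string as plain text, and the `DRY_RUN` switch is an illustrative convention, not something the platform defines.

```shell
#!/bin/sh
# Sketch of Option B, step 2. Assumptions: the {domain} path segment is passed
# in by the operator, and the secret stores the connection string as plain text.

# Prints the Secrets Manager ID that holds one domain's Neon connection string.
neon_secret_id() {
  printf '/bank/prod/%s/neon-db-url\n' "$1"
}

# Writes a new connection string for one domain.
# $1 = domain, $2 = new connection string.
# With DRY_RUN=1 the target is printed and nothing is changed.
update_neon_url() {
  secret_id=$(neon_secret_id "$1")
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would update $secret_id"
    return 0
  fi
  aws secretsmanager put-secret-value \
    --secret-id "$secret_id" \
    --secret-string "$2"
}
```

For example, setting `DRY_RUN=1` and calling `update_neon_url core "$NEW_URL"` prints the secret that would be written, which is a cheap way to confirm the path before making the real change.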

Step 5: Run reconciliation.

Execute the daily reconciliation job manually. Verify output against expected balances and record counts. Confirm with finance or operations before declaring recovery complete.

Step 6: Clean up.

Delete the prod-corrupt-{date} branch after 7 days (once recovery is confirmed stable) to reclaim storage.

Cross-region restore procedure

Use this only in a confirmed AWS ap-southeast-2 full-region outage.

  1. Check whether Neon is available despite the AWS regional outage. An outage that affects the platform's workloads may not affect Neon's own services to the same degree. If Neon's console is reachable and projects are healthy, create new endpoints targeting the alternate region's network and update connection strings in Secrets Manager (ap-southeast-4).

  2. If Neon is also unavailable, restore from the Iceberg snapshots in S3:

     a. The cross-region S3 bucket is bank-iceberg-recovery in ap-southeast-4 (populated by S3 cross-region replication from bank-iceberg-prod).
     b. Create a new Neon project in an alternate region (or use a self-managed Postgres if Neon cannot be used at all).
     c. Run the Iceberg import job, which reads the Parquet files and re-inserts records via the migration API.
     d. Apply all Flyway migrations on top of the imported data to ensure schema consistency.
     e. Update all connection strings in Secrets Manager (ap-southeast-4).

  3. RPO implication: Iceberg export runs nightly. In the worst case (the region fails just before the nightly export runs), up to 24 hours of transactions since the last Iceberg snapshot may not be in S3. Any transactions in that window must be reconstructed from: (a) payment rail confirmations (NPP, BPAY), (b) customer records, and (c) Snowflake event streams if they were written before the failure.
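The RPO implication above reduces to simple arithmetic on two timestamps. As a minimal sketch, the helpers below take the last successful export time and the failure time as Unix epoch seconds. How the last export time is obtained (for example from S3 object timestamps) is left out, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: size the window of transactions missing from the last Iceberg export.
# $1 = last successful export (epoch seconds), $2 = failure time (epoch seconds).

# Prints the gap in whole hours, rounded down.
export_lag_hours() {
  echo $(( ($2 - $1) / 3600 ))
}

# Succeeds while the gap is within the 24-hour worst case stated above.
within_rpo() {
  [ $(( $2 - $1 )) -le 86400 ]
}
```

During an incident, `export_lag_hours "$last_export" "$(date -u +%s)"` gives a quick estimate of how many hours of transactions need to be reconstructed from the other sources.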


S3 data backups

bank-iceberg-prod

The primary Iceberg data lake. All financial records from Neon are exported here nightly via MOD-042.

| Setting | Value |
| --- | --- |
| Versioning | Enabled |
| Lifecycle: transition to Glacier | After 90 days |
| Lifecycle: delete | After 7 years (regulatory minimum) |
| Cross-region replication | Replicate to bank-iceberg-recovery in ap-southeast-4 |
| Encryption | KMS CMK bank/financial |

Recovery: S3 object versioning allows retrieval of any version of any Parquet file. Use the S3 console or:

aws s3api list-object-versions \
  --bucket bank-iceberg-prod \
  --prefix {table-prefix}/
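Once the wanted version ID is known, one common way to restore it is to copy that version over the current object, which makes it the newest version without deleting anything. The sketch below assumes the bucket's default encryption settings apply to the copy and that the key contains no characters needing URL encoding. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: restore an earlier version of one object by copying it over itself.
# Assumptions: the version ID comes from the list-object-versions output, the
# key needs no URL encoding, and bucket default encryption covers the copy.

# Prints the --copy-source value for a specific object version.
# $1 = bucket, $2 = key, $3 = version ID.
copy_source() {
  printf '%s/%s?versionId=%s\n' "$1" "$2" "$3"
}

# Copies the chosen version on top of the current object.
# The copy becomes the newest version; older versions are left in place.
restore_version() {
  aws s3api copy-object \
    --bucket "$1" \
    --key "$2" \
    --copy-source "$(copy_source "$1" "$2" "$3")"
}
```

Because versioning is enabled, the restore is itself reversible: the version that was current before the copy remains retrievable by its own version ID.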

bank-documents-prod

Customer document store — identity documents, statements, signed agreements. Uploaded via KYC and app flows.

| Setting | Value |
| --- | --- |
| Versioning | Enabled |
| Encryption | KMS CMK bank/pii |
| Cross-region replication | Replicate to bank-documents-recovery in ap-southeast-4 (configure if not already active) |
| Lifecycle: delete | Per document retention policy; regulatory documents 7 years minimum |

Recovery: Version retrieval via S3 API. For mass recovery, use S3 Batch Operations to restore a specific version across all objects in a prefix.

bank-reports-prod

Regulatory report outputs — RBNZ statistical returns, APRA reporting, AUSTRAC TTR/SMR submissions.

| Setting | Value |
| --- | --- |
| Versioning | Enabled |
| Encryption | KMS CMK bank/operational |
| Lifecycle: delete | 7-year minimum (regulatory requirement) |
| Cross-region replication | Recommended; configure replication to ap-southeast-4 |

Recovery: Version retrieval. Regulatory reports are also retained by the regulators themselves — in a worst case, reports can be re-requested from RBNZ/APRA if the S3 copies are lost.


Snowflake backups

Time Travel

Snowflake Time Travel allows any query to be run against historical data up to 90 days in the past (Enterprise tier setting, configured on all Snowflake databases used by SD06 and SD07).

Table-level recovery:

CREATE OR REPLACE TABLE {schema}.{table}_recovered
  CLONE {schema}.{table}
  AT (TIMESTAMP => '{recovery-timestamp}'::TIMESTAMP_TZ);

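Verify the clone contains the expected data, then swap. After the swap, {schema}.{table}_recovered holds the pre-recovery (corrupted) data, so drop it only once it is no longer needed for investigation: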

ALTER TABLE {schema}.{table} SWAP WITH {schema}.{table}_recovered;
DROP TABLE {schema}.{table}_recovered;

Database-level recovery:

CREATE OR REPLACE DATABASE {database}_recovered
  CLONE {database}
  AT (TIMESTAMP => '{recovery-timestamp}'::TIMESTAMP_TZ);

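After verification, rename the databases (ALTER DATABASE ... RENAME TO) so that {database}_recovered takes over the original name.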

Fail-safe

Snowflake Fail-safe provides an additional 7 days of non-self-service recovery beyond the Time Travel window. Fail-safe recovery requires a Snowflake support ticket — it is not accessible via SQL or the Snowflake web UI.

Total Snowflake recovery window: 90 days Time Travel + 7 days Fail-safe = up to 97 days.

Snowflake storage location

Snowflake data for the platform resides in Snowflake's Australia region, which uses AWS ap-southeast-2 as the underlying cloud infrastructure. A full Sydney region outage may affect Snowflake availability. Snowflake's own infrastructure resilience applies — check the Snowflake Status page at status.snowflake.com during any suspected region incident.


AWS Secrets Manager

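Secrets are not "backed up" in the traditional sense; they are configuration values that can be re-provisioned from their sources. However, AWS Secrets Manager holds a deleted secret for a recovery window before destroying it permanently. The platform relies on a 7-day window. Secrets Manager accepts 7 to 30 days when a deletion is scheduled, and a secret deleted with --force-delete-without-recovery cannot be restored.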

Restore a deleted secret (within 7-day window):

aws secretsmanager restore-secret \
  --secret-id /bank/prod/{domain}/{secret-name}

Rotation behaviour: Secret rotation does not delete the old version. The old version is retained with the version stage AWSPREVIOUS until the next rotation cycle. You can retrieve the previous version if a rotation introduced a bad credential:

aws secretsmanager get-secret-value \
  --secret-id /bank/prod/{domain}/{secret-name} \
  --version-stage AWSPREVIOUS

Re-promote the previous version to AWSCURRENT if needed (see DR runbook rotation failure response).
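For orientation, re-promotion is done with `update-secret-version-stage`, which moves the AWSCURRENT label from one version ID to another. The sketch below only assembles that call and is not a substitute for the DR runbook procedure. The `DRY_RUN` switch and the function name are illustrative assumptions. The two version IDs can be read from `aws secretsmanager list-secret-version-ids`.

```shell
#!/bin/sh
# Sketch: move AWSCURRENT back to the version that currently holds AWSPREVIOUS.
# $1 = secret ID, $2 = previous version ID, $3 = current version ID.
# With DRY_RUN=1 the command is printed instead of executed.

promote_previous() {
  # Rebuild the positional parameters as the full aws command line.
  set -- aws secretsmanager update-secret-version-stage \
    --secret-id "$1" \
    --version-stage AWSCURRENT \
    --move-to-version-id "$2" \
    --remove-from-version-id "$3"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Printing the command first with `DRY_RUN=1` lets a second person review the secret ID and version IDs before the label is actually moved.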

For re-provisioning from scratch (beyond 7-day window), follow the secrets manifest.


AWS AppConfig

AppConfig stores feature flags and runtime configuration. All configuration changes are versioned.

Version history: AppConfig retains all deployed configuration versions indefinitely (there is no automatic purge). Each deployment is a discrete version.

Roll back a configuration deployment:

# List recent deployments
aws appconfig list-deployments \
  --application-id {app-id} \
  --environment-id {env-id}

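# Roll back by redeploying the last known-good configuration version
# (the AWS CLI has no rollback-deployment command)
aws appconfig start-deployment \
  --application-id {app-id} \
  --environment-id {env-id} \
  --configuration-profile-id {profile-id} \
  --deployment-strategy-id {strategy-id} \
  --configuration-version {version}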

Alternatively, via the AWS console: AppConfig → Application → Environment → Deployments → select previous version → Deploy.


Cognito backups

Current state (action item pending)

Cognito does not provide native point-in-time backup or restore. The standard mitigation — a scheduled nightly Lambda export of user attributes to S3 — has not yet been implemented. This is the highest-severity backup gap in the current platform.

Action item: Implement a nightly Cognito export Lambda. Candidate module: extend MOD-076 (observability) or create a dedicated backup module under SD09. The export should:

- Run nightly at 02:00 AEST.
- Call aws cognito-idp list-users --user-pool-id {pool_id} for each pool (bank-customers-{env} and bank-staff-{env}).
- Write the output (user attributes only, no password hashes) to s3://bank-exports-prod/cognito/{pool-id}/{date}.json.
- Apply a 7-year lifecycle rule to the S3 prefix.

Manual export procedure (interim)

Until the automated export is implemented, run this manually before any Cognito configuration change:

aws cognito-idp list-users \
  --user-pool-id {pool-id} \
  --query 'Users[*].{sub: Attributes[?Name==`sub`].Value | [0], email: Attributes[?Name==`email`].Value | [0], given_name: Attributes[?Name==`given_name`].Value | [0], family_name: Attributes[?Name==`family_name`].Value | [0], phone_number: Attributes[?Name==`phone_number`].Value | [0], status: UserStatus, created: UserCreateDate}' \
  --output json > cognito-export-{pool-id}-{date}.json

Upload to s3://bank-exports-prod/cognito/{pool-id}/manual/ immediately.
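The naming and upload step can be wrapped so that an empty or missing export is never uploaded by mistake. This is a sketch that assumes the export file was produced by the list-users command above and that the date is written as YYYY-MM-DD. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: name and upload a manual Cognito export.
# $1 = user pool ID, $2 = export date (assumed format YYYY-MM-DD).

# Prints the local file name for one pool and date.
export_file() {
  printf 'cognito-export-%s-%s.json\n' "$1" "$2"
}

# Prints the S3 destination for a manual export.
export_uri() {
  printf 's3://bank-exports-prod/cognito/%s/manual/%s\n' "$1" "$(export_file "$1" "$2")"
}

# Refuses to upload an empty or missing export, then copies it to S3.
upload_export() {
  file=$(export_file "$1" "$2")
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  aws s3 cp "$file" "$(export_uri "$1" "$2")"
}
```

Calling `upload_export {pool-id} {date}` from the directory that holds the export file performs the upload and fails loudly if the export step did not produce any output.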

Cognito recovery

If a user pool is corrupted or deleted:

  1. Re-create the user pool from IaC (SST stack). The pool configuration (password policy, MFA, app clients, Lambda triggers) is fully defined in IaC.
  2. Import user attributes from the most recent S3 export using a Cognito user import job. Import jobs accept CSV only, so convert the JSON export to the pool's CSV header format first.
  3. All users must reset their passwords, because passwords cannot be exported or re-imported.
  4. See DR runbook Scenario 5 for the full recovery procedure including customer communications.

Recovery testing schedule

Untested backups are not backups. The following tests are mandatory.

| Test | Frequency | Procedure | Pass criteria |
| --- | --- | --- | --- |
| Neon PITR test | Quarterly | Create a new branch from 24 hours prior to test time. Query key tables. Verify row counts match expected. Delete branch. | Row counts within 1% of expected; no errors in branch creation |
| S3 Iceberg export verification | Monthly | Verify that last night's Iceberg export completed (check S3 object timestamps and CloudWatch Logs for MOD-042). Verify row count of the largest financial table in Parquet against expected. | Export completed within 2 hours of start time; row count ≥ previous day |
| Secrets Manager restore test | Quarterly | Create a test secret, delete it, restore it, verify it can be read. | Restore succeeds; secret value intact |
| Snowflake Time Travel test | Quarterly | Clone a non-critical table from 7 days ago. Verify row count. Drop the clone. | Clone created successfully; row count matches 7-day-old snapshot |
| Full DR exercise | Annually | Simulate a Neon database failure (disconnect all Lambda functions from one database, execute the Scenario 2 recovery procedure). Measure actual RTO. | RTO achieved ≤ 4 hours; reconciliation report passes |
| Cognito export test | Monthly (once automated) | Verify export Lambda ran successfully; verify S3 object exists with expected user count. | Export contains ≥ 95% of expected user count |

Test results must be recorded in the incident management system under the "DR Testing" project. Failed tests must generate a remediation action item with an owner and due date.
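Pass criteria such as "row counts within 1% of expected" are easier to apply consistently when they are computed, not estimated by eye. The helper below is a minimal sketch using integer arithmetic only. The function name is hypothetical, and the counts are assumed to be non-negative integers gathered by whatever query the test uses.

```shell
#!/bin/sh
# Sketch: evaluate a "within N% of expected" pass criterion.
# $1 = actual count, $2 = expected count, $3 = tolerance in percent.

within_pct() {
  diff=$(( $1 - $2 ))
  [ "$diff" -lt 0 ] && diff=$(( -diff ))
  # |actual - expected| / expected <= pct / 100, rearranged to avoid division.
  [ $(( diff * 100 )) -le $(( $2 * $3 )) ]
}
```

For the Neon PITR test, `within_pct "$actual" "$expected" 1` succeeds when the recovered branch's row count is within 1% of the expected value, so the exit status can be recorded directly as the test result.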


  • DR runbook — step-by-step recovery for each failure scenario
  • Secrets manifest — secrets re-provisioning reference
  • Provisioning playbook — full environment build for cross-region recovery
  • OPS-002 — Disaster Recovery Policy (governance and ownership of backup testing obligations)