
Disaster Recovery Runbook

Recovery Objectives

| Metric | Target | Notes |
| --- | --- | --- |
| RTO (Recovery Time Objective) | < 30 minutes | Measured from the decision to restore until the service is operational |
| RPO (Recovery Point Objective) | < 5 minutes (AOF) / < 1 hour (RDB) | Depends on the Redis persistence configuration |
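
One way to keep an eye on the RPO target is to check how old the newest backup file is. The sketch below is illustrative only: the function name `backup_within_rpo` and its arguments are hypothetical and not part of the shipped scripts, and it assumes backups are regular files somewhere under the given directory.

```bash
# Sketch only: backup_within_rpo is a hypothetical helper, not a shipped script.
#
# backup_within_rpo DIR MAX_AGE_MINUTES
# Exit 0 if DIR contains at least one regular file modified less than
# MAX_AGE_MINUTES minutes ago, exit 1 if not, exit 2 if DIR does not exist.
backup_within_rpo() {
  local dir="$1" max_age_min="$2"
  if [ ! -d "$dir" ]; then
    return 2
  fi
  # find prints every file newer than the limit; grep -q . succeeds
  # as soon as at least one line is printed.
  find "$dir" -type f -mmin "-${max_age_min}" | grep -q .
}
```

For example, `backup_within_rpo ./backups 60` would succeed only if some backup file was written within the last hour, which matches the RDB target above.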

Backup Schedule

| Service | Data | Frequency | Retention | Script |
| --- | --- | --- | --- | --- |
| etcd | APISIX routes, plugins, SSL certs | Every 6 hours | 7 days | scripts/backup-etcd.sh |
| ClickHouse | Request logs, AI usage, audit trail | Daily | 90 days | scripts/backup-clickhouse.sh |
| Redis | Rate limits, sessions, budgets, policies | Every 1 hour | 24 hours | scripts/backup-redis.sh |
| Keycloak | Realms, users, clients, roles | Daily | 30 days | Keycloak admin export |
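
The retention column could be enforced with a small pruning helper like the one sketched here. This is an assumption about how retention might be applied, not a description of what the backup scripts do; `prune_backups` and the example path in the usage note are hypothetical.

```bash
# Sketch only: prune_backups is a hypothetical helper, not a shipped script.
#
# prune_backups DIR DAYS
# Deletes regular files under DIR whose modification time is more than
# DAYS full days in the past, then removes directories left empty.
prune_backups() {
  local dir="$1" days="$2"
  if [ ! -d "$dir" ]; then
    return 0
  fi
  # -mtime +N ignores fractional days, so +7 matches files at least 8 days old.
  find "$dir" -type f -mtime "+${days}" -exec rm -f {} +
  # Clean up timestamped directories that no longer hold any files.
  find "$dir" -mindepth 1 -type d -empty -delete
}
```

Called as `prune_backups ./backups/etcd 7`, this would keep roughly a week of etcd snapshots, assuming that path is where they are stored.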

Backup Procedures

Full Backup (All Services)

bash
# Create timestamped backup directory
BACKUP_DIR="./backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Run all backups
./scripts/backup-etcd.sh "$BACKUP_DIR/etcd"
./scripts/backup-clickhouse.sh "$BACKUP_DIR/clickhouse"
./scripts/backup-redis.sh "$BACKUP_DIR/redis"

# Keycloak realm export
docker exec gw-keycloak /opt/keycloak/bin/kc.sh export \
  --dir /tmp/keycloak-export --realm gateway 2>/dev/null
docker cp gw-keycloak:/tmp/keycloak-export "$BACKUP_DIR/keycloak"
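
After a full backup it can help to confirm that every service actually produced output, since a failed export may otherwise go unnoticed. The following is a minimal sketch under the assumption that each service writes into its own subdirectory of the backup directory, as in the commands above; `verify_backup` is a hypothetical name.

```bash
# Sketch only: verify_backup is a hypothetical helper, not a shipped script.
#
# verify_backup BACKUP_DIR
# Prints one status line per service and returns non-zero if any service
# subdirectory is missing or holds no non-empty file.
verify_backup() {
  local backup_dir="$1" status=0 service
  for service in etcd clickhouse redis keycloak; do
    if [ -d "$backup_dir/$service" ] &&
       find "$backup_dir/$service" -type f -size +0 | grep -q .; then
      echo "OK      $service"
    else
      echo "MISSING $service"
      status=1
    fi
  done
  return "$status"
}
```

Running `verify_backup "$BACKUP_DIR"` right after the full backup gives a quick pass or fail signal before the backup is relied on.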

Individual Service Backup

bash
# etcd only
./scripts/backup-etcd.sh

# ClickHouse only
./scripts/backup-clickhouse.sh

# Redis only
./scripts/backup-redis.sh

Restore Procedures

etcd Restore

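bash
# Stop APISIX during the restore (it reconnects to etcd once started again)
docker stop apisix

# Copy the snapshot into the container. SNAPSHOT is a placeholder for the
# file produced by scripts/backup-etcd.sh; set it before running this.
docker cp "$SNAPSHOT" etcd:/tmp/etcd-snapshot.db

# Restore etcd from snapshot into a new data directory
docker exec etcd etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --data-dir=/etcd-data-restored

# Restart etcd with restored data. etcd only serves the restored data if its
# data directory points at /etcd-data-restored (or the restored directory is
# swapped into the configured location) before this restart.
docker restart etcd

# Restart APISIX
docker start apisix

# Verify routes
curl http://localhost:9180/apisix/admin/routes \
  -H "X-API-KEY: $APISIX_ADMIN_KEY"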
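
Services may need a few seconds after a restart before they answer, so the verification call can fail if it runs immediately. A small retry wrapper, sketched below, is one way to handle that; `wait_until` is a hypothetical helper and the timeout is counted in one-second attempts rather than measured precisely.

```bash
# Sketch only: wait_until is a hypothetical helper, not a shipped script.
#
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND once per second until it succeeds. Returns 1 if it is
# still failing after roughly TIMEOUT_SECONDS seconds.
wait_until() {
  local timeout="$1" waited=0
  shift
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For instance, `wait_until 60 curl -sf http://localhost:9180/apisix/admin/routes -H "X-API-KEY: $APISIX_ADMIN_KEY"` would keep polling the admin API for about a minute before giving up.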

ClickHouse Restore

bash
# Restore schema (if needed)
cat backups/clickhouse/schema-*.sql | \
  curl -s -X POST "http://localhost:8123/" --data-binary @-

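# Restore data from CSV
# File names are assumed to follow <database>_<table>-<timestamp>.csv.gz, so
# only the first underscore is converted into the database/table separator.
for file in backups/clickhouse/*_raw-*.csv.gz; do
  TABLE=$(basename "$file" | sed 's/-[0-9].*//; s/_/./')
  zcat "$file" | curl -s -X POST \
    "http://localhost:8123/?query=INSERT+INTO+${TABLE}+FORMAT+CSVWithNames" \
    --data-binary @-
done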

Redis Restore

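bash
# Stop Redis
docker stop gw-redis

# Replace dump.rdb with the newest backup (docker cp accepts a single source,
# so a glob that matches several dumps would fail)
LATEST=$(ls -t backups/redis/redis-dump-*.rdb | head -n 1)
docker cp "$LATEST" gw-redis:/data/dump.rdb

# Start Redis (loads dump.rdb on startup; if AOF is enabled, Redis loads the
# AOF instead, so restore or disable it first)
docker start gw-redis

# Verify
docker exec gw-redis redis-cli -a "$REDIS_PASSWORD" DBSIZE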

Keycloak Restore

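bash
# Copy the exported realm files into the container
docker cp backups/keycloak/. gw-keycloak:/tmp/keycloak-export

# Import realm (docker exec requires a running container, so do not stop
# Keycloak before this step)
docker exec gw-keycloak /opt/keycloak/bin/kc.sh import \
  --dir /tmp/keycloak-export

# Restart Keycloak
docker restart gw-keycloak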

Failure Scenarios

Scenario: Redis Data Loss

Impact: Rate limit state, active sessions, and token budgets are lost. Users experience temporary rate limit resets and session disconnections.

Recovery:

  1. Restore Redis from latest backup
  2. Rate limits self-heal on next request (sliding window recalculates)
  3. Active agent sessions will need to be recreated by users
  4. Token budgets revert to their backup-time values, so usage recorded after the backup is lost and slight over-usage is possible

Scenario: ClickHouse Data Loss

Impact: Analytics dashboards show gaps. Audit trail has gaps. No impact on request processing.

Recovery:

  1. Restore from CSV backup
  2. Materialized views will auto-populate on new data
  3. Document the gap period for compliance audit

Scenario: etcd Data Loss

Impact: APISIX loses all route configuration. All API traffic fails (no routes).

Recovery (CRITICAL — prioritize):

  1. Restore etcd from snapshot
  2. Restart APISIX
  3. If no snapshot is available: re-run scripts/setup-routes.sh to recreate routes
  4. Verify all routes with curl to APISIX admin API
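
The choice between restoring a snapshot and recreating routes can be made explicit with a small helper. The sketch below assumes snapshots are stored as `*.db` files in a single directory, which is a guess based on the snapshot name used in the restore procedure; `choose_etcd_recovery` is a hypothetical name.

```bash
# Sketch only: choose_etcd_recovery is a hypothetical helper, and the
# *.db naming of snapshot files is an assumption.
#
# choose_etcd_recovery SNAPSHOT_DIR
# Prints "snapshot <path>" for the newest *.db file in SNAPSHOT_DIR,
# or "recreate" when no snapshot file exists.
choose_etcd_recovery() {
  local dir="$1" newest
  # ls -t sorts newest first; errors from an unmatched glob are discarded.
  newest=$(ls -t "$dir"/*.db 2>/dev/null | head -n 1)
  if [ -n "$newest" ]; then
    echo "snapshot $newest"
  else
    echo "recreate"
  fi
}
```

An operator script could branch on the first word of the output: restore the printed snapshot path, or fall back to scripts/setup-routes.sh when the result is `recreate`.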

Scenario: Full Cluster Loss

  1. Provision new infrastructure
  2. Deploy Gatez via Helm
  3. Restore etcd (routes)
  4. Restore Keycloak (users/auth)
  5. Restore Redis (state)
  6. Restore ClickHouse (analytics/audit)
  7. Run smoke test
  8. Switch DNS
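
Because the order of these steps matters, a full recovery is easier to run safely if each step is wrapped in a function and execution stops at the first failure. The following is a minimal sketch of that pattern; `run_in_order` and the step names in the usage note are hypothetical.

```bash
# Sketch only: run_in_order is a hypothetical helper, not a shipped script.
#
# run_in_order STEP [STEP...]
# Runs each named step (a shell function or command) in sequence and
# stops at the first one that fails, reporting which step it was.
run_in_order() {
  local step
  for step in "$@"; do
    echo "==> $step"
    if ! "$step"; then
      echo "FAILED at: $step" >&2
      return 1
    fi
  done
  echo "All steps completed"
}
```

With functions such as `restore_etcd`, `restore_keycloak`, `restore_redis`, `restore_clickhouse`, and `smoke_test` wrapping the commands from Restore Procedures, the sequence could be run as `run_in_order restore_etcd restore_keycloak restore_redis restore_clickhouse smoke_test`, leaving the DNS switch as a deliberate manual step.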
