Disaster Recovery Runbook
Recovery Objectives
| Metric | Target | Notes |
|---|---|---|
| RTO (Recovery Time Objective) | < 30 minutes | Measured from the decision to restore until the service is operational |
| RPO (Recovery Point Objective) | < 5 minutes (AOF) / < 1 hour (RDB) | Depends on Redis persistence config |
Backup Schedule
| Service | Data | Frequency | Retention | Script |
|---|---|---|---|---|
| etcd | APISIX routes, plugins, SSL certs | Every 6 hours | 7 days | scripts/backup-etcd.sh |
| ClickHouse | Request logs, AI usage, audit trail | Daily | 90 days | scripts/backup-clickhouse.sh |
| Redis | Rate limits, sessions, budgets, policies | Every 1 hour | 24 hours | scripts/backup-redis.sh |
| Keycloak | Realms, users, clients, roles | Daily | 30 days | Keycloak admin export |
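The schedule above implies a maximum acceptable age for each service's newest backup. A minimal sketch of a freshness check follows, assuming backups land in one directory per service. The names `backup_is_fresh` and `check_schedule` and the directory layout are hypothetical and not part of the repository scripts.

```bash
#!/usr/bin/env bash
# Sketch only: function names and directory layout are illustrative assumptions.

# backup_is_fresh DIR MAX_AGE_MINUTES
# Succeeds if DIR contains at least one file modified within the last MAX_AGE_MINUTES.
backup_is_fresh() {
  local dir="$1" max_age_minutes="$2"
  [ -d "$dir" ] || return 1
  # find -mmin -N matches files modified less than N minutes ago
  [ -n "$(find "$dir" -type f -mmin "-${max_age_minutes}" | head -n 1)" ]
}

# check_schedule ROOT
# Compares each service directory under ROOT against the frequency
# from the schedule table, expressed in minutes.
check_schedule() {
  local root="$1" status=0 entry service limit
  for entry in etcd:360 clickhouse:1440 redis:60 keycloak:1440; do
    service="${entry%%:*}"
    limit="${entry##*:}"
    if backup_is_fresh "$root/$service" "$limit"; then
      echo "OK    $service"
    else
      echo "STALE $service (no backup in the last $limit minutes)"
      status=1
    fi
  done
  return "$status"
}
```

Calling `check_schedule ./backups` from cron or a monitoring probe would return non-zero as soon as any service falls behind its interval; the `./backups` root is taken from the full backup procedure and may differ in practice.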
Backup Procedures
Full Backup (All Services)
bash
# Create timestamped backup directory
BACKUP_DIR="./backups/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR"
# Run all backups
./scripts/backup-etcd.sh "$BACKUP_DIR/etcd"
./scripts/backup-clickhouse.sh "$BACKUP_DIR/clickhouse"
./scripts/backup-redis.sh "$BACKUP_DIR/redis"
# Keycloak realm export
docker exec gw-keycloak /opt/keycloak/bin/kc.sh export \
  --dir /tmp/keycloak-export --realm gateway 2>/dev/null
docker cp gw-keycloak:/tmp/keycloak-export "$BACKUP_DIR/keycloak"
Individual Service Backup
bash
# etcd only
./scripts/backup-etcd.sh
# ClickHouse only
./scripts/backup-clickhouse.sh
# Redis only
./scripts/backup-redis.sh
Restore Procedures
etcd Restore
bash
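# Stop APISIX (it reconnects to etcd automatically once restarted)
docker stop apisix
# Restore etcd from snapshot
# (assumes the snapshot is already at /tmp/etcd-snapshot.db inside the container)
docker exec etcd etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --data-dir=/etcd-data-restored
# Restart etcd with restored data
# (etcd must be configured to use /etcd-data-restored as its data directory,
# otherwise it comes back up with the old data)
docker restart etcd
# Restart APISIX
docker start apisix
# Verify routes
curl http://localhost:9180/apisix/admin/routes \
  -H "X-API-KEY: $APISIX_ADMIN_KEY"
ClickHouse Restore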
bash
# Restore schema (if needed)
cat backups/clickhouse/schema-*.sql | \
  curl -s -X POST "http://localhost:8123/" --data-binary @-
# Restore data from CSV
for file in backups/clickhouse/*_raw-*.csv.gz; do
  # Derive database.table from the file name: strip the date suffix, then turn
  # only the first underscore into a dot (assumes <database>_<table>-<date>.csv.gz)
  TABLE=$(basename "$file" | sed 's/-[0-9].*//; s/_/./')
  zcat "$file" | curl -s -X POST \
    "http://localhost:8123/?query=INSERT+INTO+${TABLE}+FORMAT+CSVWithNames" \
    --data-binary @-
done
Redis Restore
bash
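# Stop Redis
docker stop gw-redis
# Replace dump.rdb with the most recent backup
# (docker cp accepts a single source path, so the glob must resolve to one file)
LATEST_DUMP=$(ls -t backups/redis/redis-dump-*.rdb | head -n 1)
docker cp "$LATEST_DUMP" gw-redis:/data/dump.rdb
# Start Redis (loads dump.rdb on startup; if AOF is enabled, Redis loads the AOF instead)
docker start gw-redis
# Verify
docker exec gw-redis redis-cli -a "$REDIS_PASSWORD" DBSIZE
Keycloak Restore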
bash
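# Copy the exported realm into the container
# (docker exec needs a running container, so Keycloak is not stopped first;
# assumes the export was saved under backups/keycloak)
docker cp backups/keycloak/. gw-keycloak:/tmp/keycloak-export
# Import realm
docker exec gw-keycloak /opt/keycloak/bin/kc.sh import \
  --dir /tmp/keycloak-export
# Restart Keycloak so it serves the imported realm
docker restart gw-keycloak
Failure Scenarios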
Scenario: Redis Data Loss
Impact: Rate limit state, active sessions, and token budgets are lost. Users experience temporary rate limit resets and session disconnections.
Recovery:
- Restore Redis from latest backup
- Rate limits self-heal on next request (sliding window recalculates)
- Active agent sessions will need to be recreated by users
- Token budgets will be at backup-time values (may allow slight over-usage)
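Restoring from the latest backup means picking the newest dump when several are retained. A minimal sketch follows, assuming the `redis-dump-*.rdb` naming used in the restore procedure; `newest_backup` is a hypothetical helper, not one of the repository scripts.

```bash
#!/usr/bin/env bash
# Sketch only: newest_backup is an illustrative helper.

# newest_backup DIR PATTERN
# Prints the most recently modified file in DIR matching the glob PATTERN.
# Fails if nothing matches. Assumes file names without spaces or newlines.
newest_backup() {
  local dir="$1" pattern="$2" newest
  # $pattern is left unquoted on purpose so the shell expands the glob;
  # ls -t sorts by modification time, newest first
  newest=$(ls -t "$dir"/$pattern 2>/dev/null | head -n 1)
  [ -n "$newest" ] || return 1
  printf '%s\n' "$newest"
}
```

A possible use is `dump=$(newest_backup backups/redis 'redis-dump-*.rdb') && docker cp "$dump" gw-redis:/data/dump.rdb`, which refuses to copy anything when no dump is found.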
Scenario: ClickHouse Data Loss
Impact: Analytics dashboards show gaps. Audit trail has gaps. No impact on request processing.
Recovery:
- Restore from CSV backup
- Materialized views will auto-populate on new data
- Document the gap period for compliance audit
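To document the gap period, the start of the gap can be taken from the timestamp of the backup that was restored. A minimal sketch follows, assuming the `%Y%m%d_%H%M%S` directory naming from the full backup procedure; `stamp_to_iso` and `gap_note` are hypothetical helpers, and the timestamps carry whatever time zone the backup host used.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative assumptions.

# stamp_to_iso STAMP
# Converts a backup directory stamp such as 20240101_020000 to 2024-01-01T02:00:00.
stamp_to_iso() {
  local s="$1"
  # Reject anything that is not exactly YYYYMMDD_HHMMSS
  [[ "$s" =~ ^[0-9]{8}_[0-9]{6}$ ]] || return 1
  printf '%s-%s-%sT%s:%s:%s\n' \
    "${s:0:4}" "${s:4:2}" "${s:6:2}" "${s:9:2}" "${s:11:2}" "${s:13:2}"
}

# gap_note BACKUP_STAMP RESUMED_STAMP
# Prints a one-line record of the period with missing analytics and audit data.
gap_note() {
  local start end
  start=$(stamp_to_iso "$1") || return 1
  end=$(stamp_to_iso "$2") || return 1
  printf 'ClickHouse data gap: %s to %s\n' "$start" "$end"
}
```

For example, `gap_note 20240101_020000 "$(date +%Y%m%d_%H%M%S)"` records a gap from the restored backup until the moment ingestion resumed.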
Scenario: etcd Data Loss
Impact: APISIX loses all route configuration. All API traffic fails (no routes).
Recovery (CRITICAL — prioritize):
- Restore etcd from snapshot
- Restart APISIX
- If no snapshot available: re-run scripts/setup-routes.sh to recreate routes
- Verify all routes with curl to APISIX admin API
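Verification can be reduced to a route count that is compared with the expected number. A minimal sketch follows, assuming the admin API response carries a top-level `total` field (the shape used by APISIX 3.x; older releases differ). `route_total` is a hypothetical helper, and a JSON parser such as `jq` would be more robust than pattern matching.

```bash
#!/usr/bin/env bash
# Sketch only: route_total is an illustrative helper and assumes a "total" field.

# route_total
# Reads an admin API JSON response on stdin and prints the first "total" value.
# Exits non-zero if no such field is present.
route_total() {
  grep -oE '"total"[[:space:]]*:[[:space:]]*[0-9]+' | head -n 1 | grep -oE '[0-9]+$'
}
```

It could be fed from the verification request in the restore procedure, for example `curl -s http://localhost:9180/apisix/admin/routes -H "X-API-KEY: $APISIX_ADMIN_KEY" | route_total`; a result of 0 or a non-zero exit suggests the restore did not bring the routes back.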
Scenario: Full Cluster Loss
- Provision new infrastructure
- Deploy Gatez via Helm
- Restore etcd (routes)
- Restore Keycloak (users/auth)
- Restore Redis (state)
- Restore ClickHouse (analytics/audit)
- Run smoke test
- Switch DNS
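The order above brings routing and authentication back before state and analytics. A minimal sketch of a runner follows that executes steps in sequence, stops at the first failure, and prints elapsed time for comparison with the 30-minute RTO. `run_step`, `run_plan`, and the `restore_*` names in the usage note are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: run_step and run_plan are illustrative helpers.

# run_step NAME COMMAND [ARGS...]
# Prints the elapsed time and step name, then runs the command.
run_step() {
  local name="$1"
  shift
  # SECONDS is a bash builtin counting seconds since the shell started
  printf '[+%ss] %s\n' "$SECONDS" "$name"
  if ! "$@"; then
    printf 'FAILED: %s\n' "$name" >&2
    return 1
  fi
}

# run_plan STEP...
# Each STEP is the name of a command or shell function.
# Steps run in the order given; the plan stops at the first failure.
run_plan() {
  local step
  for step in "$@"; do
    run_step "$step" "$step" || return 1
  done
  printf 'Done in %ss (RTO target: 1800s)\n' "$SECONDS"
}
```

With one shell function per step, a call such as `run_plan restore_etcd restore_keycloak restore_redis restore_clickhouse smoke_test` mirrors the list above. Elapsed time is counted from shell start, so the script should be launched at the moment the restore decision is made.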