# Backing Service Resilience

## Resilience Matrix
When a backing service goes down, this is the impact on each layer:
| Service Down | L1 (APISIX) | L2 (AI Gateway) | L3 (Agent Gateway) | CP API (Portals) |
|---|---|---|---|---|
| Redis | OPEN — no rate limiting | DEGRADE — no cache, no budget | FAIL — sessions dead | DEGRADE — tenant ops fail |
| ClickHouse | OK — logging silently fails | OK — logging silently fails | OK — audit silently fails | DEGRADE — analytics empty |
| Keycloak | OK — cached routes work | OK — JWKS cache + master key | OK — no auth dependency | DEGRADE — new JWT fails |
| etcd | OK — in-memory route cache | OK — no dependency | OK — no dependency | OK — no dependency |
| Qdrant | OK — no dependency | DEGRADE — no semantic cache | OK — no dependency | OK — no dependency |
| OTel Collector | OK — async non-blocking | OK — async non-blocking | OK — async non-blocking | OK — no dependency |
| Redis + ClickHouse | OPEN — no rate limit, no log | DEGRADE — cache/budget/log | FAIL | FAIL |
## Service-by-Service Analysis
### 1. Redis

**Role:** Rate limits, cache, sessions, budgets, policies, tenant state, HITL queue

**Failure behavior:**

- **L1:** Fails open — the rate-limit plugin logs `FAILING OPEN` and sets the `X-RateLimit-Status: degraded` header
- **L2:** Cache miss (OK), budget check skipped (OK), health reports `degraded`
- **L3:** Session CRUD fails → 500 errors
- **CP API:** Tenant operations fail → portal errors
**Production hardening:** See the Redis Resilience Guide
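The fail-open pattern above can be sketched as follows. This is an illustrative Rust sketch, not the actual plugin code: `redis_incr` is a stub standing in for a real Redis `INCR`, and the function names are hypothetical.

```rust
use std::collections::HashMap;

// Stub for a Redis INCR call; `available: false` simulates an outage.
fn redis_incr(available: bool, counters: &mut HashMap<String, u64>, key: &str) -> Result<u64, String> {
    if !available {
        return Err("connection refused".to_string());
    }
    let c = counters.entry(key.to_string()).or_insert(0);
    *c += 1;
    Ok(*c)
}

/// Returns (allowed, extra response headers).
fn check_rate_limit(
    redis_up: bool,
    counters: &mut HashMap<String, u64>,
    tenant: &str,
    limit: u64,
) -> (bool, Vec<(String, String)>) {
    match redis_incr(redis_up, counters, tenant) {
        Ok(count) => (count <= limit, Vec::new()),
        Err(e) => {
            // Fail open: never block traffic because the limiter is down.
            eprintln!("rate limiter FAILING OPEN: {e}");
            (true, vec![("X-RateLimit-Status".to_string(), "degraded".to_string())])
        }
    }
}
```

The key design choice: an unavailable limiter degrades visibility (the `degraded` header), never availability.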
### 2. ClickHouse

**Role:** Request logging, AI usage logging, agent audit trail, analytics queries

**Failure behavior:**

- **L1** (`clickhouse-logger` Lua plugin): Batches entries in memory and silently drops them on ClickHouse failure. Zero request impact.
- **L2** (`logging::log_to_clickhouse`): Uses `tokio::spawn` fire-and-forget. The error is logged; the request is unaffected.
- **L3** (`audit::log_audit`): Same fire-and-forget pattern. The error is logged; tool calls are unaffected.
- **CP API analytics:** Returns empty data (`{"data": []}`) with a warning log. Does NOT return 500.
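The fire-and-forget pattern used by L2 and L3 can be sketched like this. Plain `std::thread` stands in for `tokio::spawn` so the example is self-contained, and `insert_row` is a hypothetical stub for the ClickHouse insert:

```rust
use std::thread;

// Stub for the ClickHouse insert; `clickhouse_up: false` simulates an outage.
fn insert_row(clickhouse_up: bool, _row: &str) -> Result<(), String> {
    if clickhouse_up {
        Ok(())
    } else {
        Err("ClickHouse unreachable".to_string())
    }
}

fn log_to_clickhouse(clickhouse_up: bool, row: String) -> thread::JoinHandle<()> {
    // Spawn and return immediately: the request path never waits on logging.
    thread::spawn(move || {
        if let Err(e) = insert_row(clickhouse_up, &row) {
            // The error is logged and the row is dropped: zero request
            // impact, at the cost of a data-loss window during the outage.
            eprintln!("log entry dropped: {e}");
        }
    })
}
```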
**Data loss window:** The duration of the outage. Logs produced during the outage are lost (not queued).

**Production hardening:**
- ClickHouse Keeper (replicated MergeTree) for HA
- MergeTree replication across 2-3 nodes
- Buffer engine flushes on recovery (already configured)
- Consider Kafka/NATS buffer in front of ClickHouse for guaranteed delivery
### 3. Keycloak

**Role:** JWT token issuance, SSO login flows, OIDC/SAML federation, user management

**Failure behavior:**

- **L1** (APISIX `jwt-auth` / `openid-connect`): Validates JWT signatures using cached JWKS keys. Short outages (<5 min) are transparent; during longer outages, new tokens can't be validated.
- **L2:** The JWKS cache (5-min TTL) covers short outages. Master key auth always works. APISIX-proxied requests (with the `X-Tenant-ID` header) always work.
- **L3:** No direct Keycloak dependency (uses headers from L1).
- **CP API:** New JWT introspection fails → 401 on portal API calls. Master key still works.
- **SSO login:** Completely blocked (login page unreachable).
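The stale-cache behavior above can be sketched as follows; `JwksCache` and its fields are illustrative stand-ins, not the gateway's actual types:

```rust
use std::time::{Duration, Instant};

// Illustrative JWKS cache: serialized key set plus a fetch timestamp.
struct JwksCache {
    keys: Option<String>, // stand-in for the parsed JWKS key set
    fetched_at: Instant,
    ttl: Duration,
}

impl JwksCache {
    fn get_keys(&mut self, keycloak_up: bool, now: Instant) -> Option<&str> {
        let expired = now.duration_since(self.fetched_at) > self.ttl;
        if expired && keycloak_up {
            // Normal refresh path when Keycloak is reachable.
            self.keys = Some("fresh-keys".to_string());
            self.fetched_at = now;
        }
        // If Keycloak is down, serve the stale copy rather than failing:
        // validation of existing tokens continues through the outage.
        self.keys.as_deref()
    }
}
```

Serving the stale copy trades key-rotation freshness for availability: only keys rotated during the outage would be missed.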
**Production hardening:**
- Keycloak clustered mode (2+ instances with Infinispan cache)
- Database-backed sessions (PostgreSQL)
- Increase JWKS cache TTL to 15-30 minutes
- Pre-warm JWKS cache on startup
- External IdP federation: if Keycloak is down but cached tokens are valid, users stay logged in
### 4. etcd

**Role:** APISIX route/plugin/SSL configuration store

**Failure behavior:**

- **L1 (APISIX):** Routes are cached in memory, so existing routes continue to work. New route changes won't propagate until etcd recovers.
- **Admin API:** Returns 500/503 — cannot create/update/delete routes.
- **All other layers:** No direct etcd dependency.
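A minimal sketch of this split, assuming a hypothetical `Gateway` type (not APISIX internals): data-plane reads come from the in-memory cache, while Admin API writes require etcd.

```rust
use std::collections::HashMap;

// Illustrative gateway state: route cache plus etcd reachability.
struct Gateway {
    route_cache: HashMap<String, String>, // path -> upstream
    etcd_up: bool,
}

impl Gateway {
    // Data plane: served entirely from the cache; etcd is never consulted,
    // so existing routes survive an etcd outage.
    fn route(&self, path: &str) -> Option<&String> {
        self.route_cache.get(path)
    }

    // Admin API: must persist to etcd, so it fails during an outage.
    fn add_route(&mut self, path: &str, upstream: &str) -> Result<(), String> {
        if !self.etcd_up {
            return Err("503: etcd unavailable".to_string());
        }
        self.route_cache.insert(path.to_string(), upstream.to_string());
        Ok(())
    }
}
```

This is also why Scenario C below is the highest-risk case: a restart empties the in-memory cache, and with etcd down there is nothing to repopulate it from.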
**Production hardening:**

- etcd cluster (3 or 5 nodes, odd number for quorum)
- Already configured: `--data-dir` on a persistent volume
- etcd snapshot backups (see Disaster Recovery)
- etcd compaction: automatic with `--auto-compaction-retention=1`
### 5. Qdrant

**Role:** Vector similarity search for the semantic cache (L2 only)

**Failure behavior:**

- **L2:** `semantic_cache::search()` returns `None` on Qdrant failure → falls back to the Redis exact-match cache → on a miss, calls the LLM normally. Zero visible impact, just a reduced cache hit rate.
- **All other layers:** No Qdrant dependency.
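The fallback chain can be sketched as below; all three backends are hypothetical stubs, and errors from the cache layers collapse to `None` rather than propagating:

```rust
// Stub for the Qdrant semantic-cache lookup; an outage yields None.
fn qdrant_search(qdrant_up: bool, _prompt: &str) -> Option<String> {
    if qdrant_up { Some("semantic hit".to_string()) } else { None }
}

// Stub for the Redis exact-match cache lookup.
fn redis_get(redis_up: bool, _prompt: &str) -> Option<String> {
    if redis_up { Some("exact hit".to_string()) } else { None }
}

// Stub for the upstream LLM call, which always answers.
fn call_llm(_prompt: &str) -> String {
    "llm answer".to_string()
}

// Each cache layer degrades to the next; the chain never fails outright.
fn answer(qdrant_up: bool, redis_up: bool, prompt: &str) -> String {
    qdrant_search(qdrant_up, prompt)
        .or_else(|| redis_get(redis_up, prompt))
        .unwrap_or_else(|| call_llm(prompt))
}
```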
**Production hardening:**

- Qdrant cluster mode for HA (3+ nodes with replication factor 2)
- Qdrant is fully optional — set `SEMANTIC_CACHE_ENABLED=false` to disable it
- Data is cache-only — a full rebuild from scratch is acceptable
### 6. OTel Collector

**Role:** Distributed trace collection and forwarding to Jaeger

**Failure behavior:**

- **All layers:** The OTel SDK uses async batch export; failed exports are silently dropped. Zero request impact.
- **Loss:** Traces during the outage are not collected.
**Production hardening:**

- The OTel Collector has a built-in retry and queue
- Deploy 2+ collectors behind a load balancer
- Persistent queue: `exporters.otlp.sending_queue.storage: file_storage`
## Cascading Failure Scenarios

### Scenario A: Redis + ClickHouse down simultaneously

**Most likely cause:** Shared storage failure, network partition, host crash
| Service | Behavior |
|---|---|
| L1 | Requests pass (fail open), no rate limits, no logging |
| L2 | LLM calls work but: no cache, no budget enforcement, no logging |
| L3 | Dead — all session operations fail |
| CP API | Dead — tenant/analytics operations fail |
**Recovery:** Redis restart is the highest priority (it restores L3 and the L2 cache). ClickHouse can wait.
### Scenario B: Keycloak + Redis down simultaneously

**Impact:** No auth validation AND no state
- All authenticated requests fail
- Master key is the only working auth path
- Emergency: use master key to perform critical operations
### Scenario C: etcd down, then APISIX restart

**Impact:** APISIX cannot load routes on startup → all L1 traffic fails
- This is the highest-risk scenario
- Mitigation: etcd cluster (3 nodes), never restart APISIX during etcd outage
- Recovery: restore etcd from snapshot, then restart APISIX
## Chaos Testing

Run the full chaos suite:

```bash
# All backing services (recommended)
./scripts/chaos-all-services.sh

# Redis-specific deep test
./scripts/chaos-redis.sh
```

## Production Checklist
- [ ] Redis Sentinel (3 nodes) or Redis Cluster
- [ ] Redis AOF persistence enabled (`appendonly yes`)
- [ ] etcd cluster (3 nodes)
- [ ] Keycloak clustered (2+ nodes)
- [ ] ClickHouse replication (2+ nodes)
- [ ] Qdrant replication (optional, cache-only)
- [ ] OTel Collector redundancy (2+ nodes)
- [ ] Redis Exporter + alerting on `redis_up`, memory, and connections
- [ ] ClickHouse monitoring on `system.metrics`
- [ ] Keycloak health endpoint monitored
- [ ] etcd health endpoint monitored (`/health`)
- [ ] DR runbook tested quarterly
- [ ] Chaos tests run monthly