
Backing Service Resilience

Resilience Matrix

When a backing service goes down, this is the impact on each layer:

| Service Down | L1 (APISIX) | L2 (AI Gateway) | L3 (Agent Gateway) | CP API (Portals) |
|---|---|---|---|---|
| Redis | OPEN — no rate limiting | DEGRADE — no cache, no budget | FAIL — sessions dead | DEGRADE — tenant ops fail |
| ClickHouse | OK — logging silently fails | OK — logging silently fails | OK — audit silently fails | DEGRADE — analytics empty |
| Keycloak | OK — cached JWKS keys work | OK — JWKS cache + master key | OK — no auth dependency | DEGRADE — new JWT fails |
| etcd | OK — in-memory route cache | OK — no dependency | OK — no dependency | OK — no dependency |
| Qdrant | OK — no dependency | DEGRADE — no semantic cache | OK — no dependency | OK — no dependency |
| OTel Collector | OK — async non-blocking | OK — async non-blocking | OK — async non-blocking | OK — no dependency |
| Redis + ClickHouse | OPEN — no rate limit, no log | DEGRADE — cache/budget/log | FAIL | FAIL |

Service-by-Service Analysis

1. Redis

Role: Rate limits, cache, sessions, budgets, policies, tenant state, HITL queue

Failure behavior:

  • L1: Fails open — rate limit plugin logs FAILING OPEN, sets X-RateLimit-Status: degraded header
  • L2: Cache miss (OK), budget check skipped (OK), health reports degraded
  • L3: Session CRUD fails → 500 errors
  • CP API: Tenant operations fail → portal errors
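The L1 fail-open behavior can be sketched as a pure decision function. This is an illustration only (the actual plugin is Lua; the names and types here are invented for the sketch):

```rust
// Sketch of the L1 fail-open decision: a Redis lookup error allows
// the request through and marks it degraded, instead of rejecting.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow { degraded: bool },
    Reject,
}

fn rate_limit_decision(lookup: Result<u32, String>, limit: u32) -> Decision {
    match lookup {
        // Counter available: enforce the limit normally.
        Ok(count) if count > limit => Decision::Reject,
        Ok(_) => Decision::Allow { degraded: false },
        // Redis unreachable: log FAILING OPEN and let the request
        // through, flagging it for the X-RateLimit-Status header.
        Err(e) => {
            eprintln!("rate limit backend unreachable, FAILING OPEN: {e}");
            Decision::Allow { degraded: true }
        }
    }
}
```

The key design choice is that availability wins over enforcement at the edge: a Redis outage briefly disables rate limiting rather than taking down all traffic.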

Production hardening: See Redis Resilience Guide


2. ClickHouse

Role: Request logging, AI usage logging, agent audit trail, analytics queries

Failure behavior:

  • L1 (clickhouse-logger Lua plugin): Batches entries in memory, silently drops on ClickHouse failure. Zero request impact.
  • L2 (logging::log_to_clickhouse): Uses tokio::spawn fire-and-forget. Error logged, request unaffected.
  • L3 (audit::log_audit): Same fire-and-forget pattern. Error logged, tool calls unaffected.
  • CP API analytics: Returns empty data ({"data": []}) with warning log. Does NOT return 500.

Data loss window: Duration of outage. Logs during outage are lost (not queued).
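The fire-and-forget pattern behind these bullets can be sketched as follows. The services use `tokio::spawn`; `std::thread` stands in here, and both function names are illustrative:

```rust
use std::thread;

// Simulated ClickHouse write during an outage.
fn write_to_clickhouse(_entry: &str) -> Result<(), String> {
    Err("connection refused".into())
}

// The response is built without waiting on the log write, so a
// ClickHouse outage can neither fail nor slow the request.
fn handle_request(payload: &str) -> String {
    let entry = payload.to_owned();
    thread::spawn(move || {
        // The error is logged and the entry dropped, never propagated.
        if let Err(e) = write_to_clickhouse(&entry) {
            eprintln!("clickhouse log failed, entry dropped: {e}");
        }
    });
    format!("ok: {payload}")
}
```

This is also exactly where the data loss window comes from: a dropped entry is gone, which is why the hardening list suggests a Kafka/NATS buffer when delivery must be guaranteed.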

Production hardening:

  • ClickHouse Keeper for replication coordination (required by ReplicatedMergeTree)
  • ReplicatedMergeTree tables across 2-3 nodes
  • Buffer engine flushes on recovery (already configured)
  • Consider Kafka/NATS buffer in front of ClickHouse for guaranteed delivery

3. Keycloak

Role: JWT token issuance, SSO login flows, OIDC/SAML federation, user management

Failure behavior:

  • L1 (APISIX jwt-auth / openid-connect): Validates JWT signature using cached JWKS keys. Short outages (<5 min) transparent. Longer outages: new tokens can't be validated.
  • L2: JWKS cache (5-min TTL) covers short outages. Master key auth always works. APISIX-proxied requests (with X-Tenant-ID header) always work.
  • L3: No direct Keycloak dependency (uses headers from L1).
  • CP API: New JWT introspection fails → 401 on portal API calls. Master key still works.
  • SSO login: Completely blocked (login page unreachable).
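The JWKS caching that covers short outages can be sketched like this (illustrative types, not the service's real API):

```rust
use std::time::{Duration, Instant};

// Inside the TTL no Keycloak call is made at all; when a refresh
// fails, the stale keys are kept so signature validation keeps
// working through the outage.
struct JwksCache {
    keys: Vec<String>, // key IDs, standing in for full JWK material
    fetched_at: Instant,
    ttl: Duration,
}

impl JwksCache {
    fn current_keys(&mut self, fetch: impl Fn() -> Result<Vec<String>, String>) -> &[String] {
        if self.fetched_at.elapsed() >= self.ttl {
            match fetch() {
                Ok(keys) => {
                    self.keys = keys;
                    self.fetched_at = Instant::now();
                }
                // Keycloak down: serve stale keys rather than fail auth.
                Err(e) => eprintln!("JWKS refresh failed, serving stale keys: {e}"),
            }
        }
        &self.keys
    }
}
```

Serving stale keys is safe because signing keys rotate slowly; it is also why raising the TTL to 15-30 minutes (as suggested below) directly extends outage tolerance.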

Production hardening:

  • Keycloak clustered mode (2+ instances with Infinispan cache)
  • Database-backed sessions (PostgreSQL)
  • Increase JWKS cache TTL to 15-30 minutes
  • Pre-warm JWKS cache on startup
  • External IdP federation: already-issued tokens stay valid until expiry, so federated users with cached tokens remain logged in while Keycloak is down

4. etcd

Role: APISIX route/plugin/SSL configuration store

Failure behavior:

  • L1 (APISIX): Routes cached in-memory. Existing routes continue to work. New route changes won't propagate until etcd recovers.
  • Admin API: Returns 500/503 — cannot create/update/delete routes.
  • All other layers: No direct etcd dependency.
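The split between the data plane and the Admin API can be sketched as follows (illustrative types, not APISIX internals):

```rust
use std::collections::HashMap;

// During an etcd outage, reads are served entirely from the
// in-memory route cache, while writes need etcd and return 503.
struct Gateway {
    route_cache: HashMap<String, String>, // path -> upstream
    etcd_up: bool,
}

impl Gateway {
    // Data plane: existing routes keep working from the cache.
    fn route(&self, path: &str) -> Option<&String> {
        self.route_cache.get(path)
    }

    // Admin API: must persist to etcd, so it fails during the outage.
    fn upsert_route(&mut self, path: &str, upstream: &str) -> Result<(), u16> {
        if !self.etcd_up {
            return Err(503); // cannot create/update/delete routes
        }
        self.route_cache.insert(path.to_string(), upstream.to_string());
        Ok(())
    }
}
```

This is also why a restart during the outage is dangerous (Scenario C below): the in-memory cache is lost, and there is no etcd to repopulate it from.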

Production hardening:

  • etcd cluster (3 or 5 nodes, odd number for quorum)
  • Persistent volume for --data-dir (already configured)
  • etcd snapshot backups (see Disaster Recovery)
  • etcd compaction: automatic with --auto-compaction-retention=1

5. Qdrant

Role: Vector similarity search for semantic cache (L2 only)

Failure behavior:

  • L2: semantic_cache::search() returns None on Qdrant failure → falls back to Redis exact-match cache → if miss, calls LLM normally. Zero visible impact, just reduced cache hit rate.
  • All other layers: No Qdrant dependency.
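The fallback chain can be sketched as a single function (illustrative signatures; the real call is `semantic_cache::search()`):

```rust
// A Qdrant error is treated exactly like a cache miss, so the
// request falls back to the Redis exact-match cache and then to
// the LLM itself. The only visible effect is a lower hit rate.
fn cached_completion(
    prompt: &str,
    semantic: impl Fn(&str) -> Result<Option<String>, String>, // Qdrant
    exact: impl Fn(&str) -> Option<String>,                    // Redis
    llm: impl Fn(&str) -> String,                              // upstream model
) -> String {
    let semantic_hit = match semantic(prompt) {
        Ok(hit) => hit,
        Err(e) => {
            // Qdrant down: degrade to a miss, never fail the request.
            eprintln!("semantic cache unavailable: {e}");
            None
        }
    };
    semantic_hit
        .or_else(|| exact(prompt))
        .unwrap_or_else(|| llm(prompt))
}
```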

Production hardening:

  • Qdrant cluster mode for HA (3+ nodes with replication factor 2)
  • Qdrant is fully optional — set SEMANTIC_CACHE_ENABLED=false to disable
  • Data is cache only — full rebuild from scratch is acceptable

6. OTel Collector

Role: Distributed trace collection and forwarding to Jaeger

Failure behavior:

  • All layers: OTel SDK uses async batch export. Failed exports are silently dropped. Zero request impact.
  • Loss: Traces during outage are not collected.

Production hardening:

  • OTel Collector has built-in retry and queue
  • Deploy 2+ collectors behind a load balancer
  • Persistent queue: exporters.otlp.sending_queue.storage: file_storage
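A minimal collector configuration with a file-backed sending queue might look like this sketch (the directory and endpoint are illustrative):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # survives collector restarts

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: jaeger:4317
    sending_queue:
      enabled: true
      storage: file_storage             # spool failed exports to disk

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```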

Cascading Failure Scenarios

Scenario A: Redis + ClickHouse down simultaneously

Most likely cause: Shared storage failure, network partition, host crash

| Service | Behavior |
|---|---|
| L1 | Requests pass (fail open), no rate limits, no logging |
| L2 | LLM calls work but: no cache, no budget enforcement, no logging |
| L3 | Dead — all session operations fail |
| CP API | Dead — tenant/analytics operations fail |

Recovery: Redis restart is highest priority (restores L3 + L2 cache). ClickHouse can wait.

Scenario B: Keycloak + Redis down simultaneously

Impact: No auth validation AND no state

  • All authenticated requests fail
  • Master key is the only working auth path
  • Emergency: use master key to perform critical operations

Scenario C: etcd down, then APISIX restart

Impact: APISIX cannot load routes on startup → all L1 traffic fails

  • This is the highest-risk scenario
  • Mitigation: etcd cluster (3 nodes), never restart APISIX during etcd outage
  • Recovery: restore etcd from snapshot, then restart APISIX

Chaos Testing

Run the full chaos suite:

```bash
# All backing services (recommended)
./scripts/chaos-all-services.sh

# Redis-specific deep test
./scripts/chaos-redis.sh
```

Production Checklist

  • [ ] Redis Sentinel (3 nodes) or Redis Cluster
  • [ ] Redis AOF persistence enabled (appendonly yes)
  • [ ] etcd cluster (3 nodes)
  • [ ] Keycloak clustered (2+ nodes)
  • [ ] ClickHouse replication (2+ nodes)
  • [ ] Qdrant replication (optional, cache-only)
  • [ ] OTel Collector redundancy (2+ nodes)
  • [ ] Redis Exporter + alerting on redis_up, memory, connections
  • [ ] ClickHouse monitoring on system.metrics
  • [ ] Keycloak health endpoint monitored
  • [ ] etcd health endpoint monitored (/health)
  • [ ] DR runbook tested quarterly
  • [ ] Chaos tests run monthly
