
Backing Service Resilience

Resilience Matrix

When a backing service goes down, this is the impact on each layer:

| Service Down | L1 (APISIX) | L2 (AI Gateway) | L3 (Agent Gateway) | CP API (Portals) |
|---|---|---|---|---|
| Redis | OPEN — no rate limiting | DEGRADE — no cache, no budget | FAIL — sessions dead | DEGRADE — tenant ops fail |
| ClickHouse | OK — logging silently fails | OK — logging silently fails | OK — audit silently fails | DEGRADE — analytics empty |
| Keycloak | OK — cached JWKS keys work | OK — JWKS cache + master key | OK — no auth dependency | DEGRADE — new JWT fails |
| etcd | OK — in-memory route cache | OK — no dependency | OK — no dependency | OK — no dependency |
| Qdrant | OK — no dependency | DEGRADE — no semantic cache | OK — no dependency | OK — no dependency |
| OTel Collector | OK — async non-blocking | OK — async non-blocking | OK — async non-blocking | OK — no dependency |
| Redis + ClickHouse | OPEN — no rate limit, no log | DEGRADE — cache/budget/log | FAIL | FAIL |

Service-by-Service Analysis

1. Redis

Role: Rate limits, cache, sessions, budgets, policies, tenant state, HITL queue

Failure behavior:

  • L1: Fails open — rate limit plugin logs FAILING OPEN, sets X-RateLimit-Status: degraded header
  • L2: Cache miss (OK), budget check skipped (OK), health reports degraded
  • L3: Session CRUD fails → 500 errors
  • CP API: Tenant operations fail → portal errors
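The L1 fail-open behavior can be sketched as a pure decision function. This is an illustration only (the actual plugin is Lua; the names and types here are invented for the sketch):

```rust
// Sketch of the L1 fail-open decision: a Redis lookup error allows
// the request through and marks it degraded, instead of rejecting.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow { degraded: bool },
    Reject,
}

fn rate_limit_decision(lookup: Result<u32, String>, limit: u32) -> Decision {
    match lookup {
        // Counter available: enforce the limit normally.
        Ok(count) if count > limit => Decision::Reject,
        Ok(_) => Decision::Allow { degraded: false },
        // Redis unreachable: log FAILING OPEN and let the request
        // through, flagging it for the X-RateLimit-Status header.
        Err(e) => {
            eprintln!("rate limit backend unreachable, FAILING OPEN: {e}");
            Decision::Allow { degraded: true }
        }
    }
}
```

The key design choice is that availability wins over enforcement at the edge: a Redis outage briefly disables rate limiting rather than taking down all traffic.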

Production hardening: See Redis Resilience Guide


2. ClickHouse

Role: Request logging, AI usage logging, agent audit trail, analytics queries

Failure behavior:

  • L1 (clickhouse-logger Lua plugin): Batches entries in memory, silently drops on ClickHouse failure. Zero request impact.
  • L2 (logging::log_to_clickhouse): Uses tokio::spawn fire-and-forget. Error logged, request unaffected.
  • L3 (audit::log_audit): Same fire-and-forget pattern. Error logged, tool calls unaffected.
  • CP API analytics: Returns empty data ({"data": []}) with warning log. Does NOT return 500.

Data loss window: Duration of outage. Logs during outage are lost (not queued).
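The fire-and-forget pattern behind these bullets can be sketched as follows. The services use `tokio::spawn`; `std::thread` stands in here, and both function names are illustrative:

```rust
use std::thread;

// Simulated ClickHouse write during an outage.
fn write_to_clickhouse(_entry: &str) -> Result<(), String> {
    Err("connection refused".into())
}

// The response is built without waiting on the log write, so a
// ClickHouse outage can neither fail nor slow the request.
fn handle_request(payload: &str) -> String {
    let entry = payload.to_owned();
    thread::spawn(move || {
        // The error is logged and the entry dropped, never propagated.
        if let Err(e) = write_to_clickhouse(&entry) {
            eprintln!("clickhouse log failed, entry dropped: {e}");
        }
    });
    format!("ok: {payload}")
}
```

This is also exactly where the data loss window comes from: a dropped entry is gone, which is why the hardening list suggests a Kafka/NATS buffer when delivery must be guaranteed.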

Production hardening:

  • ClickHouse Keeper for replication coordination (required by ReplicatedMergeTree)
  • ReplicatedMergeTree tables across 2-3 nodes
  • Buffer engine flushes on recovery (already configured)
  • Consider Kafka/NATS buffer in front of ClickHouse for guaranteed delivery

3. Keycloak

Role: JWT token issuance, SSO login flows, OIDC/SAML federation, user management

Failure behavior:

  • L1 (APISIX jwt-auth / openid-connect): Validates JWT signature using cached JWKS keys. Short outages (<5 min) transparent. Longer outages: new tokens can't be validated.
  • L2: JWKS cache (5-min TTL) covers short outages. Master key auth always works. APISIX-proxied requests (with X-Tenant-ID header) always work.
  • L3: No direct Keycloak dependency (uses headers from L1).
  • CP API: New JWT introspection fails → 401 on portal API calls. Master key still works.
  • SSO login: Completely blocked (login page unreachable).
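The JWKS caching that covers short outages can be sketched like this (illustrative types, not the service's real API):

```rust
use std::time::{Duration, Instant};

// Inside the TTL no Keycloak call is made at all; when a refresh
// fails, the stale keys are kept so signature validation keeps
// working through the outage.
struct JwksCache {
    keys: Vec<String>, // key IDs, standing in for full JWK material
    fetched_at: Instant,
    ttl: Duration,
}

impl JwksCache {
    fn current_keys(&mut self, fetch: impl Fn() -> Result<Vec<String>, String>) -> &[String] {
        if self.fetched_at.elapsed() >= self.ttl {
            match fetch() {
                Ok(keys) => {
                    self.keys = keys;
                    self.fetched_at = Instant::now();
                }
                // Keycloak down: serve stale keys rather than fail auth.
                Err(e) => eprintln!("JWKS refresh failed, serving stale keys: {e}"),
            }
        }
        &self.keys
    }
}
```

Serving stale keys is safe because signing keys rotate slowly; it is also why raising the TTL to 15-30 minutes (as suggested below) directly extends outage tolerance.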

Production hardening:

  • Keycloak clustered mode (2+ instances with Infinispan cache)
  • Database-backed sessions (PostgreSQL)
  • Increase JWKS cache TTL to 15-30 minutes
  • Pre-warm JWKS cache on startup
  • External IdP federation: already-issued tokens stay valid until expiry, so federated users with cached tokens remain logged in while Keycloak is down

4. etcd

Role: APISIX route/plugin/SSL configuration store

Failure behavior:

  • L1 (APISIX): Routes cached in-memory. Existing routes continue to work. New route changes won't propagate until etcd recovers.
  • Admin API: Returns 500/503 — cannot create/update/delete routes.
  • All other layers: No direct etcd dependency.
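The split between the data plane and the Admin API can be sketched as follows (illustrative types, not APISIX internals):

```rust
use std::collections::HashMap;

// During an etcd outage, reads are served entirely from the
// in-memory route cache, while writes need etcd and return 503.
struct Gateway {
    route_cache: HashMap<String, String>, // path -> upstream
    etcd_up: bool,
}

impl Gateway {
    // Data plane: existing routes keep working from the cache.
    fn route(&self, path: &str) -> Option<&String> {
        self.route_cache.get(path)
    }

    // Admin API: must persist to etcd, so it fails during the outage.
    fn upsert_route(&mut self, path: &str, upstream: &str) -> Result<(), u16> {
        if !self.etcd_up {
            return Err(503); // cannot create/update/delete routes
        }
        self.route_cache.insert(path.to_string(), upstream.to_string());
        Ok(())
    }
}
```

This is also why a restart during the outage is dangerous (Scenario C below): the in-memory cache is lost, and there is no etcd to repopulate it from.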

Production hardening:

  • etcd cluster (3 or 5 nodes, odd number for quorum)
  • Persistent volume for --data-dir (already configured)
  • etcd snapshot backups (see Disaster Recovery)
  • etcd compaction: automatic with --auto-compaction-retention=1

5. Qdrant

Role: Vector similarity search for semantic cache (L2 only)

Failure behavior:

  • L2: semantic_cache::search() returns None on Qdrant failure → falls back to Redis exact-match cache → if miss, calls LLM normally. Zero visible impact, just reduced cache hit rate.
  • All other layers: No Qdrant dependency.
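The fallback chain can be sketched as a single function (illustrative signatures; the real call is `semantic_cache::search()`):

```rust
// A Qdrant error is treated exactly like a cache miss, so the
// request falls back to the Redis exact-match cache and then to
// the LLM itself. The only visible effect is a lower hit rate.
fn cached_completion(
    prompt: &str,
    semantic: impl Fn(&str) -> Result<Option<String>, String>, // Qdrant
    exact: impl Fn(&str) -> Option<String>,                    // Redis
    llm: impl Fn(&str) -> String,                              // upstream model
) -> String {
    let semantic_hit = match semantic(prompt) {
        Ok(hit) => hit,
        Err(e) => {
            // Qdrant down: degrade to a miss, never fail the request.
            eprintln!("semantic cache unavailable: {e}");
            None
        }
    };
    semantic_hit
        .or_else(|| exact(prompt))
        .unwrap_or_else(|| llm(prompt))
}
```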

Production hardening:

  • Qdrant cluster mode for HA (3+ nodes with replication factor 2)
  • Qdrant is fully optional — set SEMANTIC_CACHE_ENABLED=false to disable
  • Data is cache only — full rebuild from scratch is acceptable

6. OTel Collector

Role: Distributed trace collection and forwarding to Jaeger

Failure behavior:

  • All layers: OTel SDK uses async batch export. Failed exports are silently dropped. Zero request impact.
  • Loss: Traces during outage are not collected.

Production hardening:

  • OTel Collector has built-in retry and queue
  • Deploy 2+ collectors behind a load balancer
  • Persistent queue: exporters.otlp.sending_queue.storage: file_storage
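A minimal collector configuration with a file-backed sending queue might look like this sketch (the directory and endpoint are illustrative):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # survives collector restarts

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: jaeger:4317
    sending_queue:
      enabled: true
      storage: file_storage             # spool failed exports to disk

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
```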

Cascading Failure Scenarios

Scenario A: Redis + ClickHouse down simultaneously

Most likely cause: Shared storage failure, network partition, host crash

| Service | Behavior |
|---|---|
| L1 | Requests pass (fail open), no rate limits, no logging |
| L2 | LLM calls work but: no cache, no budget enforcement, no logging |
| L3 | Dead — all session operations fail |
| CP API | Dead — tenant/analytics operations fail |

Recovery: Redis restart is highest priority (restores L3 + L2 cache). ClickHouse can wait.

Scenario B: Keycloak + Redis down simultaneously

Impact: No auth validation AND no state

  • All authenticated requests fail
  • Master key is the only working auth path
  • Emergency: use master key to perform critical operations

Scenario C: etcd down, then APISIX restart

Impact: APISIX cannot load routes on startup → all L1 traffic fails

  • This is the highest-risk scenario
  • Mitigation: etcd cluster (3 nodes), never restart APISIX during etcd outage
  • Recovery: restore etcd from snapshot, then restart APISIX

Chaos Testing

Run the full chaos suite:

```bash
# All backing services (recommended)
./scripts/chaos-all-services.sh

# Redis-specific deep test
./scripts/chaos-redis.sh
```

Production Checklist

  • [ ] Redis Sentinel (3 nodes) or Redis Cluster
  • [ ] Redis AOF persistence enabled (appendonly yes)
  • [ ] etcd cluster (3 nodes)
  • [ ] Keycloak clustered (2+ nodes)
  • [ ] ClickHouse replication (2+ nodes)
  • [ ] Qdrant replication (optional, cache-only)
  • [ ] OTel Collector redundancy (2+ nodes)
  • [ ] Redis Exporter + alerting on redis_up, memory, connections
  • [ ] ClickHouse monitoring on system.metrics
  • [ ] Keycloak health endpoint monitored
  • [ ] etcd health endpoint monitored (/health)
  • [ ] DR runbook tested quarterly
  • [ ] Chaos tests run monthly
