# Redis Resilience Guide

## Why Redis is Critical

Redis is the shared state backbone for all three gateway layers:
| Layer | Redis Usage | Failure Impact |
|---|---|---|
| L1 (APISIX) | Rate limit counters, tenant config lookup | Fails open: requests pass without rate limiting |
| L2 (AI Gateway) | Response cache, token budgets, semantic cache keys | Degrades: no cache (every request hits LLM), no budget enforcement |
| L3 (Agent Gateway) | Sessions, policies, HITL queue, A2A state | Fails hard: session operations fail, agent sessions dead |
| CP API | Tenant metadata, notifications, settings, webhooks | Fails hard: portal API calls fail |
## Failure Modes & Behavior

### Redis Complete Outage
| Service | Behavior | Status Code |
|---|---|---|
| L1 APISIX | Fails open — all requests pass, no rate limiting | 200 + X-RateLimit-Status: degraded |
| L2 AI GW | Degraded — cache miss (LLM called), budget check skipped | 200 (health: "degraded") |
| L3 Agent GW | Unavailable — session CRUD fails | 500/503 |
| CP API | Degraded — health shows "degraded" | Varies |
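The fail-open behavior in the first row can be sketched as follows. This is a minimal illustration of the pattern, not the actual APISIX Lua plugin; `StubRedis`, `RedisDown`, and `check_rate_limit` are hypothetical names.

```python
# Sketch of fail-open rate limiting: on any Redis error, the request
# is allowed through and the response is flagged as degraded.
import time

class RedisDown(Exception):
    """Stands in for any Redis connection/command error."""

class StubRedis:
    """Simulates an unreachable Redis instance."""
    def incr(self, key):
        raise RedisDown("connection refused")

def check_rate_limit(client, tenant, limit):
    """Return (allowed, extra_headers). On Redis failure, fail open."""
    key = f"rl:{tenant}:{int(time.time()) // 60}"  # per-minute window
    try:
        count = client.incr(key)
        return count <= limit, {}
    except RedisDown:
        # Fail open: let the request through, flag degraded mode.
        return True, {"X-RateLimit-Status": "degraded"}

allowed, headers = check_rate_limit(StubRedis(), "acme", limit=100)
assert allowed and headers["X-RateLimit-Status"] == "degraded"
```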
### Redis High Latency (>100ms)
Rate limit checks add Redis round-trip latency directly to L1 P99. At 100ms Redis latency, L1 P99 climbs to ~120ms or more, well beyond the 50ms target.

**Mitigation:** Redis Sentinel with local read replicas. L1 reads from a replica and writes to the primary.
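A complementary client-side guard is to cap how long a Redis call may take and treat a timeout as a miss. This is a minimal sketch assuming a 10 ms per-call budget; `SlowRedis` and `bounded_get` are illustrative names, not the actual gateway code.

```python
# Sketch: bound the latency Redis can add to a request path.
# A timed-out lookup is treated as a cache miss (fail open).
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import time

class SlowRedis:
    def get(self, key):
        time.sleep(0.1)  # simulate 100 ms Redis latency
        return b"42"

def bounded_get(client, key, budget_s=0.010):
    """Fetch with a hard deadline; a real implementation would reuse the pool."""
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(client.get, key).result(timeout=budget_s)
    except FutureTimeout:
        return None  # miss: the request proceeds without the value
    finally:
        pool.shutdown(wait=False)  # don't wait for the slow call

start = time.monotonic()
value = bounded_get(SlowRedis(), "cache:acme")
assert value is None                       # timed out, treated as miss
assert time.monotonic() - start < 0.05     # added latency stays bounded
```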
### Redis Memory Pressure

Redis is configured with `maxmemory-policy allkeys-lru`. When memory fills:
- Oldest cache entries evicted first (correct behavior)
- Rate limit counters may be evicted (temporary over-counting on next window)
- Session data preserved as long as TTL hasn't expired
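The eviction order above can be illustrated with an exact LRU model; note that real Redis uses approximated LRU via random sampling, so this is a simplification. `LruStore` is a hypothetical stand-in.

```python
# Sketch of allkeys-lru behavior: the least-recently-used key is
# evicted first, so untouched cache entries go before live sessions.
from collections import OrderedDict

class LruStore:
    def __init__(self, maxkeys):
        self.maxkeys = maxkeys
        self.data = OrderedDict()  # insertion/access order = recency

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.maxkeys:
            self.data.popitem(last=False)  # evict least-recently-used

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # a read refreshes recency
        return self.data[key]

store = LruStore(maxkeys=2)
store.set("cache:old", 1)
store.set("session:a", 2)
store.get("session:a")               # session touched recently
store.set("cache:new", 3)            # forces an eviction
assert store.get("cache:old") is None    # oldest cache entry evicted first
assert store.get("session:a") == 2       # recently used data survives
```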
### Redis Restart

All Rust services use `redis::aio::ConnectionManager`, which auto-reconnects. The APISIX Lua plugin creates a new connection per request from its keepalive pool. Recovery is automatic within 2-3 seconds.
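The reconnect pattern can be sketched as retry with exponential backoff. This mirrors the general behavior of `redis::aio::ConnectionManager` but is not its implementation; `FlakyRedis`, the attempt count, and the 0.25 s base delay are illustrative assumptions.

```python
# Sketch of auto-reconnect with exponential backoff. FlakyRedis is a
# stub that recovers after a fixed number of failed attempts.
class ConnectionRefused(Exception):
    pass

class FlakyRedis:
    def __init__(self, failures):
        self.failures = failures
    def connect(self):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionRefused("not ready")
        return "connected"

def reconnect(client, max_attempts=5, base_delay=0.25):
    """Return (state, delays); delays are where a real client would sleep."""
    delays = []
    for attempt in range(max_attempts):
        try:
            return client.connect(), delays
        except ConnectionRefused:
            delays.append(base_delay * 2 ** attempt)  # 0.25s, 0.5s, 1s, ...
    raise ConnectionRefused("gave up")

state, delays = reconnect(FlakyRedis(failures=2))
assert state == "connected"
assert delays == [0.25, 0.5]  # total backoff well inside the 2-3 s window
```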
## Production Configuration

### Redis Sentinel (Recommended)
```yaml
# docker-compose.prod.yml
services:
  redis-master:
    image: redis:7-alpine
    command: >
      redis-server
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes
      --appendfsync everysec
    volumes:
      - redis-master-data:/data

  redis-replica:
    image: redis:7-alpine
    command: >
      redis-server
      --replicaof redis-master 6379
      --masterauth ${REDIS_PASSWORD}
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes

  redis-sentinel:
    image: redis:7-alpine
    command: >
      redis-sentinel /etc/sentinel.conf
    volumes:
      - ./infra/redis/sentinel.conf:/etc/sentinel.conf

volumes:
  redis-master-data:  # named volume for master persistence
```

### Sentinel Configuration
```conf
# infra/redis/sentinel.conf
sentinel monitor mymaster redis-master 6379 2
sentinel auth-pass mymaster ${REDIS_PASSWORD}
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
```

### Key Production Settings
| Setting | Dev Default | Production | Why |
|---|---|---|---|
| `appendonly` | no | yes | Data survives restart |
| `appendfsync` | N/A | everysec | 1s RPO (balances durability and performance) |
| `maxmemory` | 512mb | 2gb+ | Scale for tenant count |
| `maxmemory-policy` | allkeys-lru | allkeys-lru | Correct as-is — evicts oldest cache entries |
| `save` | "" (disabled) | 900 1 | RDB snapshot every 15min |
| `tcp-keepalive` | 300 | 60 | Detect dead connections faster |
| `timeout` | 0 | 300 | Close idle clients after 5min |
## Monitoring

### Key Metrics (Prometheus)
```yaml
# Redis Exporter (add to docker-compose)
redis-exporter:
  image: oliver006/redis_exporter:latest
  environment:
    - REDIS_ADDR=redis://gw-redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}
  ports:
    - "9121:9121"
```

### Alert Rules
| Metric | Threshold | Action |
|---|---|---|
| `redis_up` | 0 | Page — all rate limiting disabled |
| `redis_connected_clients` | > 500 | Warn — connection pool leak |
| `redis_used_memory_bytes / redis_maxmemory` | > 80% | Warn — approaching eviction |
| `redis_keyspace_misses / (hits + misses)` | > 50% | Investigate — cache thrashing |
| `redis_latest_fork_duration_seconds` | > 1s | Warn — RDB/AOF fork slow |
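The first row of the table can be expressed as a Prometheus alerting rule. This is a sketch only: the group name, alert name, and label values are illustrative and should be adapted to your existing rule files.

```yaml
# Hypothetical Prometheus alerting rule for the redis_up metric above
groups:
  - name: redis
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Redis down — rate limiting disabled on L1"
```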
## Chaos Testing

Run the chaos test suite to validate behavior:

```bash
./scripts/chaos-redis.sh
```

This tests: complete outage, recovery, memory pressure, restart under load, and data integrity.
## Architecture Decision: Why Single Redis?
For the initial enterprise tier (< 50 tenants, < 10k TPS), a single Redis Sentinel cluster is sufficient:
- Redis handles 100K+ ops/sec on a single core
- All operations are O(1) or O(log N) (ZSET for rate limits)
- Memory footprint: ~100 bytes per rate limit window, ~1KB per session
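A back-of-envelope check of those footprint figures against the scale triggers below; the 60-windows-per-tenant retention is an assumption for illustration (e.g. one per-minute window kept per hour), not a measured value.

```python
# Rough sizing from the stated per-item footprints.
WINDOW_BYTES = 100      # per rate limit window (stated above)
SESSION_BYTES = 1024    # per session (stated above)

tenants = 100                 # scale-trigger tenant count
windows_per_tenant = 60       # assumed retained windows per tenant
sessions = 50_000             # scale-trigger concurrent sessions

rate_limit_mb = tenants * windows_per_tenant * WINDOW_BYTES / 1e6
session_mb = sessions * SESSION_BYTES / 1e6

assert rate_limit_mb < 1           # rate limit state is negligible (~0.6 MB)
assert 50 < session_mb < 52        # ~51 MB — well under the 2 GB maxmemory
```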
Scale triggers for Redis Cluster:
- More than 100 tenants with independent rate limit windows
- More than 50K concurrent sessions in L3
- More than 10GB memory needed for cache + sessions