
Redis Resilience Guide

Why Redis is Critical

Redis is the shared state backbone for all three gateway layers:

| Layer | Redis Usage | Failure Impact |
|---|---|---|
| L1 (APISIX) | Rate limit counters, tenant config lookup | Fails open: requests pass without rate limiting |
| L2 (AI Gateway) | Response cache, token budgets, semantic cache keys | Degrades: no cache (every request hits LLM), no budget enforcement |
| L3 (Agent Gateway) | Sessions, policies, HITL queue, A2A state | Fails hard: session operations fail, agent sessions dead |
| CP API | Tenant metadata, notifications, settings, webhooks | Fails hard: portal API calls fail |

Failure Modes & Behavior

Redis Complete Outage

| Service | Behavior | Status Code |
|---|---|---|
| L1 APISIX | Fails open — all requests pass, no rate limiting | `200` + `X-RateLimit-Status: degraded` |
| L2 AI GW | Degraded — cache miss (LLM called), budget check skipped | `200` (health: `"degraded"`) |
| L3 Agent GW | Unavailable — session CRUD fails | `500`/`503` |
| CP API | Degraded — health shows `"degraded"` | Varies |
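
The fail-open contract for L1 can be sketched as a small wrapper: any Redis error degrades the check to "allow" rather than blocking traffic. This is an illustrative Python sketch, not the actual gateway code (which lives in the APISIX Lua plugin); all names here are hypothetical.

```python
# Hypothetical fail-open rate limit check: a Redis error must never
# block a request, only mark the response as degraded.

class RedisDown(Exception):
    """Stand-in for a Redis connection/timeout error."""

def check_rate_limit(redis_incr, key: str, limit: int) -> tuple[bool, str]:
    """Return (allowed, status). Fails open on any Redis error."""
    try:
        count = redis_incr(key)          # INCR on the window counter
        return count <= limit, "ok"
    except RedisDown:
        return True, "degraded"          # fail open: let the request pass

# Simulate a healthy Redis, then an outage.
def healthy(key):
    return 3                             # current count within the window

def down(key):
    raise RedisDown("connection refused")

print(check_rate_limit(healthy, "tenant:42:rl", 10))  # (True, 'ok')
print(check_rate_limit(down, "tenant:42:rl", 10))     # (True, 'degraded')
```

L3, by contrast, cannot fail open: a session read that silently returns nothing would corrupt agent state, so it surfaces the error as `500`/`503` instead.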

Redis High Latency (>100ms)

Rate limit checks add a Redis round trip directly to L1 P99. At 100ms Redis latency, L1 P99 rises to ~120ms+, far beyond the 50ms target.

Mitigation: Redis Sentinel with local read replicas. L1 reads from replica, writes to primary.
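
The read/write split behind this mitigation can be sketched as follows. This is a hypothetical illustration with stand-in client callables, not the gateway's actual connection handling; note that rate-limit increments are writes and must still hit the primary, so the replica only absorbs read traffic such as tenant config lookups.

```python
# Illustrative read/write split: latency-sensitive reads go to a local
# replica, all writes go to the Sentinel-elected primary.

class SplitRedis:
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def get(self, key):
        return self.replica("GET", key)    # cheap local read (may be slightly stale)

    def incr(self, key):
        return self.primary("INCR", key)   # writes must hit the primary

# Mock clients that record where each command went.
calls = []
def primary(cmd, key):
    calls.append(("primary", cmd)); return 1
def replica(cmd, key):
    calls.append(("replica", cmd)); return "cached"

r = SplitRedis(primary, replica)
r.get("tenant:42:config")
r.incr("tenant:42:rl")
print(calls)  # [('replica', 'GET'), ('primary', 'INCR')]
```

Replica reads trade a small staleness window for latency, which is acceptable for config lookups but not for counters.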

Redis Memory Pressure

Redis runs with `maxmemory-policy allkeys-lru`. When memory fills:

  • Oldest cache entries evicted first (correct behavior)
  • Rate limit counters may be evicted (a reset counter briefly admits more requests than the limit in that window)
  • Session data preserved as long as TTL hasn't expired

Redis Restart

All Rust services use `redis::aio::ConnectionManager`, which reconnects automatically. The APISIX Lua plugin opens a new connection per request from its keepalive pool. Recovery is automatic within 2-3 seconds.
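
The reconnect behavior can be mocked as a retry loop with backoff. This Python sketch only illustrates the pattern; the real services rely on redis-rs's `ConnectionManager` rather than hand-rolled retries, and the function names here are hypothetical.

```python
import time

# Sketch of retry-with-backoff around a flaky Redis call: transient
# failures during a restart are retried instead of surfacing to callers.

def with_reconnect(op, retries=3, backoff=0.01):
    for attempt in range(retries):
        try:
            return op()
        except ConnectionError:
            if attempt == retries - 1:
                raise                              # give up after final attempt
            time.sleep(backoff * (2 ** attempt))   # exponential backoff

# A call that fails twice (Redis restarting), then recovers.
state = {"failures": 2}
def flaky_ping():
    if state["failures"] > 0:
        state["failures"] -= 1
        raise ConnectionError("redis restarting")
    return "PONG"

print(with_reconnect(flaky_ping))  # PONG
```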

Production Configuration

```yaml
# docker-compose.prod.yml
services:
  redis-master:
    image: redis:7-alpine
    command: >
      redis-server
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes
      --appendfsync everysec
    volumes:
      - redis-master-data:/data

  redis-replica:
    image: redis:7-alpine
    command: >
      redis-server
      --replicaof redis-master 6379
      --masterauth ${REDIS_PASSWORD}
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes

  redis-sentinel:
    image: redis:7-alpine
    command: >
      redis-sentinel /etc/sentinel.conf
    volumes:
      - ./infra/redis/sentinel.conf:/etc/sentinel.conf
```

Sentinel Configuration

```conf
# infra/redis/sentinel.conf
sentinel monitor mymaster redis-master 6379 2
sentinel auth-pass mymaster ${REDIS_PASSWORD}
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
```

Note: a quorum of 2 implies at least three Sentinel processes, since a majority of Sentinels must agree before failover can proceed — run three `redis-sentinel` instances in production, not one. Sentinel does not expand environment variables in its config file, so the `${REDIS_PASSWORD}` placeholder must be substituted at deploy time (e.g. by templating the file).

Key Production Settings

| Setting | Dev Default | Production | Why |
|---|---|---|---|
| `appendonly` | `no` | `yes` | Data survives restart |
| `appendfsync` | N/A | `everysec` | 1s RPO (balances durability and performance) |
| `maxmemory` | `512mb` | `2gb`+ | Scale for tenant count |
| `maxmemory-policy` | `allkeys-lru` | `allkeys-lru` | Correct — evicts oldest cache entries |
| `save` | `""` (disabled) | `900 1` | RDB snapshot every 15min |
| `tcp-keepalive` | `300` | `60` | Detect dead connections faster |
| `timeout` | `0` | `300` | Close idle clients after 5min |

Monitoring

Key Metrics (Prometheus)

```yaml
# Redis Exporter (add to docker-compose)
redis-exporter:
  image: oliver006/redis_exporter:latest
  environment:
    - REDIS_ADDR=redis://gw-redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}
  ports:
    - "9121:9121"
```

Alert Rules

| Metric | Threshold | Action |
|---|---|---|
| `redis_up` | 0 | Page — all rate limiting disabled |
| `redis_connected_clients` | > 500 | Warn — connection pool leak |
| `redis_used_memory_bytes / redis_maxmemory` | > 80% | Warn — approaching eviction |
| `redis_keyspace_misses / (hits + misses)` | > 50% | Investigate — cache thrashing |
| `redis_latest_fork_duration_seconds` | > 1s | Warn — RDB/AOF fork slow |
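
The cache-thrashing rule is just a miss-ratio check over the exporter's keyspace counters. A minimal sketch with mocked counter values (the real check would be a PromQL expression over the exporter metrics):

```python
# Miss ratio over Redis keyspace counters; alert when misses dominate.

def miss_ratio(hits: int, misses: int) -> float:
    total = hits + misses
    return misses / total if total else 0.0

# Mocked INFO stats counters for illustration.
stats = {"keyspace_hits": 400, "keyspace_misses": 600}
ratio = miss_ratio(stats["keyspace_hits"], stats["keyspace_misses"])
print(f"miss ratio {ratio:.0%}, alert: {ratio > 0.5}")  # miss ratio 60%, alert: True
```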

Chaos Testing

Run the chaos test suite to validate behavior:

```bash
./scripts/chaos-redis.sh
```

This tests: complete outage, recovery, memory pressure, restart under load, data integrity.

Architecture Decision: Why Single Redis?

For the initial enterprise tier (< 50 tenants, < 10k TPS), a single Redis Sentinel cluster is sufficient:

  • Redis handles 100K+ ops/sec on a single core
  • All operations are O(1) or O(log N) (ZSET for rate limits)
  • Memory footprint: ~100 bytes per rate limit window, ~1KB per session
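
A back-of-envelope memory check using the footprint figures above (~100 bytes per rate-limit window, ~1KB per session). The tenant, window, and session counts below are illustrative, not measured values:

```python
# Rough Redis memory estimate from per-key footprints.

def redis_memory_bytes(tenants: int, windows_per_tenant: int, sessions: int,
                       window_bytes: int = 100, session_bytes: int = 1024) -> int:
    return tenants * windows_per_tenant * window_bytes + sessions * session_bytes

# Initial enterprise tier: 50 tenants, 10 rate-limit windows each, 10k sessions.
needed = redis_memory_bytes(tenants=50, windows_per_tenant=10, sessions=10_000)
print(f"{needed / 2**20:.1f} MiB")  # 9.8 MiB — well under the 2gb maxmemory
```

Even generous inputs land orders of magnitude below `maxmemory`, which is why the scale triggers below are framed in tenants, sessions, and total gigabytes rather than ops/sec.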

Scale triggers for Redis Cluster:

  • > 100 tenants with independent rate limit windows
  • > 50K concurrent sessions in L3
  • > 10GB memory needed for cache + sessions
