
Redis Resilience Guide

Why Redis is Critical

Redis is the shared state backbone for all three gateway layers:

| Layer | Redis Usage | Failure Impact |
|---|---|---|
| L1 (APISIX) | Rate limit counters, tenant config lookup | Fails open: requests pass without rate limiting |
| L2 (AI Gateway) | Response cache, token budgets, semantic cache keys | Degrades: no cache (every request hits LLM), no budget enforcement |
| L3 (Agent Gateway) | Sessions, policies, HITL queue, A2A state | Fails hard: session operations fail, agent sessions dead |
| CP API | Tenant metadata, notifications, settings, webhooks | Fails hard: portal API calls fail |

Failure Modes & Behavior

Redis Complete Outage

| Service | Behavior | Status Code |
|---|---|---|
| L1 APISIX | Fails open — all requests pass, no rate limiting | `200` + `X-RateLimit-Status: degraded` |
| L2 AI GW | Degraded — cache miss (LLM called), budget check skipped | `200` (health: `"degraded"`) |
| L3 Agent GW | Unavailable — session CRUD fails | `500`/`503` |
| CP API | Degraded — health shows `"degraded"` | Varies |
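
The fail-open contract for L1 can be sketched as a small wrapper: any Redis error degrades the check to "allow" rather than blocking traffic. This is an illustrative Python sketch, not the actual gateway code (which lives in the APISIX Lua plugin); all names here are hypothetical.

```python
# Hypothetical fail-open rate limit check: a Redis error must never
# block a request, only mark the response as degraded.

class RedisDown(Exception):
    """Stand-in for a Redis connection/timeout error."""

def check_rate_limit(redis_incr, key: str, limit: int) -> tuple[bool, str]:
    """Return (allowed, status). Fails open on any Redis error."""
    try:
        count = redis_incr(key)          # INCR on the window counter
        return count <= limit, "ok"
    except RedisDown:
        return True, "degraded"          # fail open: let the request pass

# Simulate a healthy Redis, then an outage.
def healthy(key):
    return 3                             # current count within the window

def down(key):
    raise RedisDown("connection refused")

print(check_rate_limit(healthy, "tenant:42:rl", 10))  # (True, 'ok')
print(check_rate_limit(down, "tenant:42:rl", 10))     # (True, 'degraded')
```

L3, by contrast, cannot fail open: a session read that silently returns nothing would corrupt agent state, so it surfaces the error as `500`/`503` instead.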

Redis High Latency (>100ms)

Rate limit checks add a Redis round trip directly to L1 P99. At 100ms Redis latency, L1 P99 rises to ~120ms+, far beyond the 50ms target.

Mitigation: Redis Sentinel with local read replicas. L1 reads from replica, writes to primary.
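
The read/write split behind this mitigation can be sketched as follows. This is a hypothetical illustration with stand-in client callables, not the gateway's actual connection handling; note that rate-limit increments are writes and must still hit the primary, so the replica only absorbs read traffic such as tenant config lookups.

```python
# Illustrative read/write split: latency-sensitive reads go to a local
# replica, all writes go to the Sentinel-elected primary.

class SplitRedis:
    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def get(self, key):
        return self.replica("GET", key)    # cheap local read (may be slightly stale)

    def incr(self, key):
        return self.primary("INCR", key)   # writes must hit the primary

# Mock clients that record where each command went.
calls = []
def primary(cmd, key):
    calls.append(("primary", cmd)); return 1
def replica(cmd, key):
    calls.append(("replica", cmd)); return "cached"

r = SplitRedis(primary, replica)
r.get("tenant:42:config")
r.incr("tenant:42:rl")
print(calls)  # [('replica', 'GET'), ('primary', 'INCR')]
```

Replica reads trade a small staleness window for latency, which is acceptable for config lookups but not for counters.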

Redis Memory Pressure

Redis runs with `maxmemory-policy allkeys-lru`. When memory fills:

  • Oldest cache entries evicted first (correct behavior)
  • Rate limit counters may be evicted (a reset counter briefly admits more requests than the limit in that window)
  • Session data preserved as long as TTL hasn't expired

Redis Restart

All Rust services use `redis::aio::ConnectionManager`, which reconnects automatically. The APISIX Lua plugin opens a new connection per request from its keepalive pool. Recovery is automatic within 2-3 seconds.
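
The reconnect behavior can be mocked as a retry loop with backoff. This Python sketch only illustrates the pattern; the real services rely on redis-rs's `ConnectionManager` rather than hand-rolled retries, and the function names here are hypothetical.

```python
import time

# Sketch of retry-with-backoff around a flaky Redis call: transient
# failures during a restart are retried instead of surfacing to callers.

def with_reconnect(op, retries=3, backoff=0.01):
    for attempt in range(retries):
        try:
            return op()
        except ConnectionError:
            if attempt == retries - 1:
                raise                              # give up after final attempt
            time.sleep(backoff * (2 ** attempt))   # exponential backoff

# A call that fails twice (Redis restarting), then recovers.
state = {"failures": 2}
def flaky_ping():
    if state["failures"] > 0:
        state["failures"] -= 1
        raise ConnectionError("redis restarting")
    return "PONG"

print(with_reconnect(flaky_ping))  # PONG
```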

Production Configuration

```yaml
# docker-compose.prod.yml
services:
  redis-master:
    image: redis:7-alpine
    command: >
      redis-server
      --maxmemory 2gb
      --maxmemory-policy allkeys-lru
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes
      --appendfsync everysec
    volumes:
      - redis-master-data:/data

  redis-replica:
    image: redis:7-alpine
    command: >
      redis-server
      --replicaof redis-master 6379
      --masterauth ${REDIS_PASSWORD}
      --requirepass ${REDIS_PASSWORD}
      --appendonly yes

  redis-sentinel:
    image: redis:7-alpine
    command: >
      redis-sentinel /etc/sentinel.conf
    volumes:
      - ./infra/redis/sentinel.conf:/etc/sentinel.conf
```

Sentinel Configuration

```conf
# infra/redis/sentinel.conf
sentinel monitor mymaster redis-master 6379 2
sentinel auth-pass mymaster ${REDIS_PASSWORD}
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel parallel-syncs mymaster 1
```

Note: a quorum of 2 implies at least three Sentinel processes, since a majority of Sentinels must agree before failover can proceed — run three `redis-sentinel` instances in production, not one. Sentinel does not expand environment variables in its config file, so the `${REDIS_PASSWORD}` placeholder must be substituted at deploy time (e.g. by templating the file).

Key Production Settings

| Setting | Dev Default | Production | Why |
|---|---|---|---|
| `appendonly` | `no` | `yes` | Data survives restart |
| `appendfsync` | N/A | `everysec` | 1s RPO (balances durability and performance) |
| `maxmemory` | `512mb` | `2gb`+ | Scale for tenant count |
| `maxmemory-policy` | `allkeys-lru` | `allkeys-lru` | Correct — evicts oldest cache entries |
| `save` | `""` (disabled) | `900 1` | RDB snapshot every 15min |
| `tcp-keepalive` | `300` | `60` | Detect dead connections faster |
| `timeout` | `0` | `300` | Close idle clients after 5min |

Monitoring

Key Metrics (Prometheus)

```yaml
# Redis Exporter (add to docker-compose)
redis-exporter:
  image: oliver006/redis_exporter:latest
  environment:
    - REDIS_ADDR=redis://gw-redis:6379
    - REDIS_PASSWORD=${REDIS_PASSWORD}
  ports:
    - "9121:9121"
```

Alert Rules

| Metric | Threshold | Action |
|---|---|---|
| `redis_up` | 0 | Page — all rate limiting disabled |
| `redis_connected_clients` | > 500 | Warn — connection pool leak |
| `redis_used_memory_bytes / redis_maxmemory` | > 80% | Warn — approaching eviction |
| `redis_keyspace_misses / (hits + misses)` | > 50% | Investigate — cache thrashing |
| `redis_latest_fork_duration_seconds` | > 1s | Warn — RDB/AOF fork slow |
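
The cache-thrashing rule is just a miss-ratio check over the exporter's keyspace counters. A minimal sketch with mocked counter values (the real check would be a PromQL expression over the exporter metrics):

```python
# Miss ratio over Redis keyspace counters; alert when misses dominate.

def miss_ratio(hits: int, misses: int) -> float:
    total = hits + misses
    return misses / total if total else 0.0

# Mocked INFO stats counters for illustration.
stats = {"keyspace_hits": 400, "keyspace_misses": 600}
ratio = miss_ratio(stats["keyspace_hits"], stats["keyspace_misses"])
print(f"miss ratio {ratio:.0%}, alert: {ratio > 0.5}")  # miss ratio 60%, alert: True
```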

Chaos Testing

Run the chaos test suite to validate behavior:

```bash
./scripts/chaos-redis.sh
```

This tests: complete outage, recovery, memory pressure, restart under load, data integrity.

Architecture Decision: Why Single Redis?

For the initial enterprise tier (< 50 tenants, < 10k TPS), a single Redis Sentinel cluster is sufficient:

  • Redis handles 100K+ ops/sec on a single core
  • All operations are O(1) or O(log N) (ZSET for rate limits)
  • Memory footprint: ~100 bytes per rate limit window, ~1KB per session
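
A back-of-envelope memory check using the footprint figures above (~100 bytes per rate-limit window, ~1KB per session). The tenant, window, and session counts below are illustrative, not measured values:

```python
# Rough Redis memory estimate from per-key footprints.

def redis_memory_bytes(tenants: int, windows_per_tenant: int, sessions: int,
                       window_bytes: int = 100, session_bytes: int = 1024) -> int:
    return tenants * windows_per_tenant * window_bytes + sessions * session_bytes

# Initial enterprise tier: 50 tenants, 10 rate-limit windows each, 10k sessions.
needed = redis_memory_bytes(tenants=50, windows_per_tenant=10, sessions=10_000)
print(f"{needed / 2**20:.1f} MiB")  # 9.8 MiB — well under the 2gb maxmemory
```

Even generous inputs land orders of magnitude below `maxmemory`, which is why the scale triggers below are framed in tenants, sessions, and total gigabytes rather than ops/sec.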

Scale triggers for Redis Cluster:

  • > 100 tenants with independent rate limit windows
  • > 50K concurrent sessions in L3
  • > 10GB memory needed for cache + sessions
