# ADR-003: LiteLLM as Sidecar for AI Gateway

## Status

Accepted — 2026-03-21
## Context
Layer 2 needs to route requests to multiple LLM providers (OpenAI, Anthropic, Gemini, Ollama) with semantic caching, PII redaction, and token budget enforcement.
## Options evaluated
- LiteLLM sidecar — Python library wrapped in FastAPI, runs alongside APISIX
- LiteLLM inline — Embedded directly in APISIX via ext-plugin mechanism
- Custom proxy — Build from scratch with httpx + provider-specific SDKs
## Decision
Run LiteLLM as a sidecar service behind APISIX, wrapped in a custom FastAPI application.
## Rationale
- Separation of concerns: LLM routing lifecycle (model selection, fallbacks, retries) is independent of API gateway lifecycle (auth, rate limiting, routing).
- Independent scaling: AI gateway can scale horizontally based on LLM request volume, while APISIX scales based on total API traffic.
- Python ecosystem: PII redaction (Presidio), semantic caching (Qdrant client), and OTel instrumentation are all Python-native.
- LiteLLM maturity: Handles provider-specific quirks (token counting, streaming formats, error mapping) that would take months to rebuild.
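The routing lifecycle the sidecar owns (model selection, fallbacks, retries) can be sketched with stdlib asyncio. The provider functions below are hypothetical stubs standing in for `litellm.acompletion` calls; the chain order, retry count, and error handling are illustrative, not the production configuration.

```python
import asyncio

# Hypothetical provider stubs; the real sidecar would invoke
# litellm.acompletion() with the provider-specific model name instead.
async def call_openai(prompt: str) -> str:
    raise ConnectionError("openai unavailable")  # simulate an outage

async def call_anthropic(prompt: str) -> str:
    return f"anthropic: {prompt}"

# Fallback chain: retry transient failures on one provider before
# moving on to the next.
FALLBACK_CHAIN = [call_openai, call_anthropic]

async def route(prompt: str, retries: int = 2) -> str:
    last_error: Exception | None = None
    for provider in FALLBACK_CHAIN:
        for _attempt in range(retries):
            try:
                return await provider(prompt)
            except ConnectionError as exc:
                last_error = exc
                await asyncio.sleep(0)  # real backoff would go here
    raise RuntimeError("all providers failed") from last_error

print(asyncio.run(route("hello")))  # → anthropic: hello
```

The key point for this ADR: none of this logic belongs in APISIX; it lives entirely behind the sidecar boundary.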
## Semantic cache design
- Two-tier cache: Redis for exact-match (hash of model + normalized prompt), Qdrant for semantic similarity
- Invalidation: Cache entries tagged with model version; invalidated on model updates
- Target: Cache lookup < 5ms at P99
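A minimal sketch of the two-tier lookup, using an in-memory dict and list as stand-ins for Redis and Qdrant and a placeholder embedding; the 0.92 similarity threshold is an assumption for illustration, not a tuned value. Note how the model version is folded into the exact-match key, which is what makes invalidation-on-model-update automatic.

```python
import hashlib
import math

# In-memory stand-ins for Redis (exact tier) and Qdrant (semantic tier).
exact_tier: dict[str, str] = {}
semantic_tier: list[tuple[list[float], str]] = []  # (embedding, response)

def cache_key(model: str, model_version: str, prompt: str) -> str:
    # Normalize so trivial whitespace/case differences still hit tier 1.
    normalized = " ".join(prompt.lower().split())
    # Tagging the key with the model version means a model update
    # naturally orphans old entries (invalidation by key rotation).
    raw = f"{model}:{model_version}:{normalized}"
    return hashlib.sha256(raw.encode()).hexdigest()

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup(model, version, prompt, embedding, threshold=0.92):
    key = cache_key(model, version, prompt)
    if key in exact_tier:                       # tier 1: exact match
        return exact_tier[key]
    best = max(semantic_tier, key=lambda e: cosine(e[0], embedding),
               default=None)
    if best and cosine(best[0], embedding) >= threshold:  # tier 2: semantic
        return best[1]
    return None

# Usage: store under one phrasing, hit with a sloppier one.
exact_tier[cache_key("gpt-4o", "v1", "Hello  World")] = "cached response"
print(lookup("gpt-4o", "v1", "hello world", [1.0, 0.0]))  # → cached response
```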
## PII redaction

- Presidio runs BEFORE the request leaves the gateway
- Only the `pii_detected=true` flag is logged, never the content
- PII scan target: < 10ms at P99
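The log-only-the-flag rule can be sketched as follows. The email regex is a deliberately crude stand-in for Presidio's analyzer (the real sidecar would use the `presidio-analyzer` and `presidio-anonymizer` packages); the point is the logging discipline, not the detection logic.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai-gateway")

# Stand-in for Presidio's AnalyzerEngine: one recognizer, one entity type.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(prompt: str) -> tuple[str, bool]:
    redacted, hits = EMAIL.subn("<EMAIL>", prompt)
    pii_detected = hits > 0
    # Log only the boolean flag — never the match or the prompt itself.
    log.info("pii_detected=%s", pii_detected)
    return redacted, pii_detected

print(redact("contact me at alice@example.com"))
# → ('contact me at <EMAIL>', True)
```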
## Consequences
- Extra network hop between APISIX and AI gateway (~1ms on local network)
- Python process needs careful async management to avoid event loop blocking
- Must handle graceful degradation when the AI gateway is down (APISIX returns 503)
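The event-loop concern can be made concrete: a CPU-bound step (here a hypothetical 10 ms stand-in for a Presidio scan) must be offloaded with `asyncio.to_thread`, or it stalls every other in-flight request on the sidecar's single event loop.

```python
import asyncio
import time

def cpu_bound_scan(prompt: str) -> bool:
    # Hypothetical stand-in for a CPU-bound PII scan; blocks ~10 ms.
    time.sleep(0.01)
    return "@" in prompt

async def handle_request(prompt: str) -> bool:
    # Calling cpu_bound_scan() inline would block the event loop for the
    # full scan; to_thread runs it in the default executor instead.
    return await asyncio.to_thread(cpu_bound_scan, prompt)

print(asyncio.run(handle_request("alice@example.com")))  # → True
```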