# ADR-003: LiteLLM as Sidecar for AI Gateway

## Status

Accepted — 2026-03-21
## Context
Layer 2 needs to route requests to multiple LLM providers (OpenAI, Anthropic, Gemini, Ollama) with semantic caching, PII redaction, and token budget enforcement.
## Options evaluated
- LiteLLM sidecar — Python library wrapped in FastAPI, runs alongside APISIX
- LiteLLM inline — Embedded directly in APISIX via ext-plugin mechanism
- Custom proxy — Build from scratch with httpx + provider-specific SDKs
## Decision
Run LiteLLM as a sidecar service behind APISIX, wrapped in a custom FastAPI application.
## Rationale
- Separation of concerns: LLM routing lifecycle (model selection, fallbacks, retries) is independent of API gateway lifecycle (auth, rate limiting, routing).
- Independent scaling: AI gateway can scale horizontally based on LLM request volume, while APISIX scales based on total API traffic.
- Python ecosystem: PII redaction (Presidio), semantic caching (Qdrant client), and OTel instrumentation are all Python-native.
- LiteLLM maturity: Handles provider-specific quirks (token counting, streaming formats, error mapping) that would take months to rebuild.
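The routing lifecycle the sidecar owns (model selection, fallbacks, retries) can be sketched with stdlib asyncio. The provider functions below are hypothetical stubs standing in for `litellm.acompletion` calls; the chain order, retry count, and error handling are illustrative, not the production configuration.

```python
import asyncio

# Hypothetical provider stubs; the real sidecar would invoke
# litellm.acompletion() with the provider-specific model name instead.
async def call_openai(prompt: str) -> str:
    raise ConnectionError("openai unavailable")  # simulate an outage

async def call_anthropic(prompt: str) -> str:
    return f"anthropic: {prompt}"

# Fallback chain: retry transient failures on one provider before
# moving on to the next.
FALLBACK_CHAIN = [call_openai, call_anthropic]

async def route(prompt: str, retries: int = 2) -> str:
    last_error: Exception | None = None
    for provider in FALLBACK_CHAIN:
        for _attempt in range(retries):
            try:
                return await provider(prompt)
            except ConnectionError as exc:
                last_error = exc
                await asyncio.sleep(0)  # real backoff would go here
    raise RuntimeError("all providers failed") from last_error

print(asyncio.run(route("hello")))  # → anthropic: hello
```

The key point for this ADR: none of this logic belongs in APISIX; it lives entirely behind the sidecar boundary.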
## Semantic cache design
- Two-tier cache: Redis for exact-match (hash of model + normalized prompt), Qdrant for semantic similarity
- Invalidation: Cache entries tagged with model version; invalidated on model updates
- Target: Cache lookup < 5ms at P99
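A minimal sketch of the two-tier lookup, using an in-memory dict and list as stand-ins for Redis and Qdrant and a placeholder embedding; the 0.92 similarity threshold is an assumption for illustration, not a tuned value. Note how the model version is folded into the exact-match key, which is what makes invalidation-on-model-update automatic.

```python
import hashlib
import math

# In-memory stand-ins for Redis (exact tier) and Qdrant (semantic tier).
exact_tier: dict[str, str] = {}
semantic_tier: list[tuple[list[float], str]] = []  # (embedding, response)

def cache_key(model: str, model_version: str, prompt: str) -> str:
    # Normalize so trivial whitespace/case differences still hit tier 1.
    normalized = " ".join(prompt.lower().split())
    # Tagging the key with the model version means a model update
    # naturally orphans old entries (invalidation by key rotation).
    raw = f"{model}:{model_version}:{normalized}"
    return hashlib.sha256(raw.encode()).hexdigest()

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def lookup(model, version, prompt, embedding, threshold=0.92):
    key = cache_key(model, version, prompt)
    if key in exact_tier:                       # tier 1: exact match
        return exact_tier[key]
    best = max(semantic_tier, key=lambda e: cosine(e[0], embedding),
               default=None)
    if best and cosine(best[0], embedding) >= threshold:  # tier 2: semantic
        return best[1]
    return None

# Usage: store under one phrasing, hit with a sloppier one.
exact_tier[cache_key("gpt-4o", "v1", "Hello  World")] = "cached response"
print(lookup("gpt-4o", "v1", "hello world", [1.0, 0.0]))  # → cached response
```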
## PII redaction

- Presidio runs BEFORE the request leaves the gateway
- Only the `pii_detected=true` flag is logged, never the content
- PII scan target: < 10ms at P99
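The log-only-the-flag rule can be sketched as follows. The email regex is a deliberately crude stand-in for Presidio's analyzer (the real sidecar would use the `presidio-analyzer` and `presidio-anonymizer` packages); the point is the logging discipline, not the detection logic.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai-gateway")

# Stand-in for Presidio's AnalyzerEngine: one recognizer, one entity type.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(prompt: str) -> tuple[str, bool]:
    redacted, hits = EMAIL.subn("<EMAIL>", prompt)
    pii_detected = hits > 0
    # Log only the boolean flag — never the match or the prompt itself.
    log.info("pii_detected=%s", pii_detected)
    return redacted, pii_detected

print(redact("contact me at alice@example.com"))
# → ('contact me at <EMAIL>', True)
```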
## Consequences
- Extra network hop between APISIX and AI gateway (~1ms on local network)
- Python process needs careful async management to avoid event loop blocking
- Must handle graceful degradation when the AI gateway is down (APISIX returns 503)
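The event-loop concern can be made concrete: a CPU-bound step (here a hypothetical 10 ms stand-in for a Presidio scan) must be offloaded with `asyncio.to_thread`, or it stalls every other in-flight request on the sidecar's single event loop.

```python
import asyncio
import time

def cpu_bound_scan(prompt: str) -> bool:
    # Hypothetical stand-in for a CPU-bound PII scan; blocks ~10 ms.
    time.sleep(0.01)
    return "@" in prompt

async def handle_request(prompt: str) -> bool:
    # Calling cpu_bound_scan() inline would block the event loop for the
    # full scan; to_thread runs it in the default executor instead.
    return await asyncio.to_thread(cpu_bound_scan, prompt)

print(asyncio.run(handle_request("alice@example.com")))  # → True
```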