ADR-003: LiteLLM as Sidecar for AI Gateway

Status

Accepted — 2026-03-21

Context

Layer 2 needs to route requests to multiple LLM providers (OpenAI, Anthropic, Gemini, Ollama) with semantic caching, PII redaction, and token budget enforcement.

Options evaluated

  1. LiteLLM sidecar — Python library wrapped in FastAPI, runs alongside APISIX
  2. LiteLLM inline — Embedded directly in APISIX via ext-plugin mechanism
  3. Custom proxy — Build from scratch with httpx + provider-specific SDKs

Decision

Run LiteLLM as a sidecar service behind APISIX, wrapped in a custom FastAPI application.

Rationale

  • Separation of concerns: LLM routing lifecycle (model selection, fallbacks, retries) is independent of API gateway lifecycle (auth, rate limiting, routing).
  • Independent scaling: AI gateway can scale horizontally based on LLM request volume, while APISIX scales based on total API traffic.
  • Python ecosystem: PII redaction (Presidio), semantic caching (Qdrant client), and OTel instrumentation are all Python-native.
  • LiteLLM maturity: Handles provider-specific quirks (token counting, streaming formats, error mapping) that would take months to rebuild.
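The fallback and retry behavior cited above can be sketched as a plain provider chain. This is an illustrative stand-in, not LiteLLM's actual API: the model names, the `call` callable, and the retry/backoff parameters are all assumptions for the example.

```python
import time

# Hypothetical provider chain; LiteLLM manages this internally with
# provider-specific error mapping that this sketch omits.
PROVIDER_CHAIN = ["openai/gpt-4o", "anthropic/claude-sonnet", "ollama/llama3"]

def complete_with_fallback(prompt, providers, call, retries=2, backoff=0.0):
    """Try each provider in order; retry transient failures before falling back."""
    last_error = None
    for model in providers:
        for attempt in range(retries + 1):
            try:
                return {"model": model, "text": call(model, prompt)}
            except Exception as exc:  # real code would catch per-provider error types
                last_error = exc
                time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError(f"all providers failed: {last_error}")
```

The point of the sketch is the shape of the logic, not the code volume: multiply it by per-provider token counting, streaming formats, and error taxonomies, and the "months to rebuild" estimate follows.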

Semantic cache design

  • Two-tier cache: Redis for exact-match (hash of model + normalized prompt), Qdrant for semantic similarity
  • Invalidation: Cache entries tagged with model version; invalidated on model updates
  • Target: Cache lookup < 5ms at P99
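The two-tier lookup can be sketched as follows. In-process dicts and lists stand in for Redis and Qdrant, and the `embed` callable, the cosine comparison, and the 0.95 similarity threshold are illustrative assumptions, not the production configuration.

```python
import hashlib
import math

exact_cache = {}     # stand-in for Redis: key -> cached response
semantic_cache = []  # stand-in for Qdrant: (embedding, response) pairs

def exact_key(model, prompt):
    """Tier-1 key: hash of model + normalized prompt (lowercased, whitespace collapsed)."""
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(f"{model}:{normalized}".encode()).hexdigest()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def lookup(model, prompt, embed, threshold=0.95):
    key = exact_key(model, prompt)
    if key in exact_cache:  # tier 1: exact match, cheapest path
        return exact_cache[key]
    vector = embed(prompt)
    for cached_vec, response in semantic_cache:  # tier 2: semantic similarity
        if cosine(vector, cached_vec) >= threshold:
            return response
    return None  # cache miss; caller proceeds to the LLM provider
```

Checking the exact tier first keeps the common repeat-prompt case on the sub-millisecond Redis path, so the embedding call is only paid on an exact-match miss.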

PII redaction

  • Presidio runs BEFORE the request leaves the gateway
  • Only the pii_detected flag is logged, never the content
  • PII scan target: < 10ms at P99
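The log-safety rule above can be sketched like this. The regex patterns are a naive stand-in for Presidio's analyzer (they are not how Presidio detects entities), and the `log` callable is an assumed injection point for the logging pipeline.

```python
import re

# Naive illustrative detectors; production uses Presidio, which covers far
# more entity types with NLP-based recognition rather than bare regexes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scan_and_log(prompt, log):
    """Scan before the request leaves the gateway; log only a boolean flag."""
    pii_detected = bool(EMAIL.search(prompt) or SSN.search(prompt))
    # The prompt itself must never reach the log record.
    log({"event": "pii_scan", "pii_detected": pii_detected})
    return pii_detected
```

The invariant worth testing is negative: no log record ever contains the prompt text, only the flag.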

Consequences

  • Extra network hop between APISIX and AI gateway (~1ms on local network)
  • Python process needs careful async management to avoid event loop blocking
  • Must handle graceful degradation when the AI gateway is down (APISIX returns 503)
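The degradation path in the last bullet can be sketched as a simple guard; in production this lives in APISIX's upstream health checks, and the `upstream` callable here is an illustrative assumption.

```python
# When the sidecar is unreachable, the caller gets a clean 503 instead of a
# hung connection or a propagated stack trace.
def handle_request(upstream, failure_types=(TimeoutError, ConnectionError)):
    try:
        return {"status": 200, "body": upstream()}
    except failure_types:
        # AI gateway down: degrade gracefully rather than fail opaquely
        return {"status": 503, "body": "AI gateway unavailable"}
```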