# ADR-006: Proprietary Rust AI Gateway (replacing LiteLLM)
## Status
Accepted — 2026-03-21
## Context
Milestone 2 requires a Layer 2 AI Gateway with multi-model routing, semantic caching, PII redaction, token budget enforcement, and SSE streaming. The original plan was to build it in Python on top of LiteLLM.
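Of those requirements, token budget enforcement is the easiest to pin down concretely. A minimal Rust sketch of the check-then-deduct logic follows; the names (`BudgetStore`, `try_consume`) are hypothetical, and the in-memory map stands in for the per-tenant Redis keys the gateway would actually use (where check and deduct should be a single atomic operation).

```rust
use std::collections::HashMap;

/// In-memory stand-in for the per-tenant budget keys kept in Redis.
struct BudgetStore {
    remaining: HashMap<String, i64>,
}

impl BudgetStore {
    fn new() -> Self {
        BudgetStore { remaining: HashMap::new() }
    }

    fn set_budget(&mut self, tenant: &str, tokens: i64) {
        self.remaining.insert(tenant.to_string(), tokens);
    }

    /// Reject up front if the estimated cost exceeds the balance;
    /// otherwise deduct and return the new balance.
    fn try_consume(&mut self, tenant: &str, estimated: i64) -> Result<i64, String> {
        let balance = self
            .remaining
            .get_mut(tenant)
            .ok_or_else(|| format!("unknown tenant: {tenant}"))?;
        if *balance < estimated {
            return Err(format!("budget exhausted: {} tokens left", balance));
        }
        *balance -= estimated;
        Ok(*balance)
    }
}

fn main() {
    let mut store = BudgetStore::new();
    store.set_budget("tenant-a", 1_000);
    assert_eq!(store.try_consume("tenant-a", 400), Ok(600));
    assert!(store.try_consume("tenant-a", 700).is_err());
    println!("budget checks passed");
}
```

In production the estimate is checked before the provider call and reconciled against the provider's actual usage afterwards; the sketch shows only the gate itself.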
## Options evaluated
- Python + LiteLLM — Mature library, fast to build, but GIL-limited, high release churn, 2-5k req/s ceiling
- Go — Fast, good concurrency, but no mature LLM routing library
- Rust — Fastest, zero-cost abstractions, async-native, aligns with L3 stack
- Rust hot path + Python sidecar — Hybrid, but adds operational complexity
## Decision
Build a proprietary AI Gateway entirely in Rust. No Python. No LiteLLM dependency.
## Rationale
- Performance: Sub-millisecond cache hits (Redis + Qdrant). 20k+ req/s for cache-hit path. SSE streaming with zero-copy forwarding.
- No dependency risk: LiteLLM has high release churn and breaking changes. Owning the routing logic gives full control.
- Stack alignment: L3 (agentgateway.dev) is also Rust. Single language for L2+L3 reduces cognitive overhead and enables shared libraries.
- Proprietary moat: Custom PII redaction, semantic cache, and budget enforcement tightly integrated — not a wrapper around someone else's library.
- Cache-hit ratio: At 40%+ cache hit rate, 40% of traffic never touches an LLM provider. The gateway itself is the bottleneck for those requests — Rust makes that path sub-millisecond.
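The two-tier lookup behind those cache-hit numbers (Redis exact match first, Qdrant similarity second) can be sketched as follows. This is an illustrative sketch, not the gateway's implementation: the types are hypothetical, in-memory structures stand in for Redis and Qdrant, and the 0.9 similarity threshold is an assumed value, not a tuned one.

```rust
use std::collections::HashMap;

/// Two-tier cache: `exact` plays the role of Redis (exact key match),
/// `vectors` plays the role of Qdrant (nearest-neighbour over embeddings).
struct SemanticCache {
    exact: HashMap<String, String>,   // normalized prompt -> response
    vectors: Vec<(Vec<f32>, String)>, // (embedding, response)
    threshold: f32,                   // assumed illustrative value
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

impl SemanticCache {
    fn lookup(&self, prompt: &str, embedding: &[f32]) -> Option<&str> {
        // Tier 1: exact match (a Redis GET in the real gateway).
        if let Some(hit) = self.exact.get(prompt) {
            return Some(hit.as_str());
        }
        // Tier 2: best match above the similarity threshold
        // (a Qdrant search in the real gateway).
        self.vectors
            .iter()
            .map(|(v, resp)| (cosine(embedding, v), resp))
            .filter(|(score, _)| *score >= self.threshold)
            .max_by(|a, b| a.0.total_cmp(&b.0))
            .map(|(_, resp)| resp.as_str())
    }
}

fn main() {
    let mut cache = SemanticCache {
        exact: HashMap::new(),
        vectors: vec![(vec![1.0, 0.0], "cached answer".to_string())],
        threshold: 0.9,
    };
    cache.exact.insert("what is rust?".into(), "a language".into());

    // Exact hit short-circuits before any similarity search.
    assert_eq!(cache.lookup("what is rust?", &[0.0, 1.0]), Some("a language"));
    // Near-duplicate embedding clears the similarity threshold.
    assert_eq!(cache.lookup("new prompt", &[0.99, 0.01]), Some("cached answer"));
    // Orthogonal embedding misses both tiers.
    assert_eq!(cache.lookup("unrelated", &[0.0, 1.0]), None);
    println!("cache lookups behaved as expected");
}
```

The exact tier is what keeps the hit path sub-millisecond; the similarity tier is what pushes the hit ratio toward the 40%+ figure.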
## Architecture

```
Request → Rust Gateway
 ├─ Extract tenant_id (from header or JWT)
 ├─ Token budget check (Redis)
 ├─ PII scan (regex + entity detection)
 ├─ Semantic cache lookup (Redis exact → Qdrant similarity)
 ├─ [CACHE HIT]  → Return cached response (<1ms)
 ├─ [CACHE MISS] → Forward to LLM provider
 │   ├─ OpenAI    (reqwest + SSE)
 │   ├─ Anthropic (reqwest + SSE)
 │   ├─ Gemini    (reqwest + SSE)
 │   └─ Ollama    (reqwest + SSE)
 ├─ Cache response (Redis + Qdrant)
 ├─ Deduct token budget (Redis)
 └─ Log to ClickHouse (async batch)
```

## Consequences
- More upfront development time vs LiteLLM wrapper
- Must implement provider-specific API formats (OpenAI, Anthropic, Gemini, Ollama)
- Must handle provider quirks (token counting, error mapping, streaming formats)
- Team needs Rust proficiency
- Significantly faster and more reliable in production: the cache-hit path targets 20k+ req/s, versus the 2-5k req/s LiteLLM ceiling noted above
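On the streaming-format quirks: providers frame SSE differently (OpenAI-compatible APIs, including Ollama's OpenAI endpoint, terminate with a `data: [DONE]` sentinel, while Anthropic signals completion through typed events), so the forwarding path needs per-provider normalization. A minimal sketch of the line-level classification, with hypothetical names and only the OpenAI-style framing handled:

```rust
/// One normalized event from a provider's SSE stream.
#[derive(Debug, PartialEq)]
enum SseEvent {
    Data(String), // JSON payload to forward downstream
    Done,         // end-of-stream sentinel
    Ignore,       // `event:` lines, `:` keep-alive comments, blanks
}

/// Classify a single raw SSE line (OpenAI-style framing). An Anthropic
/// adapter would instead track `event:` lines to detect completion.
fn parse_sse_line(line: &str) -> SseEvent {
    let line = line.trim_end();
    if let Some(payload) = line.strip_prefix("data:") {
        let payload = payload.trim_start();
        if payload == "[DONE]" {
            SseEvent::Done
        } else {
            SseEvent::Data(payload.to_string())
        }
    } else {
        SseEvent::Ignore
    }
}

fn main() {
    assert_eq!(
        parse_sse_line("data: {\"delta\":\"hi\"}"),
        SseEvent::Data("{\"delta\":\"hi\"}".to_string())
    );
    assert_eq!(parse_sse_line("data: [DONE]"), SseEvent::Done);
    assert_eq!(parse_sse_line("event: message_stop"), SseEvent::Ignore);
    assert_eq!(parse_sse_line(": keep-alive"), SseEvent::Ignore);
    println!("sse parsing checks passed");
}
```

Zero-copy forwarding in the real gateway would pass byte slices through rather than allocating a `String` per chunk; the sketch favors clarity over that optimization.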