AI Gateway API Reference
The AI Gateway (L2) provides an OpenAI-compatible API for routing LLM requests to multiple providers (OpenAI, Anthropic, Gemini, Ollama). It includes automatic PII redaction, semantic caching, token budget enforcement, and circuit-breaker resilience.
Base URL: http://localhost:4000
All endpoints require authentication via the Authorization header (Bearer JWT) and tenant identification via either the JWT tenant_id claim or the x-tenant-id header.
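Since every authenticated call carries the same two headers, clients typically build them once; a minimal Python sketch of such a helper (the token and tenant values are placeholders):

```python
def auth_headers(jwt_token, tenant_id=None):
    """Build the headers the gateway requires on authenticated endpoints.

    tenant_id may be omitted when the JWT already carries a tenant_id claim.
    """
    headers = {"Authorization": f"Bearer {jwt_token}"}
    if tenant_id is not None:
        headers["x-tenant-id"] = tenant_id
    return headers
```

The result can be passed to any HTTP client, e.g. `requests.post(url, headers=auth_headers(token, "tenant-alpha"), json=body)`.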
Health
GET /
Service information and endpoint discovery.
Headers: None required.
Response: 200 OK
```json
{
  "service": "ai-gateway",
  "engine": "rust",
  "version": "0.1.0",
  "endpoints": {
    "/health": "Health check",
    "/v1/models": "List available models",
    "/v1/chat/completions": "Chat completions (OpenAI-compatible)",
    "/v1/budget/:tenant_id": "Token budget management",
    "/metrics": "Prometheus metrics"
  }
}
```

curl Example:

```bash
curl http://localhost:4000/
```

GET /health
Health check endpoint.
Headers: None required.
Response: 200 OK
```json
{
  "status": "ok",
  "service": "ai-gateway",
  "engine": "rust"
}
```

curl Example:

```bash
curl http://localhost:4000/health
```

Models
GET /v1/models
List all available LLM models configured in the gateway.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
Response: 200 OK
```json
{
  "object": "list",
  "data": [
    {"id": "gpt-4o", "object": "model", "provider": "openai"},
    {"id": "claude-sonnet-4-20250514", "object": "model", "provider": "anthropic"},
    {"id": "gemini-pro", "object": "model", "provider": "gemini"},
    {"id": "llama3", "object": "model", "provider": "ollama"}
  ]
}
```

curl Example:

```bash
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer $TOKEN"
```

Chat Completions
POST /v1/chat/completions
Send a chat completion request through the AI Gateway. OpenAI-compatible request and response format.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
| x-tenant-id | Conditional | Tenant identifier; required unless the JWT carries a tenant_id claim |
| Content-Type | Yes | application/json |
| x-request-id | No | Request trace ID for correlation |
| traceparent | No | W3C Trace Context header; propagated to downstream providers |
Request Body:
```json
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID from /v1/models |
| messages | array | Yes | Chat messages array with role and content |
| stream | boolean | No | Enable SSE streaming (default: false) |
Response (non-streaming): 200 OK
```json
{
  "id": "chatcmpl-abc123",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}
```

Response Headers (non-streaming):

| Header | Values | Description |
|---|---|---|
| x-cache | HIT, SEMANTIC_HIT, MISS | Cache status for the request |
| x-tenant-id | string | Echoed tenant identifier |
Response (streaming): 200 OK with Content-Type: text/event-stream

```
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":"!"}}]}
data: [DONE]
```

Error Responses:

| Status | Type | Description |
|---|---|---|
| 401 | unauthorized | Missing tenant_id; provide the x-tenant-id header or a JWT with a tenant_id claim |
| 400 | invalid_request_error | Model not found |
| 429 | budget_exceeded | Token budget exceeded for tenant |
| 502 | provider_error | Upstream LLM provider error (after retries) |
| 503 | all_providers_unavailable | Circuit breaker open for all providers |
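A streaming client reads `data:` lines from the response body and stops at the `[DONE]` sentinel; a sketch over an iterator of already-decoded SSE lines (the HTTP transport is omitted here):

```python
import json

def parse_sse_chunks(lines):
    """Yield delta content from SSE `data:` lines, stopping at [DONE].

    `lines` is any iterable of decoded text lines, e.g. an HTTP client's
    line iterator over the text/event-stream body.
    """
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

stream = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":"!"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_chunks(stream)))  # Hello!
```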
curl Example (non-streaming):
```bash
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: tenant-alpha" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

curl Example (streaming):
```bash
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: tenant-alpha" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```

:::note
PII Redaction: When enabled, PII (SSN, email, credit card, phone, name, address) is automatically detected and redacted from the last message before forwarding to the LLM provider. The gateway logs pii_detected=true but never logs the actual PII content.
:::
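The redact-last-message behavior can be approximated with regular expressions; a hedged sketch (these patterns are illustrative only, not the gateway's actual detectors):

```python
import re

# Illustrative patterns only; the gateway's real detectors are not documented here.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_last_message(messages):
    """Redact PII from the last message only, mirroring the note above.

    Returns the (possibly modified) messages and a pii_detected flag;
    the original text is never kept alongside the flag.
    """
    if not messages:
        return messages, False
    last = dict(messages[-1])
    detected = False
    for kind, pattern in PII_PATTERNS.items():
        last["content"], n = pattern.subn(f"[REDACTED_{kind.upper()}]", last["content"])
        detected = detected or n > 0
    return messages[:-1] + [last], detected
```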
:::note
Caching: Non-streaming requests are cached in two tiers: (1) a Redis exact-match cache keyed by {tenant_id}:{model}:{prompt_hash}, and (2) a Qdrant semantic-similarity cache. Cache hits return the x-cache: HIT or x-cache: SEMANTIC_HIT header. Streaming requests bypass the cache.
:::
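The tier-1 key shape can be reproduced client-side, e.g. to correlate cache behavior in logs; a sketch assuming SHA-256 over canonically serialized messages (the hash algorithm is an assumption; only the {tenant_id}:{model}:{prompt_hash} shape is documented):

```python
import hashlib
import json

def exact_cache_key(tenant_id, model, messages):
    """Build a key in the documented {tenant_id}:{model}:{prompt_hash} shape.

    SHA-256 over a canonical JSON serialization is an assumption here;
    the reference only specifies the overall key format.
    """
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    prompt_hash = hashlib.sha256(canonical.encode()).hexdigest()
    return f"{tenant_id}:{model}:{prompt_hash}"
```

Identical requests from the same tenant and model map to the same key, which is what makes tier-1 an exact-match cache.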
:::note
Budget Enforcement: Token budget is checked before forwarding the request (returns 429 if exceeded). Tokens are deducted from the budget after receiving the provider response. Budget key format: {tenant_id}:budget:tokens.
:::
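The check-then-deduct flow can be sketched against any counter store; a minimal in-memory stand-in (the gateway itself uses Redis under the {tenant_id}:budget:tokens key, and a missing key means unlimited, matching the GET /v1/budget response below):

```python
class BudgetStore:
    """In-memory stand-in for the Redis budget counter described above."""

    def __init__(self):
        self._budgets = {}

    def set_budget(self, tenant_id, tokens):
        self._budgets[f"{tenant_id}:budget:tokens"] = tokens

    def check(self, tenant_id):
        """Pre-flight check: True if the tenant may proceed (429 otherwise)."""
        remaining = self._budgets.get(f"{tenant_id}:budget:tokens")
        return remaining is None or remaining > 0  # missing key = unlimited

    def deduct(self, tenant_id, total_tokens):
        """Deduct usage after the provider response, per the note above."""
        key = f"{tenant_id}:budget:tokens"
        if key in self._budgets:
            self._budgets[key] -= total_tokens
```

Because deduction happens after the response, a request that starts within budget can still drive the counter negative; subsequent requests are then rejected.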
:::note
Resilience: Requests are retried with exponential backoff (max 2 retries, 100 ms initial delay). A circuit breaker tracks provider failures; when open, the gateway tries fallback providers from the configured fallback chain. If all providers are unavailable, the gateway returns 503.
:::
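The retry-then-fallback behavior can be sketched as: try each provider in the chain, retrying each up to twice with exponential backoff from 100 ms, and fail only when the whole chain is exhausted (circuit-breaker state tracking is omitted; the provider callables are stand-ins):

```python
import time

def call_with_resilience(providers, request, max_retries=2, initial_delay=0.1):
    """Try providers in fallback order; retry each with exponential backoff.

    `providers` is an ordered list of (name, callable) pairs; each callable
    raises on failure. Mirrors the note above: max 2 retries, 100 ms initial
    delay, and a final failure surfaced as HTTP 503.
    """
    for name, call in providers:
        delay = initial_delay
        for attempt in range(1 + max_retries):
            try:
                return call(request)
            except Exception:
                if attempt < max_retries:
                    time.sleep(delay)
                    delay *= 2  # exponential backoff: 100ms, 200ms, ...
    raise RuntimeError("all_providers_unavailable")  # surfaced as HTTP 503
```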
Budget
GET /v1/budget/:tenant_id
Get the remaining token budget for a tenant.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
Path Parameters:
| Parameter | Description |
|---|---|
| tenant_id | Tenant identifier |
Response (budget set): 200 OK
```json
{
  "tenant_id": "tenant-alpha",
  "remaining_tokens": 950000
}
```

Response (no budget set): 200 OK

```json
{
  "tenant_id": "tenant-alpha",
  "remaining_tokens": null,
  "unlimited": true
}
```

curl Example:

```bash
curl http://localhost:4000/v1/budget/tenant-alpha \
  -H "Authorization: Bearer $TOKEN"
```

POST /v1/budget/:tenant_id
Set the token budget for a tenant.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
| Content-Type | Yes | application/json |
Path Parameters:
| Parameter | Description |
|---|---|
| tenant_id | Tenant identifier |
Request Body:
```json
{
  "tokens": 1000000
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| tokens | integer | Yes | Number of tokens to set as budget |
Response: 200 OK
```json
{
  "tenant_id": "tenant-alpha",
  "tokens_set": 1000000
}
```

Error Response: 400 Bad Request

```json
{
  "error": {
    "message": "Missing or invalid 'tokens' field (must be integer)",
    "type": "invalid_request"
  }
}
```

curl Example:

```bash
curl -X POST http://localhost:4000/v1/budget/tenant-alpha \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tokens": 1000000}'
```

Metrics
GET /metrics
Prometheus-format metrics endpoint.
Headers: None required.
Response: 200 OK (text/plain, Prometheus exposition format)
```
# HELP ai_gateway_requests_total Total requests
# TYPE ai_gateway_requests_total counter
ai_gateway_requests_total 1542
# HELP ai_gateway_cache_hits_total Cache hits
# TYPE ai_gateway_cache_hits_total counter
ai_gateway_cache_hits_total 312
# HELP ai_gateway_cache_misses_total Cache misses
# TYPE ai_gateway_cache_misses_total counter
ai_gateway_cache_misses_total 1230
# HELP ai_gateway_pii_detected_total PII detections
# TYPE ai_gateway_pii_detected_total counter
ai_gateway_pii_detected_total 7
# HELP ai_gateway_budget_exceeded_total Budget exceeded events
# TYPE ai_gateway_budget_exceeded_total counter
ai_gateway_budget_exceeded_total 3
# HELP ai_gateway_tokens_total Total tokens processed
# TYPE ai_gateway_tokens_total counter
ai_gateway_tokens_total 482910
# HELP ai_gateway_active_requests Active concurrent requests
# TYPE ai_gateway_active_requests gauge
ai_gateway_active_requests 5
# HELP ai_gateway_latency_seconds Request latency histogram
# TYPE ai_gateway_latency_seconds histogram
ai_gateway_latency_seconds_bucket{le="0.1"} 1400
```

curl Example:

```bash
curl http://localhost:4000/metrics
```

Endpoint Summary
| Method | Path | Description |
|---|---|---|
| GET | / | Service info and endpoint discovery |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /v1/models | List available LLM models |
| POST | /v1/chat/completions | Chat completion (OpenAI-compatible) |
| GET | /v1/budget/:tenant_id | Get tenant token budget |
| POST | /v1/budget/:tenant_id | Set tenant token budget |
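The cache counters exposed at /metrics combine into a hit rate; a sketch that parses plain (unlabeled) samples from the Prometheus exposition format, using the sample values shown in the metrics example:

```python
def parse_counters(exposition):
    """Parse simple `name value` samples from Prometheus text format.

    Skips HELP/TYPE comments and labeled samples; this is just enough
    for the plain counters in the example /metrics response.
    """
    counters = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, value = line.rsplit(" ", 1)
        counters[name] = float(value)
    return counters

sample = """\
# TYPE ai_gateway_cache_hits_total counter
ai_gateway_cache_hits_total 312
# TYPE ai_gateway_cache_misses_total counter
ai_gateway_cache_misses_total 1230
"""
c = parse_counters(sample)
hit_rate = c["ai_gateway_cache_hits_total"] / (
    c["ai_gateway_cache_hits_total"] + c["ai_gateway_cache_misses_total"]
)
print(f"cache hit rate: {hit_rate:.1%}")  # 20.2% for the sample values
```

In production you would scrape these counters with Prometheus and compute rates there; this is only a local sanity check.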