
# AI Gateway API Reference

The AI Gateway (L2) provides an OpenAI-compatible API for routing LLM requests to multiple providers (OpenAI, Anthropic, Gemini, Ollama). It includes automatic PII redaction, semantic caching, token budget enforcement, and circuit-breaker resilience.

Base URL: `http://localhost:4000`

All endpoints require authentication via the `Authorization` header (Bearer JWT) and tenant identification via either the JWT `tenant_id` claim or the `x-tenant-id` header.


## Health

### GET /

Service information and endpoint discovery.

Headers: None required.

Response: 200 OK

```json
{
  "service": "ai-gateway",
  "engine": "rust",
  "version": "0.1.0",
  "endpoints": {
    "/health": "Health check",
    "/v1/models": "List available models",
    "/v1/chat/completions": "Chat completions (OpenAI-compatible)",
    "/v1/budget/:tenant_id": "Token budget management",
    "/metrics": "Prometheus metrics"
  }
}
```

curl Example:

```bash
curl http://localhost:4000/
```

### GET /health

Health check endpoint.

Headers: None required.

Response: 200 OK

```json
{
  "status": "ok",
  "service": "ai-gateway",
  "engine": "rust"
}
```

curl Example:

```bash
curl http://localhost:4000/health
```

## Models

### GET /v1/models

List all available LLM models configured in the gateway.

Headers:

| Header | Required | Description |
|---|---|---|
| `Authorization` | Yes | `Bearer {jwt-token}` |

Response: 200 OK

```json
{
  "object": "list",
  "data": [
    {"id": "gpt-4o", "object": "model", "provider": "openai"},
    {"id": "claude-sonnet-4-20250514", "object": "model", "provider": "anthropic"},
    {"id": "gemini-pro", "object": "model", "provider": "gemini"},
    {"id": "llama3", "object": "model", "provider": "ollama"}
  ]
}
```

curl Example:

```bash
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer $TOKEN"
```

## Chat Completions

### POST /v1/chat/completions

Send a chat completion request through the AI Gateway. OpenAI-compatible request and response format.

Headers:

| Header | Required | Description |
|---|---|---|
| `Authorization` | Yes | `Bearer {jwt-token}` |
| `x-tenant-id` | Yes | Tenant identifier (also accepted from the JWT `tenant_id` claim) |
| `Content-Type` | Yes | `application/json` |
| `x-request-id` | No | Request trace ID for correlation |
| `traceparent` | No | W3C Trace Context header, propagated to downstream providers |

Request Body:

```json
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `model` | string | Yes | Model ID from `/v1/models` |
| `messages` | array | Yes | Chat messages array with `role` and `content` |
| `stream` | boolean | No | Enable SSE streaming (default: `false`) |

Response (non-streaming): 200 OK

```json
{
  "id": "chatcmpl-abc123",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}
```

Response Headers (non-streaming):

| Header | Values | Description |
|---|---|---|
| `x-cache` | `HIT`, `SEMANTIC_HIT`, `MISS` | Cache status for the request |
| `x-tenant-id` | string | Echoed tenant identifier |

Response (streaming): 200 OK with `Content-Type: text/event-stream`

```
data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[{"delta":{"content":"!"}}]}

data: [DONE]
```
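A client consuming this stream can reassemble the reply by parsing each `data:` line until the `[DONE]` sentinel. A minimal Python sketch, using the OpenAI-style delta fields shown above:

```python
import json

def collect_stream(sse_lines):
    """Reassemble assistant text from SSE 'data:' lines (OpenAI-style deltas)."""
    parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        parts.append(delta.get("content", ""))
    return "".join(parts)
```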

Error Responses:

| Status | Type | Description |
|---|---|---|
| 401 | `unauthorized` | Missing `tenant_id`; provide the `x-tenant-id` header or a JWT with a `tenant_id` claim |
| 400 | `invalid_request_error` | Model not found |
| 429 | `budget_exceeded` | Token budget exceeded for tenant |
| 502 | `provider_error` | Upstream LLM provider error (after retries) |
| 503 | `all_providers_unavailable` | Circuit breakers open for all providers |

curl Example (non-streaming):

```bash
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: tenant-alpha" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

curl Example (streaming):

```bash
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: tenant-alpha" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```

:::note PII Redaction
When enabled, PII (SSN, email, credit card, phone, name, address) is automatically detected and redacted from the last message before it is forwarded to the LLM provider. The gateway logs `pii_detected=true` but never logs the PII content itself.
:::
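For illustration, the redaction flow might look like the following Python sketch. The regex patterns and placeholder format are assumptions; only the behavior (redact the last message, report detection without logging content) comes from the note above.

```python
import re

# Illustrative patterns only; the gateway's real detectors cover more PII types.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_last_message(messages):
    """Redact PII from the last message only; return (messages, detected)."""
    detected = False
    last = dict(messages[-1])  # copy so the caller's list is untouched
    for kind, pattern in PII_PATTERNS.items():
        last["content"], n = pattern.subn(
            f"[REDACTED_{kind.upper()}]", last["content"]
        )
        detected = detected or n > 0
    return messages[:-1] + [last], detected
```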

:::note Caching
Non-streaming requests are cached in two tiers: (1) a Redis exact-match cache keyed by `{tenant_id}:{model}:{prompt_hash}`, and (2) a Qdrant semantic-similarity cache. Cache hits return the `x-cache: HIT` or `x-cache: SEMANTIC_HIT` header. Streaming requests bypass the cache.
:::
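The tier-1 key layout can be sketched in Python. Only the `{tenant_id}:{model}:{prompt_hash}` layout comes from the docs; hashing the canonical JSON of the messages with SHA-256 is an assumption for illustration.

```python
import hashlib
import json

def exact_cache_key(tenant_id: str, model: str, messages) -> str:
    """Build the tier-1 exact-match key: {tenant_id}:{model}:{prompt_hash}.

    The hash choice (SHA-256 over sorted-key JSON) is illustrative only.
    """
    prompt_hash = hashlib.sha256(
        json.dumps(messages, sort_keys=True).encode()
    ).hexdigest()
    return f"{tenant_id}:{model}:{prompt_hash}"
```

Identical tenant, model, and messages always yield the same key, which is what makes the exact-match tier work.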

:::note Budget Enforcement
The token budget is checked before the request is forwarded (returning 429 if exceeded). Tokens are deducted from the budget after the provider response is received. Budget key format: `{tenant_id}:budget:tokens`.
:::
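The check-then-deduct sequence can be sketched with an in-memory stand-in for the Redis counter. Treating a missing key as unlimited follows the `GET /v1/budget` response shape below; clamping at zero is an assumption.

```python
class TokenBudget:
    """In-memory sketch of check-then-deduct (the gateway itself uses Redis,
    keyed {tenant_id}:budget:tokens; an absent key means unlimited)."""

    def __init__(self):
        self.budgets = {}  # tenant_id -> remaining tokens (absent = unlimited)

    def check(self, tenant_id: str) -> bool:
        """Called before forwarding; False maps to a 429 budget_exceeded."""
        remaining = self.budgets.get(tenant_id)
        return remaining is None or remaining > 0

    def deduct(self, tenant_id: str, total_tokens: int) -> None:
        """Called after the provider responds, using usage.total_tokens."""
        if tenant_id in self.budgets:
            self.budgets[tenant_id] = max(0, self.budgets[tenant_id] - total_tokens)
```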

:::note Resilience
Requests are retried with exponential backoff (max 2 retries, 100 ms initial delay). A circuit breaker tracks provider failures; when it opens, the gateway tries fallback providers from the configured fallback chain. If all providers are unavailable, the gateway returns 503.
:::
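The retry-plus-fallback flow can be sketched as follows. The circuit-breaker state tracking is elided; only the backoff schedule (2 retries, 100 ms initial delay, doubling assumed) and the fallback chain come from the note above.

```python
import time

def call_with_resilience(providers, send, max_retries=2, initial_delay=0.1):
    """Try each provider with exponential backoff, then fall through the chain.

    Raises RuntimeError when every provider fails, mirroring the 503 case.
    """
    for provider in providers:
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                return send(provider)
            except Exception:
                if attempt == max_retries:
                    break  # provider exhausted; try the next in the chain
                time.sleep(delay)
                delay *= 2  # exponential backoff
    raise RuntimeError("all_providers_unavailable")
```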


## Budget

### GET /v1/budget/:tenant_id

Get the remaining token budget for a tenant.

Headers:

| Header | Required | Description |
|---|---|---|
| `Authorization` | Yes | `Bearer {jwt-token}` |

Path Parameters:

| Parameter | Description |
|---|---|
| `tenant_id` | Tenant identifier |

Response (budget set): 200 OK

```json
{
  "tenant_id": "tenant-alpha",
  "remaining_tokens": 950000
}
```

Response (no budget set): 200 OK

```json
{
  "tenant_id": "tenant-alpha",
  "remaining_tokens": null,
  "unlimited": true
}
```

curl Example:

```bash
curl http://localhost:4000/v1/budget/tenant-alpha \
  -H "Authorization: Bearer $TOKEN"
```

### POST /v1/budget/:tenant_id

Set the token budget for a tenant.

Headers:

| Header | Required | Description |
|---|---|---|
| `Authorization` | Yes | `Bearer {jwt-token}` |
| `Content-Type` | Yes | `application/json` |

Path Parameters:

| Parameter | Description |
|---|---|
| `tenant_id` | Tenant identifier |

Request Body:

```json
{
  "tokens": 1000000
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| `tokens` | integer | Yes | Number of tokens to set as the budget |

Response: 200 OK

```json
{
  "tenant_id": "tenant-alpha",
  "tokens_set": 1000000
}
```

Error Response: 400 Bad Request

```json
{
  "error": {
    "message": "Missing or invalid 'tokens' field (must be integer)",
    "type": "invalid_request"
  }
}
```

curl Example:

```bash
curl -X POST http://localhost:4000/v1/budget/tenant-alpha \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tokens": 1000000}'
```

## Metrics

### GET /metrics

Prometheus-format metrics endpoint.

Headers: None required.

Response: 200 OK (text/plain, Prometheus exposition format)

```
# HELP ai_gateway_requests_total Total requests
# TYPE ai_gateway_requests_total counter
ai_gateway_requests_total 1542
# HELP ai_gateway_cache_hits_total Cache hits
# TYPE ai_gateway_cache_hits_total counter
ai_gateway_cache_hits_total 312
# HELP ai_gateway_cache_misses_total Cache misses
# TYPE ai_gateway_cache_misses_total counter
ai_gateway_cache_misses_total 1230
# HELP ai_gateway_pii_detected_total PII detections
# TYPE ai_gateway_pii_detected_total counter
ai_gateway_pii_detected_total 7
# HELP ai_gateway_budget_exceeded_total Budget exceeded events
# TYPE ai_gateway_budget_exceeded_total counter
ai_gateway_budget_exceeded_total 3
# HELP ai_gateway_tokens_total Total tokens processed
# TYPE ai_gateway_tokens_total counter
ai_gateway_tokens_total 482910
# HELP ai_gateway_active_requests Active concurrent requests
# TYPE ai_gateway_active_requests gauge
ai_gateway_active_requests 5
# HELP ai_gateway_latency_seconds Request latency histogram
# TYPE ai_gateway_latency_seconds histogram
ai_gateway_latency_seconds_bucket{le="0.1"} 1400
```

curl Example:

```bash
curl http://localhost:4000/metrics
```
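For quick checks or dashboards, the counters above can be parsed with a few lines of Python. This minimal sketch handles only single-value sample lines; a real consumer should use a proper Prometheus client library.

```python
def parse_samples(exposition: str) -> dict:
    """Parse simple counter/gauge samples from Prometheus exposition text.

    Sketch only: labels are kept as part of the sample name, and histogram
    structure is not interpreted.
    """
    samples = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE metadata and blank lines
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            pass  # ignore lines that do not end in a number
    return samples
```

For example, the sample output above gives a cache hit ratio of 312 / (312 + 1230), roughly 20%.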

## Endpoint Summary

| Method | Path | Description |
|---|---|---|
| GET | `/` | Service info and endpoint discovery |
| GET | `/health` | Health check |
| GET | `/metrics` | Prometheus metrics |
| GET | `/v1/models` | List available LLM models |
| POST | `/v1/chat/completions` | Chat completion (OpenAI-compatible) |
| GET | `/v1/budget/:tenant_id` | Get tenant token budget |
| POST | `/v1/budget/:tenant_id` | Set tenant token budget |
