AI Gateway API Reference
The AI Gateway (L2) provides an OpenAI-compatible API for routing LLM requests to multiple providers (OpenAI, Anthropic, Gemini, Ollama). It includes automatic PII redaction, semantic caching, token budget enforcement, and circuit-breaker resilience.
Base URL: http://localhost:4000
All endpoints require authentication via the Authorization header (Bearer JWT) and tenant identification via either the JWT tenant_id claim or the x-tenant-id header.
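Since every authenticated call carries the same two headers, clients typically build them once; a minimal Python sketch of such a helper (the token and tenant values are placeholders):

```python
def auth_headers(jwt_token, tenant_id=None):
    """Build the headers the gateway requires on authenticated endpoints.

    tenant_id may be omitted when the JWT already carries a tenant_id claim.
    """
    headers = {"Authorization": f"Bearer {jwt_token}"}
    if tenant_id is not None:
        headers["x-tenant-id"] = tenant_id
    return headers
```

The result can be passed to any HTTP client, e.g. `requests.post(url, headers=auth_headers(token, "tenant-alpha"), json=body)`.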
Health
GET /
Service information and endpoint discovery.
Headers: None required.
Response: 200 OK
```json
{
  "service": "ai-gateway",
  "engine": "rust",
  "version": "0.1.0",
  "endpoints": {
    "/health": "Health check",
    "/v1/models": "List available models",
    "/v1/chat/completions": "Chat completions (OpenAI-compatible)",
    "/v1/budget/:tenant_id": "Token budget management",
    "/metrics": "Prometheus metrics"
  }
}
```

curl Example:

```bash
curl http://localhost:4000/
```

GET /health
Health check endpoint.
Headers: None required.
Response: 200 OK
```json
{
  "status": "ok",
  "service": "ai-gateway",
  "engine": "rust"
}
```

curl Example:

```bash
curl http://localhost:4000/health
```

Models
GET /v1/models
List all available LLM models configured in the gateway.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
Response: 200 OK
```json
{
  "object": "list",
  "data": [
    {"id": "gpt-4o", "object": "model", "provider": "openai"},
    {"id": "claude-sonnet-4-20250514", "object": "model", "provider": "anthropic"},
    {"id": "gemini-pro", "object": "model", "provider": "gemini"},
    {"id": "llama3", "object": "model", "provider": "ollama"}
  ]
}
```

curl Example:

```bash
curl http://localhost:4000/v1/models \
  -H "Authorization: Bearer $TOKEN"
```

Chat Completions
POST /v1/chat/completions
Send a chat completion request through the AI Gateway. OpenAI-compatible request and response format.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
| x-tenant-id | Conditional | Tenant identifier; required unless the JWT carries a tenant_id claim |
| Content-Type | Yes | application/json |
| x-request-id | No | Request trace ID for correlation |
| traceparent | No | W3C Trace Context header; propagated to downstream providers |
Request Body:
```json
{
  "model": "gpt-4o",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model ID from /v1/models |
| messages | array | Yes | Chat messages array with role and content |
| stream | boolean | No | Enable SSE streaming (default: false) |
Response (non-streaming): 200 OK
```json
{
  "id": "chatcmpl-abc123",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      }
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}
```

Response Headers (non-streaming):

| Header | Values | Description |
|---|---|---|
| x-cache | HIT, SEMANTIC_HIT, MISS | Cache status for the request |
| x-tenant-id | string | Echoed tenant identifier |
Response (streaming): 200 OK with Content-Type: text/event-stream

```
data: {"choices":[{"delta":{"content":"Hello"}}]}
data: {"choices":[{"delta":{"content":"!"}}]}
data: [DONE]
```

Error Responses:

| Status | Type | Description |
|---|---|---|
| 401 | unauthorized | Missing tenant_id; provide the x-tenant-id header or a JWT with a tenant_id claim |
| 400 | invalid_request_error | Model not found |
| 429 | budget_exceeded | Token budget exceeded for tenant |
| 502 | provider_error | Upstream LLM provider error (after retries) |
| 503 | all_providers_unavailable | Circuit breaker open for all providers |
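A streaming client reads `data:` lines from the response body and stops at the `[DONE]` sentinel; a sketch over an iterator of already-decoded SSE lines (the HTTP transport is omitted here):

```python
import json

def parse_sse_chunks(lines):
    """Yield delta content from SSE `data:` lines, stopping at [DONE].

    `lines` is any iterable of decoded text lines, e.g. an HTTP client's
    line iterator over the text/event-stream body.
    """
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators and keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

stream = [
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":"!"}}]}',
    "data: [DONE]",
]
print("".join(parse_sse_chunks(stream)))  # Hello!
```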
curl Example (non-streaming):
```bash
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: tenant-alpha" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```

curl Example (streaming):
```bash
curl -X POST http://localhost:4000/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "x-tenant-id: tenant-alpha" \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
```

:::note
PII Redaction: When enabled, PII (SSN, email, credit card, phone, name, address) is automatically detected and redacted from the last message before forwarding to the LLM provider. The gateway logs pii_detected=true but never logs the actual PII content.
:::
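The redact-last-message behavior can be approximated with regular expressions; a hedged sketch (these patterns are illustrative only, not the gateway's actual detectors):

```python
import re

# Illustrative patterns only; the gateway's real detectors are not documented here.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact_last_message(messages):
    """Redact PII from the last message only, mirroring the note above.

    Returns the (possibly modified) messages and a pii_detected flag;
    the original text is never kept alongside the flag.
    """
    if not messages:
        return messages, False
    last = dict(messages[-1])
    detected = False
    for kind, pattern in PII_PATTERNS.items():
        last["content"], n = pattern.subn(f"[REDACTED_{kind.upper()}]", last["content"])
        detected = detected or n > 0
    return messages[:-1] + [last], detected
```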
:::note
Caching: Non-streaming requests are cached in two tiers: (1) a Redis exact-match cache keyed by {tenant_id}:{model}:{prompt_hash}, and (2) a Qdrant semantic-similarity cache. Cache hits return the x-cache: HIT or x-cache: SEMANTIC_HIT header. Streaming requests bypass the cache.
:::
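The tier-1 key shape can be reproduced client-side, e.g. to correlate cache behavior in logs; a sketch assuming SHA-256 over canonically serialized messages (the hash algorithm is an assumption; only the {tenant_id}:{model}:{prompt_hash} shape is documented):

```python
import hashlib
import json

def exact_cache_key(tenant_id, model, messages):
    """Build a key in the documented {tenant_id}:{model}:{prompt_hash} shape.

    SHA-256 over a canonical JSON serialization is an assumption here;
    the reference only specifies the overall key format.
    """
    canonical = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    prompt_hash = hashlib.sha256(canonical.encode()).hexdigest()
    return f"{tenant_id}:{model}:{prompt_hash}"
```

Identical requests from the same tenant and model map to the same key, which is what makes tier-1 an exact-match cache.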
:::note
Budget Enforcement: Token budget is checked before forwarding the request (returns 429 if exceeded). Tokens are deducted from the budget after receiving the provider response. Budget key format: {tenant_id}:budget:tokens.
:::
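The check-then-deduct flow can be sketched against any counter store; a minimal in-memory stand-in (the gateway itself uses Redis under the {tenant_id}:budget:tokens key, and a missing key means unlimited, matching the GET /v1/budget response below):

```python
class BudgetStore:
    """In-memory stand-in for the Redis budget counter described above."""

    def __init__(self):
        self._budgets = {}

    def set_budget(self, tenant_id, tokens):
        self._budgets[f"{tenant_id}:budget:tokens"] = tokens

    def check(self, tenant_id):
        """Pre-flight check: True if the tenant may proceed (429 otherwise)."""
        remaining = self._budgets.get(f"{tenant_id}:budget:tokens")
        return remaining is None or remaining > 0  # missing key = unlimited

    def deduct(self, tenant_id, total_tokens):
        """Deduct usage after the provider response, per the note above."""
        key = f"{tenant_id}:budget:tokens"
        if key in self._budgets:
            self._budgets[key] -= total_tokens
```

Because deduction happens after the response, a request that starts within budget can still drive the counter negative; subsequent requests are then rejected.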
:::note
Resilience: Requests are retried with exponential backoff (max 2 retries, 100 ms initial delay). A circuit breaker tracks provider failures; when open, the gateway tries fallback providers from the configured fallback chain. If all providers are unavailable, the gateway returns 503.
:::
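The retry-then-fallback behavior can be sketched as: try each provider in the chain, retrying each up to twice with exponential backoff from 100 ms, and fail only when the whole chain is exhausted (circuit-breaker state tracking is omitted; the provider callables are stand-ins):

```python
import time

def call_with_resilience(providers, request, max_retries=2, initial_delay=0.1):
    """Try providers in fallback order; retry each with exponential backoff.

    `providers` is an ordered list of (name, callable) pairs; each callable
    raises on failure. Mirrors the note above: max 2 retries, 100 ms initial
    delay, and a final failure surfaced as HTTP 503.
    """
    for name, call in providers:
        delay = initial_delay
        for attempt in range(1 + max_retries):
            try:
                return call(request)
            except Exception:
                if attempt < max_retries:
                    time.sleep(delay)
                    delay *= 2  # exponential backoff: 100ms, 200ms, ...
    raise RuntimeError("all_providers_unavailable")  # surfaced as HTTP 503
```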
Budget
GET /v1/budget/:tenant_id
Get the remaining token budget for a tenant.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
Path Parameters:
| Parameter | Description |
|---|---|
| tenant_id | Tenant identifier |
Response (budget set): 200 OK
```json
{
  "tenant_id": "tenant-alpha",
  "remaining_tokens": 950000
}
```

Response (no budget set): 200 OK

```json
{
  "tenant_id": "tenant-alpha",
  "remaining_tokens": null,
  "unlimited": true
}
```

curl Example:

```bash
curl http://localhost:4000/v1/budget/tenant-alpha \
  -H "Authorization: Bearer $TOKEN"
```

POST /v1/budget/:tenant_id
Set the token budget for a tenant.
Headers:
| Header | Required | Description |
|---|---|---|
| Authorization | Yes | Bearer {jwt-token} |
| Content-Type | Yes | application/json |
Path Parameters:
| Parameter | Description |
|---|---|
| tenant_id | Tenant identifier |
Request Body:
```json
{
  "tokens": 1000000
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| tokens | integer | Yes | Number of tokens to set as budget |
Response: 200 OK
```json
{
  "tenant_id": "tenant-alpha",
  "tokens_set": 1000000
}
```

Error Response: 400 Bad Request

```json
{
  "error": {
    "message": "Missing or invalid 'tokens' field (must be integer)",
    "type": "invalid_request"
  }
}
```

curl Example:

```bash
curl -X POST http://localhost:4000/v1/budget/tenant-alpha \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tokens": 1000000}'
```

Metrics
GET /metrics
Prometheus-format metrics endpoint.
Headers: None required.
Response: 200 OK (text/plain, Prometheus exposition format)
```
# HELP ai_gateway_requests_total Total requests
# TYPE ai_gateway_requests_total counter
ai_gateway_requests_total 1542
# HELP ai_gateway_cache_hits_total Cache hits
# TYPE ai_gateway_cache_hits_total counter
ai_gateway_cache_hits_total 312
# HELP ai_gateway_cache_misses_total Cache misses
# TYPE ai_gateway_cache_misses_total counter
ai_gateway_cache_misses_total 1230
# HELP ai_gateway_pii_detected_total PII detections
# TYPE ai_gateway_pii_detected_total counter
ai_gateway_pii_detected_total 7
# HELP ai_gateway_budget_exceeded_total Budget exceeded events
# TYPE ai_gateway_budget_exceeded_total counter
ai_gateway_budget_exceeded_total 3
# HELP ai_gateway_tokens_total Total tokens processed
# TYPE ai_gateway_tokens_total counter
ai_gateway_tokens_total 482910
# HELP ai_gateway_active_requests Active concurrent requests
# TYPE ai_gateway_active_requests gauge
ai_gateway_active_requests 5
# HELP ai_gateway_latency_seconds Request latency histogram
# TYPE ai_gateway_latency_seconds histogram
ai_gateway_latency_seconds_bucket{le="0.1"} 1400
```

curl Example:

```bash
curl http://localhost:4000/metrics
```

Endpoint Summary
| Method | Path | Description |
|---|---|---|
| GET | / | Service info and endpoint discovery |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics |
| GET | /v1/models | List available LLM models |
| POST | /v1/chat/completions | Chat completion (OpenAI-compatible) |
| GET | /v1/budget/:tenant_id | Get tenant token budget |
| POST | /v1/budget/:tenant_id | Set tenant token budget |
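The cache counters exposed at /metrics combine into a hit rate; a sketch that parses plain (unlabeled) samples from the Prometheus exposition format, using the sample values shown in the metrics example:

```python
def parse_counters(exposition):
    """Parse simple `name value` samples from Prometheus text format.

    Skips HELP/TYPE comments and labeled samples; this is just enough
    for the plain counters in the example /metrics response.
    """
    counters = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        name, value = line.rsplit(" ", 1)
        counters[name] = float(value)
    return counters

sample = """\
# TYPE ai_gateway_cache_hits_total counter
ai_gateway_cache_hits_total 312
# TYPE ai_gateway_cache_misses_total counter
ai_gateway_cache_misses_total 1230
"""
c = parse_counters(sample)
hit_rate = c["ai_gateway_cache_hits_total"] / (
    c["ai_gateway_cache_hits_total"] + c["ai_gateway_cache_misses_total"]
)
print(f"cache hit rate: {hit_rate:.1%}")  # 20.2% for the sample values
```

In production you would scrape these counters with Prometheus and compute rates there; this is only a local sanity check.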