
Horizontal Pod Autoscaling (HPA)

Enable

```yaml
# values.yaml
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70
  targetMemoryUtilization: 80
```

Per-Service Behavior

| Service           | Min | Max | Scale-Up                         | Scale-Down                        | Notes                        |
|-------------------|-----|-----|----------------------------------|-----------------------------------|------------------------------|
| AI Gateway        | 2   | 10  | +3 pods/60s, 30s stabilization   | -1 pod/120s, 5min stabilization   | Fast scale-up for LLM bursts |
| Agent Gateway     | 2   | 8   | +2 pods/60s                      | -1 pod/120s, 5min stabilization   | CPU target only              |
| Control Plane API | 2   | 4   | Default                          | -1 pod, 10min stabilization       | Low traffic, HA only         |

Note: APISIX HPA will be added once APISIX deployment is managed via Helm chart.

Scale-Up Strategy

  • AI Gateway: Aggressive scale-up (30s window, +3 pods) because LLM traffic bursts are common. Memory-based scaling catches high concurrent connection scenarios.
  • Agent Gateway: CPU-only scaling. Agent sessions are long-lived, so memory is more predictable.
  • CP API: Conservative scaling. Traffic volume is low; the extra replicas exist for HA, not throughput.
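As a concrete illustration, the AI Gateway row of the table maps onto the HPA v2 `behavior` and `metrics` fields roughly as below. This is a sketch of the rendered resource, not the chart's actual template output; the Deployment name `ai-gateway` is an assumption.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-gateway-hpa
  namespace: gatez
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-gateway          # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # targetCPUUtilization from values.yaml
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # targetMemoryUtilization from values.yaml
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # short window for bursty LLM traffic
      policies:
        - type: Pods
          value: 3                     # add up to 3 pods...
          periodSeconds: 60            # ...per 60-second period
```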

Scale-Down Strategy

All services use a 5-minute (300s) stabilization window for scale-down to prevent thrashing. Control Plane API uses 10 minutes because its startup cost is higher (Keycloak and Redis connections must be re-established).
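In HPA v2 terms, the shared scale-down policy looks roughly like the fragment below (exact field placement depends on the chart templates):

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # 600 for Control Plane API
    policies:
      - type: Pods
        value: 1                      # remove at most 1 pod...
        periodSeconds: 120            # ...per 120-second period
```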

Prerequisites

  • Kubernetes Metrics Server must be installed (kubectl top pods should work)
  • For custom metrics (e.g. HTTP request rate), install the Prometheus Adapter:

    ```bash
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install prometheus-adapter prometheus-community/prometheus-adapter
    ```
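Once the adapter is serving the custom metrics API, a request-rate target can be expressed as a `Pods` metric. The metric name below (`http_requests_per_second`) is hypothetical and depends on the adapter's rule configuration:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical; must match a Prometheus Adapter rule
      target:
        type: AverageValue
        averageValue: "100"              # target requests/sec per pod
```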

Monitoring

```bash
# Check HPA status
kubectl get hpa -n gatez

# Watch scaling events
kubectl describe hpa ai-gateway-hpa -n gatez
```
