
Horizontal Pod Autoscaling (HPA)

Enable

```yaml
# values.yaml
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70
  targetMemoryUtilization: 80
```

Per-Service Behavior

| Service           | Min | Max | Scale-Up                         | Scale-Down                        | Notes                        |
|-------------------|-----|-----|----------------------------------|-----------------------------------|------------------------------|
| AI Gateway        | 2   | 10  | +3 pods/60s, 30s stabilization   | -1 pod/120s, 5min stabilization   | Fast scale-up for LLM bursts |
| Agent Gateway     | 2   | 8   | +2 pods/60s                      | -1 pod/120s, 5min stabilization   | CPU target only              |
| Control Plane API | 2   | 4   | Default                          | -1 pod, 10min stabilization       | Low traffic, HA only         |

Note: APISIX HPA will be added once APISIX deployment is managed via Helm chart.

Scale-Up Strategy

  • AI Gateway: Aggressive scale-up (30s window, +3 pods) because LLM traffic bursts are common. Memory-based scaling catches high concurrent connection scenarios.
  • Agent Gateway: CPU-only scaling. Agent sessions are long-lived, so memory is more predictable.
  • CP API: Conservative scaling. Traffic volume is low; the extra replicas exist for HA, not throughput.
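As a concrete illustration, the AI Gateway row of the table maps onto the HPA v2 `behavior` and `metrics` fields roughly as below. This is a sketch of the rendered resource, not the chart's actual template output; the Deployment name `ai-gateway` is an assumption.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-gateway-hpa
  namespace: gatez
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-gateway          # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # targetCPUUtilization from values.yaml
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # targetMemoryUtilization from values.yaml
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30   # short window for bursty LLM traffic
      policies:
        - type: Pods
          value: 3                     # add up to 3 pods...
          periodSeconds: 60            # ...per 60-second period
```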

Scale-Down Strategy

All services use a 5-minute (300s) stabilization window for scale-down to prevent thrashing. Control Plane API uses 10 minutes because its startup cost is higher (Keycloak and Redis connections must be re-established).
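In HPA v2 terms, the shared scale-down policy looks roughly like the fragment below (exact field placement depends on the chart templates):

```yaml
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # 600 for Control Plane API
    policies:
      - type: Pods
        value: 1                      # remove at most 1 pod...
        periodSeconds: 120            # ...per 120-second period
```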

Prerequisites

  • Kubernetes Metrics Server must be installed (kubectl top pods should work)
  • For custom metrics (e.g. HTTP request rate), install the Prometheus Adapter:

    ```bash
    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm install prometheus-adapter prometheus-community/prometheus-adapter
    ```
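Once the adapter is serving the custom metrics API, a request-rate target can be expressed as a `Pods` metric. The metric name below (`http_requests_per_second`) is hypothetical and depends on the adapter's rule configuration:

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical; must match a Prometheus Adapter rule
      target:
        type: AverageValue
        averageValue: "100"              # target requests/sec per pod
```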

Monitoring

```bash
# Check HPA status
kubectl get hpa -n gatez

# Watch scaling events
kubectl describe hpa ai-gateway-hpa -n gatez
```
