# Horizontal Pod Autoscaling (HPA)

## Enable
```yaml
# values.yaml
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilization: 70
  targetMemoryUtilization: 80
```

## Per-Service Behavior
| Service | Min | Max | Scale-Up | Scale-Down | Notes |
|---|---|---|---|---|---|
| AI Gateway | 2 | 10 | +3 pods/60s, 30s stabilization | -1 pod/120s, 5min stabilization | Fast scale-up for LLM bursts |
| Agent Gateway | 2 | 8 | +2 pods/60s | -1 pod/120s, 5min stabilization | CPU target only |
| Control Plane API | 2 | 4 | Default | -1 pod, 10min stabilization | Low traffic, HA only |
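As a rough illustration, the AI Gateway row in the table corresponds to an `autoscaling/v2` manifest along these lines (a sketch: the HPA name and namespace come from the monitoring commands in this document, but the target Deployment name `ai-gateway` is an assumption):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-gateway-hpa
  namespace: gatez
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-gateway  # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # CPU and memory targets from the chart's autoscaling values
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```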
Note: APISIX HPA will be added once the APISIX deployment is managed via the Helm chart.
## Scale-Up Strategy
- AI Gateway: Aggressive scale-up (30s window, +3 pods) because LLM traffic bursts are common. Memory-based scaling catches high concurrent connection scenarios.
- Agent Gateway: CPU-only scaling. Agent sessions are long-lived, so memory is more predictable.
- CP API: Conservative scaling. Low traffic volume, scale is for HA not throughput.
## Scale-Down Strategy
All services use 5-minute (300s) stabilization windows for scale-down to prevent thrashing. Control Plane API uses 10 minutes because startup cost is higher (Keycloak connection, Redis connection).
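In `autoscaling/v2` terms, the AI Gateway scale-up and scale-down policies described above map to a `behavior` block roughly like this (a sketch; the numbers are taken from the per-service table):

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30   # short window: react quickly to LLM bursts
    policies:
      - type: Pods
        value: 3                     # add up to 3 pods
        periodSeconds: 60            # per 60s window
  scaleDown:
    stabilizationWindowSeconds: 300  # 5 min window to prevent thrashing
    policies:
      - type: Pods
        value: 1                     # remove at most 1 pod
        periodSeconds: 120           # per 120s window
```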
## Prerequisites
- Kubernetes Metrics Server must be installed (`kubectl top pods` should work)
- For custom metrics (HTTP request rate), install the Prometheus Adapter:

```bash
helm install prometheus-adapter prometheus-community/prometheus-adapter
```
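Once the adapter exposes a request-rate metric, it can drive an HPA through a `Pods`-type metric. A minimal sketch, assuming the adapter publishes a per-pod metric named `http_requests_per_second` (the actual name and target value depend on your adapter rules and traffic profile):

```yaml
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # assumed adapter metric name
      target:
        type: AverageValue
        averageValue: "100"  # scale out when pods average >100 req/s
```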
## Monitoring
```bash
# Check HPA status
kubectl get hpa -n gatez

# Watch scaling events
kubectl describe hpa ai-gateway-hpa -n gatez
```