
Troubleshooting Guide

This guide covers common issues encountered during Gatez deployment and operation, with root causes and fixes derived from real production debugging.


Portal Issues

"Could not load configuration" on tenant drawer tabs

Symptoms: Prompt Config, Usage, Providers, or Guards tabs show "Could not load configuration. Check your connection or try again."

Root cause: The page reads a hardcoded API base URL (e.g. VITE_CP_API_URL) that falls back to localhost:4001, which is unreachable in production because ports are not exposed.

Fix: All API calls must use apiFetch() from @gatez/api-client with relative paths. The apiFetch function prepends VITE_API_URL automatically.

typescript
// Wrong
const resp = await fetch(`${CP_API}/api/tenants/${id}/prompt-config`);

// Correct
const resp = await apiFetch(`/api/tenants/${id}/prompt-config`);
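The prepending behaviour can be sketched as a pure helper (illustrative only; the real @gatez/api-client implementation may differ, and buildApiUrl is a hypothetical name):

```typescript
// Sketch of how a relative path is joined to VITE_API_URL.
// Trailing slashes on the base are stripped to avoid double slashes.
export function buildApiUrl(base: string, path: string): string {
  return base.replace(/\/+$/, '') + path;
}
```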

Portal opens without Keycloak login redirect

Symptoms: Developer or operator portal loads directly without redirecting to Keycloak login page.

Root cause: VITE_KEYCLOAK_URL is empty in the JS bundle. This happens when the portal Docker image was built without passing the build arg, or when BuildKit cached an old layer.

Verify:

bash
# Check if keycloak URL is in the bundle
docker exec operator-portal grep -c "keycloak" /usr/share/nginx/html/assets/*.js
# Should return > 0

Fix: Rebuild portal with explicit build args:

bash
docker build --no-cache \
  --build-arg VITE_KEYCLOAK_URL=https://keycloak.yourdomain.com \
  --build-arg VITE_KEYCLOAK_REALM=gateway \
  --build-arg VITE_API_URL=https://api.yourdomain.com \
  -f layers/control-plane/apps/operator/Dockerfile \
  -t gatez-operator-portal:latest \
  layers/control-plane/

Dark mode resets on page refresh

Symptoms: Theme reverts to light mode after refreshing.

Root cause: Theme preference was not persisted to localStorage.

Fix: Already fixed in the codebase. The theme is persisted with localStorage.setItem('gatez-theme', theme) and restored on app startup before React renders.
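The restore step reduces to a small pure function (a sketch; resolveTheme is a hypothetical name, and the stored value would come from localStorage.getItem('gatez-theme')):

```typescript
type Theme = 'light' | 'dark';

// Map the raw persisted value to a theme, falling back to the
// light default for null or any unexpected string.
export function resolveTheme(stored: string | null): Theme {
  return stored === 'dark' ? 'dark' : 'light';
}
```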

Portal shows "unhealthy" in Docker

Symptoms: docker ps shows operator-portal or developer-portal as (unhealthy).

Root cause: The Docker healthcheck runs wget -qO- http://localhost:80/ inside the container and fails there, even though the portal serves correctly via Caddy externally.

Impact: Cosmetic only. The portal works fine when accessed via its external URL.

Verify:

bash
curl -s -o /dev/null -w "%{http_code}" https://operator.yourdomain.com
# Should return 200

APISIX Issues

APISIX returns 400 when creating a route with a plugin

Symptoms: PUT /apisix/admin/routes/{id} returns HTTP 400.

Root cause: The plugin configuration doesn't match the plugin's JSON schema. Every APISIX plugin validates its config on route creation.

Common examples:

  • limit-count requires count and time_window (not just {})
  • ip-restriction requires either whitelist or blacklist
  • traffic-split requires rules with weighted_upstreams

Fix: Provide valid config. See APISIX Plugin Reference for each plugin's required fields.
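For instance, a limit-count config that passes schema validation, as the plugins section of a route body (the values are illustrative):

```json
{
  "plugins": {
    "limit-count": {
      "count": 100,
      "time_window": 60,
      "rejected_code": 429
    }
  }
}
```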

Custom plugin not loadable on routes

Symptoms: Route creation with clickhouse-logger, group-auth, or response-pii-scrub returns 400 even with valid config.

Root cause: The extra_lua_path in apisix.yaml doesn't match the Docker volume mount path for custom plugins.

Verify:

bash
# Check where plugins are mounted
docker inspect apisix --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' | grep plugin

# Check extra_lua_path in config
docker exec apisix cat /usr/local/apisix/conf/config.yaml | grep extra_lua_path

Fix: Ensure the volume mount paths match the extra_lua_path pattern in apisix.yaml.
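A sketch of a matching pair (the paths are illustrative, not the repo's actual layout — with this setup, require("plugins.clickhouse-logger") resolves under the mount point):

```yaml
# docker-compose.yml: mount the plugin sources into the container
#   volumes:
#     - ./plugins:/usr/local/apisix/custom/plugins
#
# apisix config.yaml: the Lua search path must cover that mount
apisix:
  extra_lua_path: "/usr/local/apisix/custom/?.lua"
```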

APISIX admin API not reachable in production

Symptoms: curl localhost:9180/apisix/admin/routes returns "Connection refused".

Root cause: The production overlay (docker-compose.prod.yml) strips all ports with ports: !reset []. APISIX admin is only accessible within the Docker network.

Fix: Use docker exec to access from inside the network:

bash
docker exec apisix curl -s http://localhost:9180/apisix/admin/routes \
  -H "X-API-KEY: your-admin-key"

Or access via the CP API which queries APISIX internally.


Deploy Issues

Container name conflict ("already in use")

Symptoms: Deploy fails with "The container name /operator-portal is already in use".

Root cause: A manually-created container (docker run --name operator-portal) conflicts with a compose-managed container of the same name.

Fix:

bash
docker rm -f operator-portal developer-portal
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

deploy_only guard fails

Symptoms: "No successful CI run found for SHA" when using deploy_only=true.

Root cause: The auto-fix agent pushed a commit after the last CI passed, creating a new SHA with no green CI.

Fix: Run a full pipeline (not deploy_only):

bash
gh workflow run ci.yml --ref main
# Not: gh workflow run ci.yml --ref main -f deploy_only=true

GHCR pull returns 403

Symptoms: "403 Forbidden" when pulling images from ghcr.io on the VM.

Root cause: Private repository images require authentication. The GITHUB_TOKEN in CI may not have read:packages scope when used from the VM.

Impact: The deploy script falls back to building locally (docker compose build), which is slower but works.

Fix for faster deploys: Create a GitHub Personal Access Token with read:packages scope and store as a secret.

Smoke test fails after deploy

Symptoms: Smoke test reports "APISIX admin API not reachable" even though services are running.

Root cause: The smoke test script (scripts/smoke-test.sh) uses localhost:9180, which is not exposed in production. Alternatively, --force-recreate restarted APISIX and it needs time to reconnect to etcd.

Fix: Production deploy smoke tests should use external URLs:

bash
curl -s -o /dev/null -w "%{http_code}\n" https://api.yourdomain.com/health
curl -s -o /dev/null -w "%{http_code}\n" https://operator.yourdomain.com
# Both should return 200

Data Issues

ClickHouse analytics charts show no data

Symptoms: Analytics page shows empty charts despite the platform being operational.

Root cause: The clickhouse-logger plugin is not attached to APISIX routes, so no request data flows to ClickHouse.

Verify:

bash
# Check if routes have clickhouse-logger
docker exec apisix curl -s http://localhost:9180/apisix/admin/routes \
  -H "X-API-KEY: your-key" | grep clickhouse-logger

# Check ClickHouse row count
docker exec clickhouse clickhouse-client --query "SELECT count() FROM gateway.request_log_raw"

Fix: The CP API auto-injects clickhouse-logger on new routes. For existing routes, re-create them or seed demo data:

bash
bash scripts/seed-demo-data.sh

Keycloak token rejected by CP API

Symptoms: API calls return "Invalid or expired token" even with a valid Keycloak session.

Root cause: The CP API validates JWT signatures via JWKS. If the JWKS cache is stale or the Keycloak realm doesn't have the expected client, validation fails.

Verify:

bash
# Check CP API logs for JWKS refresh
docker logs control-plane-api --tail 20 | grep JWKS

# Test with a real Keycloak token
TOKEN=$(curl -s -X POST "http://gw-keycloak:8080/realms/gateway/protocol/openid-connect/token" \
  -d "client_id=operator-portal" -d "username=operator" -d "password=demo123" \
  -d "grant_type=password" | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

curl -H "Authorization: Bearer $TOKEN" http://localhost:4001/api/tenants

Fix: Verify KEYCLOAK_JWKS_URL environment variable points to the correct realm. CP API logs should show "JWKS cache refreshed, key_count: 2".
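For reference, a compose environment sketch pointing at the gateway realm's standard Keycloak JWKS endpoint (the gw-keycloak hostname is the one used in this guide's examples):

```yaml
services:
  control-plane-api:
    environment:
      KEYCLOAK_JWKS_URL: "http://gw-keycloak:8080/realms/gateway/protocol/openid-connect/certs"
```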


CI/CD Issues

TypeScript lint fails in CI

Symptoms: CI job "TypeScript Lint" fails with import.meta.env errors.

Root cause: tsconfig.base.json is missing "types": ["vite/client"]. Vite's client types augment ImportMeta with typings for env vars.

Fix: Already fixed. Ensure tsconfig.base.json includes:

json
{
  "compilerOptions": {
    "types": ["vite/client"]
  }
}

Security check fails with ".claude/ files committed"

Symptoms: CI security check reports ".claude/ files are committed!"

Root cause: Someone used git add -f .claude/ to force-add files that are in .gitignore.

Fix:

bash
git rm --cached .claude/product/roadmap.md  # remove from tracking, keep local
git commit -m "Remove .claude from git tracking"

Pre-push hook fails with uncommitted files

Symptoms: git push rejected with "N uncommitted file(s) in source directories".

Root cause: The auto-fix code review agent modified files in the background after your commit.

Fix: Stage and commit the additional changes:

bash
git add <files listed>
git commit -m "fix: auto-fix agent improvements"
git push

Performance Issues

Keycloak slow to start (30-90 seconds)

Symptoms: Keycloak healthcheck takes 60+ seconds. Tests that depend on Keycloak fail with timeouts.

Root cause: Keycloak JVM startup + realm import takes 30-90 seconds, especially on CI runners with limited CPU.

Fix: Use retry logic when checking Keycloak:

bash
# Retry up to 90 seconds
for i in $(seq 1 18); do
  curl -sf http://localhost:8081/realms/gateway && break
  sleep 5
done
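The loop above can be wrapped in a reusable helper (a sketch; the Keycloak URL in the comment is the one from this section):

```shell
# retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or attempts are exhausted.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for i in $(seq 1 "$attempts"); do
    if "$@"; then return 0; fi
    sleep "$delay"
  done
  return 1
}

# Example: wait up to 90s (18 x 5s) for the realm endpoint
# retry 18 5 curl -sf http://localhost:8081/realms/gateway
```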

Docker Compose stack uses too much memory

Symptoms: Services crash or become unresponsive, especially Keycloak and ClickHouse.

Root cause: The full stack (18 containers) requires ~8-10 GB RAM. Docker Desktop default may be 4 GB.

Fix: Increase Docker Desktop memory to 16 GB. Or reduce Keycloak heap:

yaml
# docker-compose.yml (under the Keycloak service)
environment:
  JAVA_OPTS_APPEND: "-Xms256m -Xmx512m"
