Troubleshooting Guide

This guide covers common issues encountered during Gatez deployment and operation, with root causes and fixes derived from real production debugging.

Portal Issues

"Could not load configuration" on tenant drawer tabs

Symptoms: Prompt Config, Usage, Providers, or Guards tabs show "Could not load configuration. Check your connection or try again."

Root cause: The page uses a hardcoded API URL (like VITE_CP_API_URL) that falls back to localhost:4001, which is unreachable in production (ports are not exposed).

Fix: All API calls must use apiFetch() from @gatez/api-client with relative paths. The apiFetch function prepends VITE_API_URL automatically.

typescript

// Wrong
const resp = await fetch(`${CP_API}/api/tenants/${id}/prompt-config`);

// Correct
const resp = await apiFetch(`/api/tenants/${id}/prompt-config`);

Symptoms: Developer or operator portal loads directly without redirecting to Zitadel login page.

Root cause: VITE_ZITADEL_URL is empty in the JS bundle. This happens when the portal Docker image was built without passing the build arg, or when BuildKit cached an old layer.

Verify:

bash

# Check if zitadel URL is in the bundle
docker exec operator-portal grep -c "zitadel" /usr/share/nginx/html/assets/*.js
# Should return > 0

Fix: Rebuild portal with explicit build args:

bash

docker build --no-cache \
  --build-arg VITE_ZITADEL_URL=https://zitadel.yourdomain.com \
  --build-arg VITE_ZITADEL_ORG=gateway \
  --build-arg VITE_API_URL=https://api.yourdomain.com \
  -f layers/control-plane/apps/operator/Dockerfile \
  -t gatez-operator-portal:latest \
  layers/control-plane/

Dark mode resets on page refresh

Symptoms: Theme reverts to light mode after refreshing.

Root cause: Theme preference was not persisted to localStorage.

Fix: Already fixed in the codebase. The theme is saved to localStorage.setItem('gatez-theme', theme) and restored on app startup before React renders.

Portal shows "unhealthy" in Docker

Symptoms: docker ps shows operator-portal or developer-portal as (unhealthy).

Root cause: The Docker healthcheck uses wget -qO- http://localhost:80/ but the portal serves correctly via Caddy externally.

Impact: Cosmetic only. The portal works fine when accessed via its external URL.

Verify:

bash

curl -s -o /dev/null -w "%{http_code}" https://operator.yourdomain.com
# Should return 200

APISIX Issues

APISIX returns 400 when creating a route with a plugin

Symptoms: PUT /apisix/admin/routes/{id} returns HTTP 400.

Root cause: The plugin configuration doesn't match the plugin's JSON schema. Every APISIX plugin validates its config on route creation.

Common examples:

limit-count requires count and time_window (not just {})
ip-restriction requires either whitelist or blacklist
traffic-split requires rules with weighted_upstreams

Fix: Provide valid config. See APISIX Plugin Reference for each plugin's required fields.

Custom plugin not loadable on routes

Symptoms: Route creation with clickhouse-logger, group-auth, or response-pii-scrub returns 400 even with valid config.

Root cause: The extra_lua_path in apisix.yaml doesn't match the Docker volume mount path for custom plugins.

Verify:

bash

# Check where plugins are mounted
docker inspect apisix --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{"\n"}}{{end}}' | grep plugin

# Check extra_lua_path in config
docker exec apisix cat /usr/local/apisix/conf/config.yaml | grep extra_lua_path

Fix: Ensure the volume mount paths match the extra_lua_path pattern in apisix.yaml.

APISIX admin API not reachable in production

Symptoms: curl localhost:9180/apisix/admin/routes returns "Connection refused".

Root cause: The production overlay (docker-compose.prod.yml) strips all ports with ports: !reset []. APISIX admin is only accessible within the Docker network.

Fix: Use docker exec to access from inside the network:

bash

docker exec apisix curl -s http://localhost:9180/apisix/admin/routes \
  -H "X-API-KEY: your-admin-key"

Or access via the CP API which queries APISIX internally.

Deploy Issues

Container name conflict ("already in use")

Symptoms: Deploy fails with "The container name /operator-portal is already in use".

Root cause: A manually-created container (docker run --name operator-portal) conflicts with a compose-managed container of the same name.

Fix:

bash

docker rm -f operator-portal developer-portal
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

deploy_only guard fails

Symptoms: "No successful CI run found for SHA" when using deploy_only=true.

Root cause: The auto-fix agent pushed a commit after the last CI passed, creating a new SHA with no green CI.

Fix: Run a full pipeline (not deploy_only):

bash

gh workflow run ci.yml --ref main
# Not: gh workflow run ci.yml --ref main -f deploy_only=true

GHCR pull returns 403

Symptoms: "403 Forbidden" when pulling images from ghcr.io on the VM.

Root cause: Private repository images require authentication. The GITHUB_TOKEN in CI may not have read:packages scope when used from the VM.

Impact: The deploy script falls back to building locally (docker compose build), which is slower but works.

Fix for faster deploys: Create a GitHub Personal Access Token with read:packages scope and store as a secret.

Smoke test fails after deploy

Symptoms: Smoke test reports "APISIX admin API not reachable" even though services are running.

Root cause: The smoke test script (scripts/smoke-test.sh) uses localhost:9180 which is not exposed in production. Or --force-recreate restarted APISIX and it needs time to reconnect to etcd.

Fix: Production deploy smoke tests should use external URLs:

bash

curl -s https://api.yourdomain.com/health
curl -s https://operator.yourdomain.com

Data Issues

ClickHouse analytics charts show no data

Symptoms: Analytics page shows empty charts despite the platform being operational.

Root cause: The clickhouse-logger plugin is not attached to APISIX routes, so no request data flows to ClickHouse.

Verify:

bash

# Check if routes have clickhouse-logger
docker exec apisix curl -s http://localhost:9180/apisix/admin/routes \
  -H "X-API-KEY: your-key" | grep clickhouse-logger

# Check ClickHouse row count
docker exec clickhouse clickhouse-client --query "SELECT count() FROM gateway.request_log_raw"

Fix: The CP API auto-injects clickhouse-logger on new routes. For existing routes, re-create them or seed demo data:

bash

bash scripts/seed-demo-data.sh

Zitadel token rejected by CP API

Symptoms: API calls return "Invalid or expired token" even with a valid Zitadel session.

Root cause: The CP API validates JWT signatures via JWKS. If the JWKS cache is stale or the Zitadel organization doesn't have the expected client, validation fails.

Verify:

bash

# Check CP API logs for JWKS refresh
docker logs control-plane-api --tail 20 | grep JWKS

# Test with a real Zitadel token
TOKEN=$(curl -s -X POST "http://gw-zitadel:8080/oauth/v2/token" \
  -d "client_id=operator-portal" -d "username=operator" -d "password=demo123" \
  -d "grant_type=password" | python3 -c "import sys,json; print(json.load(sys.stdin)['access_token'])")

curl -H "Authorization: Bearer $TOKEN" http://localhost:4001/api/tenants

Fix: Verify ZITADEL_JWKS_URL environment variable points to the correct organization. CP API logs should show "JWKS cache refreshed, key_count: 2".

CI/CD Issues

TypeScript lint fails in CI

Symptoms: CI job "TypeScript Lint" fails with import.meta.env errors.

Root cause: tsconfig.base.json missing "types": ["vite/client"]. Vite augments ImportMeta with env var types.

Fix: Already fixed. Ensure tsconfig.base.json includes:

json

{
  "compilerOptions": {
    "types": ["vite/client"]
  }
}

Security check fails with ".claude/ files committed"

Symptoms: CI security check reports ".claude/ files are committed!"

Root cause: Someone used git add -f .claude/ to force-add files that are in .gitignore.

Fix:

bash

git rm --cached .claude/product/roadmap.md  # remove from tracking, keep local
git commit -m "Remove .claude from git tracking"

Pre-push hook fails with uncommitted files

Symptoms: git push rejected with "N uncommitted file(s) in source directories".

Root cause: The auto-fix code review agent modified files in the background after your commit.

Fix: Stage and commit the additional changes:

bash

git add <files listed>
git commit -m "fix: auto-fix agent improvements"
git push

Performance Issues

Zitadel slow to start (30-60 seconds)

Symptoms: Zitadel healthcheck takes 60+ seconds. Tests that depend on Zitadel fail with timeouts.

Root cause: Zitadel startup + organization initialization takes 30-60 seconds, especially on CI runners with limited CPU.

Fix: Use retry logic when checking Zitadel:

bash

# Retry up to 90 seconds
for i in $(seq 1 18); do
  curl -sf http://localhost:8085/debug/healthz && break
  sleep 5
done

Docker Compose stack uses too much memory

Symptoms: Services crash or become unresponsive, especially Zitadel and ClickHouse.

Root cause: The full stack (18 containers) requires ~8-10 GB RAM. Docker Desktop default may be 4 GB.

Fix: Increase Docker Desktop memory to 16 GB. Or reduce Zitadel resource limits:

yaml

# docker-compose.yml
deploy:
  resources:
    limits:
      memory: 512M

Troubleshooting Guide ​

Portal Issues ​

"Could not load configuration" on tenant drawer tabs ​

Portal opens without Zitadel login redirect ​

Dark mode resets on page refresh ​

Portal shows "unhealthy" in Docker ​

APISIX Issues ​

APISIX returns 400 when creating a route with a plugin ​

Custom plugin not loadable on routes ​

APISIX admin API not reachable in production ​

Deploy Issues ​

Container name conflict ("already in use") ​

deploy_only guard fails ​

GHCR pull returns 403 ​

Smoke test fails after deploy ​

Data Issues ​

ClickHouse analytics charts show no data ​

Zitadel token rejected by CP API ​

CI/CD Issues ​

TypeScript lint fails in CI ​

Security check fails with ".claude/ files committed" ​

Pre-push hook fails with uncommitted files ​

Performance Issues ​

Zitadel slow to start (30-60 seconds) ​

Docker Compose stack uses too much memory ​

Troubleshooting Guide

Portal Issues

"Could not load configuration" on tenant drawer tabs

Portal opens without Zitadel login redirect

Dark mode resets on page refresh

Portal shows "unhealthy" in Docker

APISIX Issues

APISIX returns 400 when creating a route with a plugin

Custom plugin not loadable on routes

APISIX admin API not reachable in production

Deploy Issues

Container name conflict ("already in use")

deploy_only guard fails

GHCR pull returns 403

Smoke test fails after deploy

Data Issues

ClickHouse analytics charts show no data

Zitadel token rejected by CP API

CI/CD Issues

TypeScript lint fails in CI

Security check fails with ".claude/ files committed"

Pre-push hook fails with uncommitted files

Performance Issues

Zitadel slow to start (30-60 seconds)

Docker Compose stack uses too much memory