KES Outbox E2E Runbook (15-min Onboarding)
Last updated: 2026-03-06
Purpose:
- Bring up full local KES outbox flow.
- Run smoke checks.
- Execute DLQ and poison replay workflows.
- Resolve common local failures quickly.
Target DoD:
- A new engineer can run the flow end-to-end in <= 15 minutes without external help.
0) Prerequisites
- Docker Desktop running.
- Node.js + npm installed.
- Repo opened at root (kvary.network).
Windows note:
- In PowerShell, do not use Linux-style inline env assignments (FOO=bar cmd).
- Prefer existing npm scripts (they already use cross-env where needed).
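For a one-off command that has no npm script, the environment can be passed through Node instead of relying on shell syntax. The following is a minimal sketch, not an existing repo script: runWithEnv and the FOO variable are illustrative names only.

```javascript
// Sketch: run a command with extra environment variables on any OS or shell.
// runWithEnv is a hypothetical helper, not part of the repo.
const { spawnSync } = require('node:child_process');

function runWithEnv(command, args, extraEnv) {
  // Merge the extra variables over the current environment.
  const result = spawnSync(command, args, {
    env: { ...process.env, ...extraEnv },
    encoding: 'utf8',
  });
  if (result.error) throw result.error;
  return { status: result.status, stdout: result.stdout, stderr: result.stderr };
}

// Same effect as `FOO=bar node -e "..."`, without shell-specific syntax.
const demo = runWithEnv(
  process.execPath,
  ['-e', 'process.stdout.write(process.env.FOO)'],
  { FOO: 'bar' }
);
console.log(demo.stdout); // bar
```

On Windows, npm is a .cmd shim, so launching npm itself this way may need the shell option of spawnSync; the existing npm scripts remain the simpler choice.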
1) Start (fresh local)
Run from repo root:
npm run db:up
npm run kafka:up
npm run monitoring:up
npm run migrate:all
Open a second terminal and start outbox relay:
npm run relay:kes-outbox:dev
Expected healthy endpoints:
curl -sS http://127.0.0.1:4100/health
curl -sS http://127.0.0.1:4020/health
curl -sS http://127.0.0.1:4060/health
curl -sS "http://127.0.0.1:9090/api/v1/query?query=kes_outbox_relay_up"
- auth and tenders health endpoints return {"ok":true}
- relay health returns {"ok":true,...}
- Prometheus query returns value 1 for kes_outbox_relay_up
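The four checks above can be automated. The sketch below assumes the health bodies are JSON with an ok field (as shown above) and that Prometheus answers in its standard instant-query format; the helper names (healthIsOk, relayIsUp, getText, checkAll) and the --run switch are illustrative, not existing repo code.

```javascript
// Sketch: interpret the health and Prometheus responses listed above.
const http = require('node:http');

// True when a health body is JSON with ok === true (extra fields are allowed).
function healthIsOk(bodyText) {
  try {
    return JSON.parse(bodyText).ok === true;
  } catch {
    return false;
  }
}

// True when an instant-query response has at least one sample and all equal 1.
function relayIsUp(bodyText) {
  let parsed;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return false;
  }
  if (!parsed || parsed.status !== 'success' || !parsed.data) return false;
  const samples = parsed.data.result;
  if (!Array.isArray(samples) || samples.length === 0) return false;
  // Each vector sample carries value: [unixTimestamp, "stringValue"].
  return samples.every((s) => Array.isArray(s.value) && Number(s.value[1]) === 1);
}

// Fetch a URL over plain HTTP and resolve with its body text.
function getText(url) {
  return new Promise((resolve, reject) => {
    http.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

async function checkAll() {
  const healthUrls = [
    'http://127.0.0.1:4100/health',
    'http://127.0.0.1:4020/health',
    'http://127.0.0.1:4060/health',
  ];
  const results = {};
  for (const url of healthUrls) {
    // A connection error counts as unhealthy rather than crashing the check.
    results[url] = await getText(url).then(healthIsOk, () => false);
  }
  const promUrl = 'http://127.0.0.1:9090/api/v1/query?query=kes_outbox_relay_up';
  results[promUrl] = await getText(promUrl).then(relayIsUp, () => false);
  return results;
}

// Only touch the network when explicitly asked, e.g. `node check-health.js --run`.
if (process.argv.includes('--run')) {
  checkAll().then((results) => {
    console.table(results);
    process.exitCode = Object.values(results).every(Boolean) ? 0 : 1;
  });
}
```

Treating connection errors as "not healthy" keeps the output to one pass/fail row per endpoint, which is easier to read during onboarding than a stack trace.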
2) Smoke
npm run tenders:outbox:live-smoke
Expected:
- JSON output with "ok": true
- relay dispatch delta > 0
Grafana:
- URL: http://localhost:3002
- Login: admin / admin
- Dashboard: KES Outbox Overview
3) Replay Operations
DLQ replay (consumer DLQ):
# Dry-run
npm --prefix services/svc-tenders run replay:kes-dlq -- --from-beginning --max-messages 50
# Execute
npm --prefix services/svc-tenders run replay:kes-dlq -- --execute --max-messages 50
Outbox poison replay:
# Dry-run
npm --prefix services/svc-tenders run replay:kes-outbox-poison -- --max-rows 50
# Execute
npm --prefix services/svc-tenders run replay:kes-outbox-poison -- --execute --max-rows 50
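Because the dry-run and execute variants differ only by flags, it is easy to run the wrong one. The sketch below builds the argument list from the commands above so that dry-run is the default; buildReplayArgs and the REPLAY table are hypothetical helpers, and only flags that appear in this runbook are used.

```javascript
// Sketch: build npm arguments for the replay commands, defaulting to dry-run.
// buildReplayArgs is a hypothetical helper; flags are copied from the runbook.
const REPLAY = {
  dlq: { script: 'replay:kes-dlq', limitFlag: '--max-messages' },
  poison: { script: 'replay:kes-outbox-poison', limitFlag: '--max-rows' },
};

function buildReplayArgs(kind, { execute = false, limit = 50 } = {}) {
  const spec = REPLAY[kind];
  if (!spec) throw new Error(`unknown replay kind: ${kind}`);
  if (!Number.isInteger(limit) || limit <= 0) {
    throw new Error('limit must be a positive integer');
  }
  const flags = [];
  if (execute) {
    flags.push('--execute');
  } else if (kind === 'dlq') {
    // The runbook's DLQ dry-run reads from the beginning; mirror that here.
    flags.push('--from-beginning');
  }
  flags.push(spec.limitFlag, String(limit));
  return ['--prefix', 'services/svc-tenders', 'run', spec.script, '--', ...flags];
}

// Print the commands instead of running them, so the output can be reviewed first.
console.log(['npm', ...buildReplayArgs('dlq')].join(' '));
console.log(['npm', ...buildReplayArgs('poison', { execute: true })].join(' '));
```

Requiring an explicit execute: true means a forgotten option results in a harmless dry-run rather than a real replay.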
4) Stop
- Stop dev:one terminal (Ctrl+C).
- Stop relay terminal (Ctrl+C).
- Stop infra:
npm run monitoring:down
npm run kafka:down
docker compose -f docker-compose.postgres.yml down
5) Common Failures
A) EADDRINUSE (port already in use)
Symptom:
listen EADDRINUSE ... :3000/:4001/:4010/:4020/:4060/:4100
Fix (free the affected port):
node scripts/free-port.js 3000
node scripts/free-port.js 4001
node scripts/free-port.js 4010
node scripts/free-port.js 4020
node scripts/free-port.js 4060
node scripts/free-port.js 4100
Then rerun the startup command.
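To see which ports are actually occupied before freeing them, a small probe can try to connect to each one. This is a sketch under the assumption that the services listen on 127.0.0.1; isListening, busyPorts, and the --run switch are illustrative names and do not describe how scripts/free-port.js works.

```javascript
// Sketch: report which of the runbook's ports already have a listener.
const net = require('node:net');

// Resolves true if something accepts TCP connections on host:port.
function isListening(port, host = '127.0.0.1', timeoutMs = 500) {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    const finish = (result) => {
      socket.destroy();
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => finish(true));
    socket.once('timeout', () => finish(false));
    socket.once('error', () => finish(false)); // e.g. ECONNREFUSED: nothing there
  });
}

// Probe all ports in parallel and keep the ones that answered.
async function busyPorts(ports) {
  const checks = await Promise.all(ports.map((p) => isListening(p)));
  return ports.filter((_, i) => checks[i]);
}

// Only probe when explicitly asked, e.g. `node probe-ports.js --run`.
if (process.argv.includes('--run')) {
  busyPorts([3000, 4001, 4010, 4020, 4060, 4100]).then((busy) => {
    console.log(busy.length ? `in use: ${busy.join(', ')}` : 'all ports free');
  });
}
```

The probe only detects listeners reachable on the given host, so a process bound solely to another interface would not be reported.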
B) Auth service down
Symptoms:
- frontend: repeated 401 on /api/v1/auth/me
- gateway: 502 on auth/oidc routes
Check:
curl -sS http://127.0.0.1:4100/health
Fix:
- Ensure dev:one is running.
- Or start auth only:
C) Migration missing (relation ... does not exist)
Symptoms:
- relay fails with relation "kes_outbox_events" does not exist
- service errors for missing tables
If only tenders schema is missing:
npm --prefix services/svc-tenders run migrate
D) Grafana shows No data
Check:
curl -sS http://127.0.0.1:4060/metrics
curl -sS "http://127.0.0.1:9090/api/v1/query?query=kes_outbox_relay_up"
Fix:
- Start relay (npm run relay:kes-outbox:dev)
- Confirm Prometheus is up (npm run monitoring:up)
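When Grafana is empty, it helps to know whether the relay is exporting the metric at all or whether Prometheus is simply not scraping it. The sketch below parses Prometheus text exposition output such as the /metrics response above. readMetric is a hypothetical helper, and it is an assumption that the relay exposes kes_outbox_relay_up directly on /metrics; the sample text is illustrative, not captured output.

```javascript
// Sketch: pull one metric's samples out of Prometheus text exposition output.
// Special values such as +Inf are not handled and would parse as NaN.
function readMetric(expositionText, metricName) {
  const samples = [];
  for (const rawLine of expositionText.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith('#')) continue; // skip HELP/TYPE comments
    // Metric name, optional {labels}, then the value (a timestamp may follow).
    const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/);
    if (match && match[1] === metricName) {
      samples.push({ labels: match[2] || '', value: Number(match[3]) });
    }
  }
  return samples;
}

// Illustrative sample only; real output comes from the /metrics endpoint.
const sample = [
  '# HELP kes_outbox_relay_up Relay liveness (illustrative)',
  '# TYPE kes_outbox_relay_up gauge',
  'kes_outbox_relay_up 1',
].join('\n');
console.log(readMetric(sample, 'kes_outbox_relay_up'));
```

If the metric shows up here but the Prometheus query returns nothing, the scrape configuration is the more likely culprit than the relay itself.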
6) 15-minute DoD Checklist
- [ ] db + kafka + monitoring started.
- [ ] migrate:all completed with no errors.
- [ ] dev:one running.
- [ ] relay:kes-outbox:dev running and /health is ok:true.
- [ ] tenders:outbox:live-smoke returns "ok": true.
- [ ] Grafana KES Outbox Overview shows relay/pending/dispatch metrics.
- [ ] Dry-run replay commands execute without crash.