Resilience and recovery
This page describes what the code does today. Aspirational items are noted as gaps. Where a known gap maps to a tracked issue, the issue number is included.
Topology
quack-on-demand runs as a single-instance manager process. It is designed to be safely restartable, not active-active. Running two managers against the same Postgres control-plane database is not safe today: both would attempt to reconcile the same node registry and race on database creation (tracked: #11 for v0.4).
Worker pools scale horizontally. A pool can contain any number of Quack nodes; the router distributes statements across all healthy nodes in the pool. Adding nodes increases query throughput without touching the manager.
Cold start and reconciliation
When the manager JVM exits and a supervisor restarts it (systemd, Kubernetes, or a manual rerun of run-jar.sh), the following sequence runs:
- State restored from Postgres. PoolSupervisor.restore() reads the normalized qodstate_tenant, qodstate_tenant_db, qodstate_pool, and qodstate_node tables (managed by Liquibase) and re-hydrates the registry into in-memory TrieMaps. The RBAC graph (qodstate_role, qodstate_role_permission, qodstate_group, qodstate_user_role, qodstate_user_group, qodstate_group_role, qodstate_pool_permission) is rebuilt into the per-session EffectiveSet on each connection.
- Existing Kubernetes pods adopted. KubernetesQuackBackend.discoverExisting() selects pods by the manager's label (managed-by=quack-on-demand) and re-binds them to the in-memory registry. A manager restart does not tear down running pods. Local mode does not adopt survivors (LocalQuackBackend.discoverExisting() returns List.empty), so on a local-mode restart the restored pool state references processes that no longer exist; the reconcile pass below respawns them.
- Reconciliation. PoolSupervisor.reconcile() compares the restored desired state against what the runtime backend reports as alive (PID + socket check for local, pod Ready condition for Kubernetes) and respawns any nodes that should be present but are not. The method is idempotent.
- Bootstrap re-seed. Main.scala re-runs the bootstrap sequence on every start. Each step is idempotent: the named tenant/tenant-db/pool are skipped when they already exist, the admin user upsert re-hashes the password, and the built-in admin role with its wildcard permission is a no-op on re-entry.
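The reconcile step can be pictured as a set difference between the desired nodes and the nodes the backend reports as alive. The following is a minimal sketch under that assumption, not the project's code: NodeId, Backend, FakeBackend, and ReconcileSketch are hypothetical names, and the real PoolSupervisor works against Postgres and a runtime backend.

```scala
// Hypothetical model of a reconcile pass; all names are illustrative only.
object ReconcileSketch {
  final case class NodeId(pool: String, index: Int)

  // Stand-in for a runtime backend (local processes or Kubernetes pods).
  trait Backend {
    def alive: Set[NodeId]      // nodes the backend reports as running
    def spawn(id: NodeId): Unit // start a missing node
  }

  // Respawn every desired node that the backend does not report as alive.
  // Returns the nodes that were respawned in this pass.
  def reconcile(desired: Set[NodeId], backend: Backend): Set[NodeId] = {
    val missing = desired.diff(backend.alive)
    missing.foreach(backend.spawn)
    missing
  }

  // In-memory backend used only to exercise the sketch.
  final class FakeBackend(initial: Set[NodeId]) extends Backend {
    private var running = initial
    def alive: Set[NodeId] = running
    def spawn(id: NodeId): Unit = running += id
  }

  def main(args: Array[String]): Unit = {
    val desired = Set(NodeId("default", 0), NodeId("default", 1), NodeId("default", 2))
    val backend = new FakeBackend(Set(NodeId("default", 0)))
    println(reconcile(desired, backend).size) // two nodes were missing
    println(reconcile(desired, backend).size) // a second pass finds nothing to do
  }
}
```

Running the pass twice shows why idempotence matters here: once the missing nodes are back, a repeated call respawns nothing.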
Typical cold-boot time on a development machine: roughly 5 s JVM start plus 1 s Liquibase schema diff plus 1 s reconcile plus about 3 s per respawned node. A 3-node pool is back in service in approximately 15 s. A first-ever boot adds another 1-2 s per tenant-db for CREATE DATABASE and DuckLake metadata table initialization.
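As a quick check of the arithmetic, the per-phase figures can be summed directly. This sketch assumes nodes respawn one after another, which the text does not state; under that assumption a 3-node pool comes to 16 s, in line with the approximate figure above.

```scala
object ColdBootEstimate {
  // Per-phase figures taken from the paragraph above (development machine).
  val jvmStartSec = 5
  val liquibaseSec = 1
  val reconcileSec = 1
  val perNodeSec = 3

  // Assumes serial respawn; parallel respawn would shorten the node term.
  def estimateSec(respawnedNodes: Int): Int =
    jvmStartSec + liquibaseSec + reconcileSec + perNodeSec * respawnedNodes

  def main(args: Array[String]): Unit =
    println(estimateSec(3)) // 5 + 1 + 1 + 3 * 3 = 16
}
```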
Health checks
HealthProbe runs a background fiber that pings each node's /ping endpoint on a fixed interval. The default interval is 5 seconds (healthCheckIntervalSec = 5 in application.conf, overridable via QOD_HEALTH_CHECK_INTERVAL_SEC). Each tick updates the NodeLoadTracker's healthy flag for the node. The router's pick() method excludes nodes where healthy = false.
When the ping function throws, the probe catches the exception and marks the node unhealthy rather than terminating the loop.
The first successful probe per node also runs CREATE SCHEMA IF NOT EXISTS <db>.<schema> so the pool's default schema exists before FlightSqlRouter.wrapWithDefaultSchema ever prepends a USE statement to client queries. Subsequent probes revert to plain SELECT 1.
Cold-boot quarantine gap. A newly spawned or restored node starts with healthy = true (the default in NodeLoad.empty). The node can therefore receive traffic before its first probe confirms it is reachable. The ~5 s window between spawn and the first successful probe is a known gap: a node that is slow to start can receive statements it cannot yet handle and return transient errors. A future improvement would initialize new nodes with healthy = false and flip to true only after a confirmed probe.
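The probe behavior and the proposed quarantine fix can be sketched together. This is an illustration only: Tracker, NodeLoad as defined here, and tick are hypothetical stand-ins for NodeLoadTracker and HealthProbe, and the startHealthy flag models the suggested change rather than current behavior.

```scala
import scala.collection.concurrent.TrieMap
import scala.util.Try

object HealthSketch {
  // Hypothetical per-node record; per the text, the real default is healthy = true.
  final case class NodeLoad(healthy: Boolean)

  final class Tracker(startHealthy: Boolean) {
    private val nodes = TrieMap.empty[String, NodeLoad]
    def register(id: String): Unit = nodes.put(id, NodeLoad(startHealthy))
    def setHealthy(id: String, ok: Boolean): Unit = nodes.put(id, NodeLoad(ok))
    // Candidates for routing: only nodes currently marked healthy.
    def routable: Set[String] = nodes.collect { case (id, n) if n.healthy => id }.toSet
  }

  // One probe tick: a ping that throws marks the node unhealthy instead of
  // propagating the exception and ending the loop.
  def tick(tracker: Tracker, ids: Iterable[String], ping: String => Unit): Unit =
    ids.foreach(id => tracker.setHealthy(id, Try(ping(id)).isSuccess))

  def main(args: Array[String]): Unit = {
    val tracker = new Tracker(startHealthy = false) // quarantined until probed
    List("a", "b").foreach(tracker.register)
    println(tracker.routable) // empty: nothing has passed a probe yet
    tick(tracker, List("a", "b"), id => if (id == "b") throw new RuntimeException("down"))
    println(tracker.routable) // only "a" passed its probe
  }
}
```

Constructing the tracker with startHealthy = true reproduces the gap described above: both nodes are routable before any probe has run.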
In-transaction node death
FlightSQL sessions that have issued BEGIN are pinned to a specific node for the duration of the transaction. If the pinned node disappears or returns a transient failure before COMMIT or ROLLBACK:
- FlightSqlRouter detects either a RoutingDecision.PinnedNodeGone result (the node is no longer in the snapshot) or a QuackResponse.Failed(QuackError.Transient, ...) response while txOpen = true.
- In both cases the router calls SessionRegistry.invalidatePin, which clears the pinned node and resets txOpen = false.
- The current statement returns an error to the client ("pinned node disappeared; transaction lost" or "transient failure inside transaction: <detail>").
- The client must reconnect and replay the transaction from BEGIN.
There is no transparent replay. quack-on-demand does not buffer or re-execute the in-flight transaction. Clients should handle pin-lost and no-node error strings via standard retry logic. Each occurrence is recorded in statements_total with the appropriate status label so it is visible on a metrics panel.
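The in-transaction path can be modeled as a small state transition. The sketch below is hypothetical: Session, Outcome, and routeInTx are illustrative names, and only the two error strings are taken from the text above.

```scala
object PinSketch {
  // Hypothetical session state: the node a transaction is pinned to, if any.
  final case class Session(pinned: Option[String], txOpen: Boolean)

  sealed trait Outcome
  final case class Ok(node: String) extends Outcome
  final case class Rejected(message: String) extends Outcome

  // Route one statement for a session with an open transaction.
  // `run` returns Left(detail) on a transient failure, Right(()) on success.
  def routeInTx(
      session: Session,
      snapshot: Set[String],
      run: String => Either[String, Unit]
  ): (Session, Outcome) = {
    // State left behind once the pin is invalidated.
    val cleared = Session(pinned = None, txOpen = false)
    session.pinned match {
      case Some(node) if snapshot.contains(node) =>
        run(node) match {
          case Right(_) => (session, Ok(node))
          case Left(detail) =>
            (cleared, Rejected(s"transient failure inside transaction: $detail"))
        }
      case _ =>
        (cleared, Rejected("pinned node disappeared; transaction lost"))
    }
  }

  def main(args: Array[String]): Unit = {
    val s = Session(Some("node-1"), txOpen = true)
    println(routeInTx(s, Set("node-2"), _ => Right(())))       // pinned node gone
    println(routeInTx(s, Set("node-1"), _ => Left("timeout"))) // transient failure
  }
}
```

In both failure branches the session comes back unpinned with txOpen = false, so the next statement is routed as if no transaction had been started.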
Outside a transaction. A transient failure on a statement that is not inside a BEGIN block triggers a single automatic retry on a different node (retryOnce in FlightSqlRouter). The excluded node is filtered from the snapshot for that retry pick. If the retry also fails, the error is returned to the client.
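A retry-once policy of this shape can be sketched as follows. The names and the "first node in the list" pick are assumptions made for brevity; the real router picks by load and health.

```scala
object RetryOnceSketch {
  // Run `exec` on one node; on a transient failure, retry once on a different
  // node with the failed node excluded from the pick.
  def retryOnce[A](
      snapshot: List[String],
      exec: String => Either[String, A]
  ): Either[String, A] =
    snapshot.headOption match {
      case None => Left("no node available")
      case Some(first) =>
        exec(first) match {
          case ok @ Right(_) => ok
          case Left(firstErr) =>
            snapshot.filterNot(_ == first).headOption match {
              case Some(second) => exec(second) // a second failure goes back to the client
              case None         => Left(firstErr)
            }
        }
    }

  def main(args: Array[String]): Unit = {
    val exec: String => Either[String, Int] =
      node => if (node == "bad") Left("transient") else Right(42)
    println(retryOnce(List("bad", "good"), exec)) // succeeds on the second node
    println(retryOnce(List("bad"), exec))         // no other node to retry on
  }
}
```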
What survives a restart and what does not
Durable (survives restart):
- All control-plane state: tenants, tenant-databases, pools, nodes, users, roles, groups, permissions. These live in Postgres qodstate_* tables and are restored by PoolSupervisor.restore() on every boot.
- Pool node topology. Kubernetes pods are adopted; local processes are respawned to match the stored desired state.
Lost on every restart:
| State | Location | Impact |
|---|---|---|
| Statement history | StatementHistoryStore - 256-entry in-memory ring buffer | The admin UI "Recent statements" panel resets. No post-mortem trail from before the crash. |
| Per-node EWMA latency and total served | NodeLoadTracker | Routing load data resets to zero. Traffic distributes evenly until the EWMA converges over the next few seconds of live traffic. |
| Per-node latency histogram (p50/p95/p99) | NodeLoadTracker latency ring (256-sample window) | UI latency widgets reset to zero. |
| FlightSQL sessions and session pins | SessionRegistry | Every client must reconnect. Any open transaction is implicitly rolled back at the Quack node level. |
| Admin UI session tokens | SessionTokenStore | Admin UI users must log in again. The static QOD_API_KEY continues to work. |
All of these recover through re-population from live traffic. None cause incorrect behavior; they only create gaps in operator-visible signal during and immediately after a restart.
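To illustrate how quickly a reset EWMA re-converges, here is a minimal sketch of the standard exponentially weighted update. The smoothing factor alpha = 0.3 and the 20 ms samples are assumptions for the example; the project's actual parameters are not stated on this page.

```scala
object EwmaSketch {
  // Standard EWMA update; alpha is an assumed smoothing factor.
  def update(previous: Double, sampleMs: Double, alpha: Double): Double =
    alpha * sampleMs + (1 - alpha) * previous

  def main(args: Array[String]): Unit = {
    // After a restart the average starts at zero and is fed live samples.
    val samples = List.fill(10)(20.0)
    val trace = samples.scanLeft(0.0)((avg, s) => update(avg, s, alpha = 0.3))
    trace.foreach(v => println(f"$v%.2f"))
  }
}
```

With a steady stream of identical samples the average closes 30% of the remaining gap on every statement, so the routing signal is usable again after a handful of requests per node.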
Failure and recovery matrix
| Failure | Detection | Manager behavior | Impact | Tracked gap |
|---|---|---|---|---|
| Quack node JVM crash | HealthProbe /ping tick (5 s default) plus PID check (local only) | Local: respawn via spawn-quack-node.sh. Kubernetes: kubelet restart, manager waits for pod Ready. | New traffic routes to other healthy nodes. Sessions pinned to the dead node are invalidated on next statement. | - |
| Manager JVM crash (OOM, panic) | Process supervisor (systemd, kubelet, manual rerun) | Cold restart: restore from Postgres, reconcile. | All FlightSQL sessions drop. Approximately 15 s to fully reconcile a 3-node pool. | Graceful shutdown (#2) |
| Postgres brief outage | Hikari throws on connection acquire | First state-changing request gets a 500. No automatic retry wrapper. | Read-only requests served from in-memory state (including cached EffectiveSets on live FlightSQL sessions) continue to work. Writes, new tenant-db creation, and new-session handshakes all fail. | Need retry wrapper (no issue yet) |
| Postgres down for minutes | Same as above | Manager enters degraded state: established FlightSQL sessions keep flowing but createPool, createTenantDb, setRole, and RBAC CRUD all fail. New connections cannot rebuild the EffectiveSet and bounce at handshake. | Existing FlightSQL traffic continues. | Same |
| Manager host loss (Kubernetes node evict) | kubelet | New pod scheduled; cold restart sequence runs on a different host. | Same as JVM crash. Set terminationGracePeriodSeconds to at least 30 s. | - |
| Network partition between manager and a node | HealthProbe flips healthy = false after one tick | Node excluded from Router.pick(). | Sessions pinned to that node are invalidated on next statement. Outside-transaction statements retry once on a different node. | - |
| Network partition between manager and all nodes | All nodes flip healthy = false | Every routing decision returns Unavailable("no node compatible"). FlightSQL responses become errors. | Total query outage until partition heals. The manager process itself does not crash. | - |
| FlightSQL edge crash (FlightProducerImpl exception) | The wrapping IO returns Left(throwable) | Main.scala logs the error and parks on IO.never. The JVM stays up but FlightSQL is dead. | Admin UI and /metrics continue working. FlightSQL is silently down. | Should exit non-zero so the supervisor restarts (no issue yet) |
| Disk full on manager host | Logback RollingFileAppender drops writes; JVM may OOM | Manager eventually crashes. | Same as manager JVM crash. | - |
| Disk full on a Quack node (Parquet write fails) | Node returns 5xx from /quack; adapter classifies as transient | Router tries a different node (retry-once outside tx; pin invalidation inside tx). | Reads continue. Writes fail until disk is cleared. | - |
| TLS cert expiry | First TLS handshake fails | Manager refuses new FlightSQL connections. | The auto-generated cert in certs/ has a 10-year validity. Only a concern for production deployments using externally issued certificates. | Rotate via cert-manager in Kubernetes. |
| Two managers against the same Postgres | Both restore the same state and both try to reconcile | Both attempt to spawn pods with the same node IDs (Kubernetes API returns 409 for the second create; local mode races on port allocation). DbAdmin.createDatabase for new tenant-dbs races: one wins, the other sees "database already exists". Not safe today. | Avoid. | Multi-manager HA (#11) |
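The retry wrapper named as a gap in the Postgres outage rows does not exist today. As a rough idea of what one could look like, here is a hypothetical sketch using bounded attempts with exponential backoff; withRetry and its parameters are illustrative and not part of the codebase.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RetryWrapperSketch {
  // Retry `op` up to maxAttempts times, doubling the delay after each failure.
  // `sleep` is injected so the sketch can be exercised without real waiting.
  @tailrec
  def withRetry[A](maxAttempts: Int, delayMs: Long, sleep: Long => Unit)(op: () => A): Try[A] =
    Try(op()) match {
      case s @ Success(_)                     => s
      case f @ Failure(_) if maxAttempts <= 1 => f
      case Failure(_) =>
        sleep(delayMs)
        withRetry(maxAttempts - 1, delayMs * 2, sleep)(op)
    }

  def main(args: Array[String]): Unit = {
    var calls = 0
    // Simulated state-changing call that fails twice before succeeding.
    val result = withRetry(maxAttempts = 3, delayMs = 100L, sleep = _ => ()) { () =>
      calls += 1
      if (calls < 3) throw new RuntimeException("connection refused") else "ok"
    }
    println(s"$result after $calls calls")
  }
}
```

A wrapper like this would only suit idempotent control-plane writes; anything that is not safe to repeat would need its own handling.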
Operational guidance
If you are running this in production today (single-manager plus Postgres):
- Run under a process supervisor that restarts the JVM on exit: systemd with Restart=always, a Kubernetes Deployment with restartPolicy: Always (the default), or Docker with restart: unless-stopped.
- Add Kubernetes readiness and liveness probes before exposing the manager to traffic:

      readinessProbe:
        httpGet: { path: /health, port: 20900 }
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 3
      livenessProbe:
        httpGet: { path: /health, port: 20900 }
        initialDelaySeconds: 30
        periodSeconds: 10
        failureThreshold: 5

  The /health endpoint returns OK after PoolSupervisor.restore() and reconcile() complete their first pass.
- Back up Postgres with point-in-time recovery. Everything that survives a manager restart lives there.
- Set terminationGracePeriodSeconds: 60 so in-flight FlightSQL queries have time to complete before the JVM is killed (useful even without a graceful shutdown handler).
- Monitor the statements_total{status!="ok"} rate. A spike in transient, no_node, or pin_lost is the leading indicator of node trouble. A spike in permanent typically means client-side SQL errors.
- Set up an external /health watcher independent of the JVM (for example, a Prometheus probe_success check). Basic reachability is the first thing to establish when investigating an outage.
- Design FlightSQL clients with retry logic. ADBC includes it; JDBC pools usually do; raw gRPC code needs explicit handling. A manager restart is the most common interruption clients will encounter.
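For clients that need explicit handling, the replay-from-BEGIN pattern might look like the sketch below. Conn, runTx, and the substring matching on error messages are assumptions for illustration; a real client would go through ADBC, JDBC, or gRPC and should match on whatever error surface that driver exposes.

```scala
object ClientReplaySketch {
  // Hypothetical connection abstraction standing in for a real driver.
  trait Conn { def execute(sql: String): Unit }

  // Substrings of the router's error messages that signal a lost transaction.
  // Matching on message text is an assumption of this sketch.
  private val retryable =
    List("pinned node disappeared", "transient failure inside transaction")

  def isRetryable(e: Throwable): Boolean =
    Option(e.getMessage).exists(m => retryable.exists(s => m.contains(s)))

  // Run the whole transaction on a fresh connection; on a retryable error,
  // reconnect and replay from BEGIN while attempts remain.
  def runTx(connect: () => Conn, statements: List[String], attemptsLeft: Int): Unit = {
    val conn = connect()
    try {
      conn.execute("BEGIN")
      statements.foreach(s => conn.execute(s))
      conn.execute("COMMIT")
    } catch {
      case e: RuntimeException if attemptsLeft > 1 && isRetryable(e) =>
        runTx(connect, statements, attemptsLeft - 1)
    }
  }

  def main(args: Array[String]): Unit = {
    var connections = 0
    // Fake driver: the first connection loses its pinned node mid-transaction.
    val connect: () => Conn = () => {
      connections += 1
      val failing = connections == 1
      new Conn {
        def execute(sql: String): Unit =
          if (failing && sql.startsWith("INSERT"))
            throw new RuntimeException("pinned node disappeared; transaction lost")
      }
    }
    runTx(connect, List("INSERT INTO t VALUES (1)"), attemptsLeft = 3)
    println(s"connections used: $connections")
  }
}
```

Replay is only safe when the transaction body can be re-executed from scratch; the client, not the manager, has to hold the statements needed to do so.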