Observability

The manager collects metrics through Micrometer and routes them to exactly one sink per process. The sink is selected by quack-on-demand.metrics.sink (QOD_METRICS_SINK), one of prometheus, aws, azure, gcp, or none. For the full list of emitted series and their labels, see the Metrics reference.
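Because a single variable picks the sink, a launcher script can check its value before starting the manager. The following is a minimal sketch and not part of Quack-on-Demand: the function name qod_validate_sink is hypothetical, and treating an unset or empty variable as prometheus is an assumption based on prometheus being the documented default.

```shell
# Hypothetical pre-flight helper; not shipped with Quack-on-Demand.
# Succeeds only for the five sink names listed above.
qod_validate_sink() {
  case "$1" in
    prometheus|aws|azure|gcp|none) return 0 ;;
    *) echo "unsupported QOD_METRICS_SINK: '$1'" >&2; return 1 ;;
  esac
}

# Assumption: an unset or empty variable falls back to the default, prometheus.
if qod_validate_sink "${QOD_METRICS_SINK:-prometheus}"; then
  echo "sink ok: ${QOD_METRICS_SINK:-prometheus}"
fi
```

A typo such as QOD_METRICS_SINK=promethues then fails in the launcher with a clear message instead of surfacing later at manager startup.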

[Figure: Quack-on-Demand operator dashboard in Grafana]

Prometheus pull (the default)

With sink = prometheus (the default), the manager exposes a scrape endpoint. The endpoint is unauthenticated, under the same policy as /health:

GET http://<host>:20900/metrics
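To confirm the endpoint is serving data, fetch it and count the samples. The sketch below assumes the manager is reachable on localhost:20900 and that curl and awk are installed. count_samples is a hypothetical helper; it relies only on the Prometheus text format, in which lines starting with # are HELP/TYPE comments. The series names in the demonstration are made up.

```shell
# Hypothetical helper: count sample lines in Prometheus text exposition
# read from stdin, skipping '#' comment lines and blank lines.
count_samples() {
  awk '!/^#/ && NF { n++ } END { print n + 0 }'
}

# Usage against a running manager (host and port assume a local setup):
#   curl -fsS http://localhost:20900/metrics | count_samples

# Offline demonstration with illustrative series; prints 2.
count_samples <<'EOF'
# HELP example_requests_total Illustrative counter, not a real QoD series.
# TYPE example_requests_total counter
example_requests_total{status="ok"} 12
example_requests_total{status="error"} 1
EOF
```

A result of 0 from a live manager, or a failing curl, suggests that the sink is not prometheus or that the port is not reachable.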

A minimal scrape config:

scrape_configs:
  - job_name: quack-on-demand
    static_configs:
      - targets: ['quack-manager.svc:20900']
    metrics_path: /metrics

Local stack via the observability profile

The compose stack bundles Prometheus and Grafana behind the observability profile. Bring up the manager, Postgres, Prometheus, and Grafana in one command, with TPC-H seeded so the dashboard has live data:

LOAD_TPCH=1 PROFILES=observability ./scripts/run-docker-compose.sh

# Clean slate
NUKE=1 LOAD_TPCH=1 PROFILES=observability ./scripts/run-docker-compose.sh

Prometheus scrapes the manager container directly over the compose network. Grafana is preprovisioned with the Prometheus datasource and the bundled dashboard, so it renders without manual setup. The boot output prints the URLs:

Manager UI:  http://localhost:20900/ui/  (admin / admin)
Prometheus:  http://localhost:9090       (try query: up)
Grafana:     http://localhost:3000       (anonymous admin; no login)
Dashboard:   "Quack-on-Demand - Operator Overview"

Grafana runs with anonymous admin access for zero-login local use; do not expose port 3000 to a public network without first disabling it. See the Docker deployment page for the profile mechanics. Tear down with docker compose -f docker-compose.yml --profile observability down.

Standalone Prometheus + Grafana

When the manager runs outside compose (for example in Kubernetes, reached by kubectl port-forward), bring up only the observability containers from the observability/ directory. They scrape the manager via host.docker.internal:20900:

kubectl -n qod port-forward svc/qod-quack-on-demand 20900:20900 &
docker compose -f observability/docker-compose.yml up -d

The preprovisioning (datasource UID, auto-loaded dashboard) is identical to the integrated path.

Cloud push (aws / azure / gcp)

When a cloud sink is selected, the manager pushes metrics on a fixed cadence (default 60s) via the cloud SDK, and the /metrics Prometheus endpoint is not exposed.

| Sink | Select with | Required config | Credentials |
| --- | --- | --- | --- |
| aws | QOD_METRICS_SINK=aws | QOD_METRICS_AWS_NAMESPACE (default quack-on-demand) | DefaultCredentialsProvider chain (IAM role, env, profile) |
| azure | QOD_METRICS_SINK=azure | QOD_METRICS_AZURE_KEY (Application Insights key, required) | DefaultAzureCredential (managed identity, env, CLI) |
| gcp | QOD_METRICS_SINK=gcp | QOD_METRICS_GCP_PROJECT_ID (required) | ADC (GOOGLE_APPLICATION_CREDENTIALS, GCE metadata, gcloud) |

Override the cadence with QOD_METRICS_AWS_STEP_SEC / QOD_METRICS_AZURE_STEP_SEC / QOD_METRICS_GCP_STEP_SEC.

Only one sink runs per process. There are no per-backend enable flags; the single sink field is the sole selector. Selecting a cloud sink means /metrics is unavailable and no other sink is active.
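The cloud sink settings suggest a small pre-flight check: azure and gcp each need one variable that has no default, while aws falls back to its default namespace. The following is a minimal sketch for a POSIX shell. qod_missing_config is a hypothetical helper, my-gcp-project is a placeholder, and passing the step as a plain integer number of seconds is an assumption drawn from the _STEP_SEC suffix.

```shell
# Hypothetical helper: print the name of the required variable that is
# unset or empty for the given sink; print nothing when config is complete.
qod_missing_config() {
  case "$1" in
    azure) [ -n "${QOD_METRICS_AZURE_KEY:-}" ] || echo QOD_METRICS_AZURE_KEY ;;
    gcp)   [ -n "${QOD_METRICS_GCP_PROJECT_ID:-}" ] || echo QOD_METRICS_GCP_PROJECT_ID ;;
    *)     : ;;  # aws has a default namespace; prometheus and none need nothing
  esac
}

# Example: select the GCP sink and push every 30 seconds instead of 60.
export QOD_METRICS_SINK=gcp
export QOD_METRICS_GCP_PROJECT_ID=my-gcp-project   # placeholder value
export QOD_METRICS_GCP_STEP_SEC=30                 # assumed to be integer seconds

missing=$(qod_missing_config "$QOD_METRICS_SINK")
if [ -n "$missing" ]; then
  echo "cannot start: $missing is not set" >&2
else
  echo "config complete for sink $QOD_METRICS_SINK"
fi
```

The check covers configuration only. Credentials are resolved by each cloud SDK's default chain and are not validated here.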

Common labels

Attach static labels to every series to distinguish environments in a shared Grafana:

| Variable | HOCON key | Purpose |
| --- | --- | --- |
| QOD_METRICS_DEPLOYMENT | metrics.commonTags.deployment | Deployment name, e.g. prod-eu |
| QOD_METRICS_REGION | metrics.commonTags.region | Cloud region, e.g. eu-west-1 |
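With the Prometheus sink, one way to confirm the labels are applied is to look for samples that lack them. This is a sketch under two assumptions not stated above: that the tag keys appear verbatim as Prometheus label names (deployment, region), and that the manager is on localhost:20900. samples_missing_label is a hypothetical helper, and it uses a simple substring match rather than a full exposition-format parser.

```shell
# Set the common labels before starting the manager.
export QOD_METRICS_DEPLOYMENT=prod-eu
export QOD_METRICS_REGION=eu-west-1

# Hypothetical helper: print sample lines from stdin that do not contain
# label="..." for the given label name. Empty output means every sample
# carries it. Substring match only, so a longer label name ending in the
# same text would also count as a match.
samples_missing_label() {
  awk -v label="$1" '!/^#/ && NF && index($0, label "=\"") == 0 { print }'
}

# Usage against a running manager (host and port assume a local setup):
#   curl -fsS http://localhost:20900/metrics | samples_missing_label deployment

# Offline demonstration with illustrative series; prints only the last line.
samples_missing_label deployment <<'EOF'
# TYPE example_requests_total counter
example_requests_total{deployment="prod-eu",region="eu-west-1"} 12
example_requests_total{region="eu-west-1"} 3
EOF
```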

Disabling metrics

Set QOD_METRICS_SINK=none: no /metrics endpoint is mounted, no cloud push occurs, and all counters, timers, and gauges become no-ops.

The bundled Grafana dashboard

observability/grafana-dashboard.json is a single-screen operator overview, ready to import. In Grafana 10.x, go to Dashboards → New → Import → Upload JSON file, then pick your Prometheus datasource; the ${datasource} variable resolves to its UID.

| Row | Panels |
| --- | --- |
| Overview | Total QPS, error rate, active sessions, total nodes |
| Latency | p50 / p95 / p99 statement-duration percentiles |
| By Tenant | Stacked QPS per tenant, outcomes by status |
| Pool Occupancy | Node count by tenant / pool / role |
| Node Health | Per-node table: healthy, draining, in-flight, EWMA latency |
| JVM | Heap used, GC pause rate, live threads, process uptime |

The metric names and labels these panels query are listed in the Metrics reference. For the QOD_METRICS_* configuration keys, see the Configuration reference.