
Monitoring & observability

Health checks, OpenTelemetry metrics and traces, audit logs and slow-query logs, and integration with common observability stacks.

SurrealDB exposes a built-in /health HTTP endpoint suitable for load balancer and orchestrator probes. A successful response indicates the process is accepting requests. Combine /health with deeper checks so that partial failures such as slow queries, disk pressure, and replication lag are detected before the probe itself starts failing.
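A liveness probe against /health can be sketched in a few lines of Python. This is a minimal illustration, not an official client: the helper names `fetch_status` and `is_healthy` are hypothetical, and treating any 2xx status as "accepting requests" is an assumption based on the description above.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, or timeout


def is_healthy(base_url: str, fetch: Callable[[str], int] = fetch_status) -> bool:
    """Assumption: any 2xx answer from /health means the process is up."""
    status = fetch(base_url.rstrip("/") + "/health")
    return 200 <= status < 300
```

Calling `is_healthy("http://localhost:8000")` would probe a local instance, assuming it listens on that address. The `fetch` parameter is injectable so the decision logic can be exercised without a running server.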

For everything beyond a liveness probe, the Observability section is the canonical reference and operator guide. The pages there cover metrics, telemetry access, audit logs and slow-query logs for both Community and Enterprise editions.

The OpenTelemetry exporter speaks OTLP gRPC, so anything that ingests OTLP works. Two common patterns:

  • Prometheus pull. Scrape the built-in /metrics endpoint. Anonymous scrapers see only the public allowlist; pass root credentials to unlock the full surface. Build Grafana dashboards on the resulting time series and pair them with Alertmanager to implement the alert hints recommended for production.

  • OTLP push. Point SurrealDB at an OpenTelemetry collector and route from there into Prometheus remote-write, Tempo / Jaeger for traces, and Loki or a SIEM for logs (including audit and slow-query records when their OTel export is opted in). Label streams by environment (production, staging, dev) at the collector so dashboards do not mix traffic accidentally.
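For an ad hoc look at what a scrape returns, the response body can be parsed directly. The sketch below assumes /metrics serves the Prometheus text exposition format and uses a deliberately simplified parser (it ignores timestamps and unusual label escaping). The function name `parse_metrics` and the sample payload are made up for illustration; only the metric names come from this page.

```python
import re
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# metric_name{label="value",...} value [timestamp]
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str, prefix: str = "surrealdb_") -> List[Sample]:
    """Extract (name, labels, value) for every sample whose name has the prefix."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None or not match.group(1).startswith(prefix):
            continue
        labels = dict(_LABEL.findall(match.group(2) or ""))
        samples.append((match.group(1), labels, float(match.group(3))))
    return samples


# Made-up scrape output, for illustration only.
SAMPLE = """\
# TYPE surrealdb_statement_total counter
surrealdb_statement_total{outcome="error"} 3
surrealdb_session_active 5
process_cpu_seconds_total 1.5
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

How root credentials are attached to the scrape request is not shown here, since it depends on the deployment. In production, Prometheus or the collector performs this parsing itself.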

Either path uses the same surrealdb.* instrument namespace and the same service.edition resource attribute, so dashboards travel cleanly across deployments.

The full alert-hints starter set lives on the metrics page, but the must-watch signals for any production deployment are:

  • Error rate: rate(surrealdb_statement_total{outcome="error"}[5m]) rising sharply against baseline.

  • Transaction conflicts: rate(surrealdb_transaction_conflicts_total[5m]), which doubles as the retry-pressure signal.

  • Latency tails: histogram percentiles on surrealdb_statement_duration_seconds, surrealdb_query_duration_seconds, surrealdb_http_request_duration_seconds.

  • Active sessions and live queries: surrealdb_session_active, surrealdb_live_query_active for capacity planning.

  • Audit pipeline health (Enterprise): surrealdb_audit_dropped, surrealdb_audit_append_errors, surrealdb_audit_queue_depth to catch lost or delayed audit records.
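To make the latency-tail signal concrete, the sketch below estimates a percentile from cumulative histogram buckets using linear interpolation inside the matching bucket, the same general approach PromQL's histogram_quantile takes. It assumes the duration histograms are exposed as cumulative buckets with a +Inf bucket. The function name and bucket values are illustrative only.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must include a math.inf upper bound holding the total count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf upper bound")
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations recorded
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if bound == math.inf:
                return prev_bound  # cannot interpolate into +Inf
            if count == prev_count:
                return prev_bound  # empty bucket, nothing to interpolate
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Illustrative buckets in seconds: 50 requests <= 0.1s, 90 <= 0.5s, and so on.
example = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(round(histogram_quantile(0.95, example), 3))
```

The estimate is only as precise as the bucket boundaries, so a p99 that lands in a wide bucket should be read as a range rather than an exact value.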

Define SLOs where appropriate — latency and availability targets — and use burn-rate alerts on the resulting error budgets rather than relying on one-off thresholds alone.
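As a rough illustration of burn-rate alerting, the burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below pages only when both a long and a short window exceed a threshold, which filters out brief spikes. The function names are hypothetical, and the default threshold of 14.4 is a commonly cited value (2% of a 30-day budget consumed in one hour), not a recommendation from this page.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    budget = 1.0 - slo_target
    if budget <= 0.0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    long_window_ratio: float,
    short_window_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumption: 2% of a 30-day budget in 1 hour
) -> bool:
    """Page only if both windows burn faster than the threshold."""
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )


# With a 99.9% availability SLO, a sustained 2% error ratio burns 20x budget.
print(should_page(0.02, 0.02, slo_target=0.999))
```

The error ratios would typically come from the error-rate signal divided by total statement rate over each window. A burn rate of 1.0 means the budget lasts exactly the SLO period.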
