SurrealDB exposes a built-in /health HTTP endpoint suitable for load balancer and orchestrator probes. A successful response indicates the process is accepting requests.
Combine /health with deeper checks so you catch partial failures, such as slow queries, disk pressure, or replication lag, that a liveness probe alone would miss.
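As a rough illustration of layering a deeper check on top of the probe, the sketch below calls /health and also treats a slow but successful response as degraded. It assumes a 200 status signals success. The `ProbeResult` type, the `probe` and `http_status` helpers, and the 0.5 second threshold are hypothetical names and values chosen for this example, not part of SurrealDB.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    healthy: bool
    latency_s: float
    detail: str


def http_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status of a GET request; raises OSError if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code worth reporting.
        return err.code


def probe(
    base_url: str,
    fetch_status: Callable[[str], int] = http_status,
    slow_after_s: float = 0.5,  # assumed threshold; tune for your deployment
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Check /health and flag responses that succeed but take too long."""
    start = clock()
    try:
        status = fetch_status(base_url.rstrip("/") + "/health")
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        return ProbeResult(False, clock() - start, f"unreachable: {err}")
    elapsed = clock() - start
    if status != 200:
        return ProbeResult(False, elapsed, f"status {status}")
    if elapsed > slow_after_s:
        return ProbeResult(False, elapsed, "slow response")
    return ProbeResult(True, elapsed, "ok")
```

A scheduler or sidecar could call `probe("http://localhost:8000")` (the address is an assumption) and export the result as a metric. Checks for disk pressure or replication lag would need their own data sources and are not covered here.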
For metrics and distributed traces, enable OpenTelemetry as described in Observability: set the telemetry provider to OTLP and point exporters at your collector.
That gives you consistent spans and metrics alongside other services. Tag telemetry with its environment (production, staging) so dashboards do not accidentally mix traffic.
Key metrics to monitor include query latency (p50/p95/p99 where available), error rates by endpoint, active connections, and storage usage or growth on the backing engine. Alert on sustained latency increases, connection exhaustion, or free-space thresholds before they become outages.
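To make "sustained latency increases" concrete, here is a minimal sketch that derives p50/p95/p99 from raw latency samples and flags a breach only after several consecutive evaluation windows exceed a threshold. In practice the metrics backend usually does this; the function names, the threshold, and the window count are illustrative assumptions.

```python
from statistics import quantiles
from typing import Dict, Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Compute p50/p95/p99 from latency samples (needs at least two samples)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def sustained_breach(
    series_ms: Sequence[float],
    threshold_ms: float,
    min_consecutive: int = 3,  # assumed number of windows; tune to taste
) -> bool:
    """True if the series stays above the threshold for min_consecutive windows in a row."""
    streak = 0
    for value in series_ms:
        streak = streak + 1 if value > threshold_ms else 0
        if streak >= min_consecutive:
            return True
    return False
```

Requiring consecutive breaches trades a little detection delay for fewer alerts on momentary spikes.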
From the collector, you can expose metrics to Prometheus-compatible scrapers or remote-write targets, build Grafana dashboards, and trace requests in Jaeger or a vendor APM. Use whatever your organisation already standardises on, so that SurrealDB appears on the same boards as the rest of the stack.
Log aggregation (structured logs shipped to your central store) complements metrics: during incidents, correlate trace IDs, query text where it is safe to log, and server events. Define SLOs where appropriate, for example latency and availability targets, and use burn-rate alerts on error budgets rather than relying only on noisy one-off thresholds.
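The burn-rate idea can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a value of 1 consumes the budget exactly over the SLO period. The example pages only when both a short and a long window exceed the threshold. The function names and window shapes are hypothetical, and 14.4 is a commonly cited starting value for a fast-burn alert, not a SurrealDB recommendation.

```python
from typing import Tuple

Window = Tuple[int, int]  # (failed requests, total requests) in one time window


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """Return how many times faster than sustainable the error budget is burning."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so no budget consumed
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(
    short_window: Window,
    long_window: Window,
    slo_target: float,
    threshold: float = 14.4,  # assumed fast-burn threshold; adjust per SLO policy
) -> bool:
    """Page only if both windows burn fast, which filters out brief blips."""
    return (
        burn_rate(*short_window, slo_target) >= threshold
        and burn_rate(*long_window, slo_target) >= threshold
    )
```

For instance, with a 99.9% availability target, 20 failures out of 1,000 requests is a burn rate of about 20, while 50 out of 10,000 is about 5. The first would trip the default threshold, the second would not.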