Streamlining infrastructure monitoring

Industry: Technology

Outcome: Simplified the stack from nine tools to one, building a unified, graph-first monitoring platform for real-time incident analysis

Overview: building a graph-first monitoring platform

Tencent operates internet-scale cloud and infrastructure services where reliability, fast incident response, and deep system visibility are non-negotiable. To simplify how teams monitor and debug production systems, Tencent integrated SurrealDB to unify its infrastructure monitoring stack. By combining native graph capabilities with versioned data access and built-in analytics, the team replaced a fragmented toolchain with a single, scalable, real-time monitoring platform.

Before SurrealDB, the monitoring and analysis experience spanned a patchwork of nine systems: MySQL, Elasticsearch, VictoriaMetrics (Prometheus-compatible), MongoDB, Doris, Trino, RisingWave, Flink (batch and stream processing), and Dgraph. This meant every investigation started with a meta-problem: choosing the “right” storage or compute engine before you could even answer the question.

Challenge: the cost of a fragmented data stack

Tencent’s platform managed many databases and processing layers at once. In practice, users had to decide which system to query depending on whether they were exploring time-oriented telemetry, searching, running analytics, or traversing relationships. For data analysts, that decision was especially challenging without deep familiarity with the strengths and tradeoffs of each engine. For the platform management team, the cost was even clearer: more systems to operate, more upgrades and failure modes to manage, more governance to enforce, and a higher ongoing maintenance burden.

At the same time, the most valuable monitoring workflows were increasingly graph-shaped. The team needed to model real operational context - process behavior, parent/child derivations, and traffic relationships - and then query that context fast during incidents. Their data came in two complementary streams: periodic process “snapshots” that capture the state of a host at a moment in time, and real-time changelogs that record process actions as time-series messages. From those inputs, Tencent builds a process tree where process IDs are nodes, process attributes are node properties, and derivation relationships are directed edges.
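To make the process-tree model concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not Tencent's implementation and not a SurrealDB API: the `ProcessNode` and `ProcessTree` classes, the `pid`/`ppid`/`cmd` field names, and the `fork`/`exit` changelog actions are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessNode:
    """One process: the PID is the node key, attributes are node properties."""
    pid: int
    attrs: dict = field(default_factory=dict)


@dataclass
class ProcessTree:
    nodes: dict = field(default_factory=dict)  # pid -> ProcessNode
    edges: set = field(default_factory=set)    # directed (parent_pid, child_pid)

    def children(self, pid):
        return sorted(child for parent, child in self.edges if parent == pid)


def tree_from_snapshot(snapshot):
    """Build a tree from a snapshot: a list of dicts with pid, ppid, and attributes."""
    tree = ProcessTree()
    for row in snapshot:
        attrs = {k: v for k, v in row.items() if k not in ("pid", "ppid")}
        tree.nodes[row["pid"]] = ProcessNode(row["pid"], attrs)
    for row in snapshot:
        # Keep only derivation edges whose parent is also present in the snapshot.
        if row["ppid"] in tree.nodes:
            tree.edges.add((row["ppid"], row["pid"]))
    return tree


def apply_change(tree, event):
    """Apply one changelog event (hypothetical 'fork' / 'exit' actions) in place."""
    if event["action"] == "fork":
        tree.nodes[event["pid"]] = ProcessNode(event["pid"], {"cmd": event["cmd"]})
        if event["ppid"] in tree.nodes:
            tree.edges.add((event["ppid"], event["pid"]))
    elif event["action"] == "exit":
        tree.nodes.pop(event["pid"], None)
        tree.edges = {(p, c) for p, c in tree.edges if event["pid"] not in (p, c)}
    return tree


# Hypothetical sample data: one snapshot followed by two changelog events.
snapshot = [
    {"pid": 1, "ppid": 0, "cmd": "init"},
    {"pid": 100, "ppid": 1, "cmd": "nginx"},
    {"pid": 101, "ppid": 100, "cmd": "nginx-worker"},
]
tree = tree_from_snapshot(snapshot)
apply_change(tree, {"action": "fork", "pid": 102, "ppid": 100, "cmd": "nginx-worker"})
apply_change(tree, {"action": "exit", "pid": 101})
```

The snapshot seeds the tree, and each changelog event then mutates it, which mirrors the two complementary input streams described above.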

The core requirement wasn’t only to build and query the graph - it was to keep the graph continuously updated and versioned. Each new snapshot or changelog updates the previous graph state, producing a new “tree” that must be available quickly enough to support policy judgment. When an alert triggers, engineers need to reconstruct context from a window of time leading up to the fault, find the first appearance of a process, and understand how the state evolved from one version to another. Doing this across multiple engines made the workflow slow, complex, and operationally expensive.
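The versioning requirement can be sketched as an append-only history of graph states that supports point-in-time lookup, first-appearance search, and version-to-version diffs. The Python below is a simplified illustration under assumed names (`VersionedTree`, `commit`, `state_at`, `first_appearance`, `diff`). It stores a full copy of the edge set per version for clarity and does not describe how SurrealDB or SurrealKV store versions internally.

```python
from bisect import bisect_right


class VersionedTree:
    """Append-only history of process-tree versions, ordered by timestamp."""

    def __init__(self):
        self._timestamps = []  # ascending commit times
        self._versions = []    # one frozenset of (parent_pid, child_pid) edges per commit

    def commit(self, ts, edges):
        if self._timestamps and ts <= self._timestamps[-1]:
            raise ValueError("versions must be committed in increasing time order")
        self._timestamps.append(ts)
        self._versions.append(frozenset(edges))

    def state_at(self, ts):
        """Return the latest version committed at or before ts (empty if none)."""
        i = bisect_right(self._timestamps, ts)
        return self._versions[i - 1] if i else frozenset()

    def first_appearance(self, pid):
        """Return the timestamp of the earliest version that mentions pid, or None."""
        for ts, edges in zip(self._timestamps, self._versions):
            if any(pid in edge for edge in edges):
                return ts
        return None

    def diff(self, ts_old, ts_new):
        """Return (added_edges, removed_edges) between two points in time."""
        old, new = self.state_at(ts_old), self.state_at(ts_new)
        return new - old, old - new


# Hypothetical history: three versions of a small process tree.
history = VersionedTree()
history.commit(10, {(1, 100)})
history.commit(20, {(1, 100), (100, 101)})
history.commit(30, {(1, 100), (100, 102)})

added, removed = history.diff(20, 30)
```

A real system would more likely store deltas rather than full copies, but the three query shapes stay the same: state at a time, first appearance, and the change between two versions.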

Solution: consolidating into a multi-model context graph

Tencent adopted SurrealDB as a unified data layer for monitoring correlation and graph-driven fault analysis, consolidating the existing toolchain into a single system that natively supports document data and graph relationships. Instead of forcing users to understand which storage or compute engine should answer each question, SurrealDB provides a consistent model for querying operational context, especially when that context is best represented as a graph.

SurrealDB’s native graph model maps directly to Tencent’s fault-analysis needs: processes and infrastructure entities become nodes, relationships become edges, and engineers can traverse multi-hop dependencies to understand blast radius or root cause context. This “context graph” approach is also how many teams build knowledge graphs for operations: connecting telemetry-adjacent entities (processes, workloads, services, hosts) with the relationships that matter during incidents.
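As an illustration of multi-hop traversal for blast-radius analysis, the following Python sketch runs a bounded breadth-first search over directed edges. The `blast_radius` function, the entity names, and the edge direction convention ("X -> Y" means a failure in X can affect Y) are assumptions for the example. In SurrealDB the traversal would be expressed as a graph query rather than application code.

```python
from collections import defaultdict, deque


def blast_radius(edges, start, max_hops):
    """Return {node: hop_count} for everything reachable from start within max_hops."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)

    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    del hops[start]  # the starting node is not part of its own blast radius
    return hops


# Hypothetical dependency edges between a host, processes, and services.
edges = [
    ("host-a", "nginx"),
    ("nginx", "checkout-svc"),
    ("checkout-svc", "payments-svc"),
    ("host-a", "cron"),
]
affected = blast_radius(edges, "host-a", max_hops=2)
```

With a two-hop limit, `payments-svc` (three hops away) is excluded, which shows how the hop bound controls how wide an investigation fans out.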

Versioned data access was equally important. Tencent used SurrealKV to validate temporal graph requirements - essentially testing how well versioned graph states could support point-in-time investigations and replaying the evolution of a process tree. For broader production scenarios, they used SurrealDB with a distributed storage layer to scale the system while keeping the operational experience cohesive.

The production deployment reflects real monitoring scale: the cluster runs with 9 storage nodes and 6 compute nodes, sustains 10,000+ QPS, and manages a dataset of roughly 8 million nodes and 50 million edges. That capacity enables the team to treat the graph as a first-class monitoring substrate rather than an auxiliary system used only for niche investigations.

Results: turning context into actionable insight

By consolidating many systems into SurrealDB, Tencent reduced complexity for both end users and the platform team. Analysts no longer have to begin each investigation by choosing between a time-series engine, a search index, an OLAP layer, or a separate graph database. Instead, the operational “shape” of the problem can drive the investigation: start from an alert, pull the relevant time window, traverse the dependency graph, and reconstruct the context leading up to failure.
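The alert-driven flow described here (start from an alert, pull a time window, walk the graph, reconstruct context) can be sketched in a few lines of Python. Everything below is hypothetical: the `incident_context` function, the event fields `ts`, `pid`, and `action`, and the choice to follow parent edges upward from the alerting process are assumptions for illustration, not a description of Tencent's tooling.

```python
def incident_context(events, edges, alert_pid, alert_ts, window):
    """Return events in [alert_ts - window, alert_ts] for alert_pid and its ancestors."""
    parent_of = {child: parent for parent, child in edges}

    # Walk up the derivation edges to collect the alerting process's lineage.
    lineage = []
    pid = alert_pid
    while pid is not None and pid not in lineage:
        lineage.append(pid)
        pid = parent_of.get(pid)

    start = alert_ts - window
    relevant = [
        e for e in events
        if start <= e["ts"] <= alert_ts and e["pid"] in lineage
    ]
    return sorted(relevant, key=lambda e: e["ts"])


# Hypothetical process tree and changelog; an alert fires on pid 101 at ts=100.
edges = [(1, 100), (100, 101), (1, 200)]
events = [
    {"ts": 50, "pid": 100, "action": "start"},
    {"ts": 95, "pid": 100, "action": "config_reload"},
    {"ts": 97, "pid": 200, "action": "start"},
    {"ts": 99, "pid": 101, "action": "fork"},
    {"ts": 105, "pid": 101, "action": "exit"},
]
context = incident_context(events, edges, alert_pid=101, alert_ts=100, window=10)
```

The result keeps only events from the alerting process and its ancestors inside the window, so unrelated activity (pid 200) and anything before the window or after the alert drops out.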

Operationally, fewer moving parts translated into less overhead - simpler governance, fewer upgrades to coordinate, fewer cross-system inconsistencies, and a smaller surface area for failures. On the workflow side, graph-native modeling made it practical to build and query process trees as they change, supporting fast multi-hop traversal and incident reconstruction patterns that align with graph databases, context graphs, and knowledge graphs.

Looking ahead: the future of graph-driven observability

Tencent plans to extend the unified, graph-native platform across more infrastructure domains, evolving its context graph into a broader knowledge graph that connects services, processes, and dependencies in a single, queryable model.

A key focus is strengthening temporal graph capabilities, making it even easier to reconstruct historical states and analyze how relationships evolve over time. As the dataset continues to grow, Tencent also aims to build more advanced graph analytics on top of its millions of nodes and edges - turning the graph foundation into a long-term strategic asset for dependency analysis and intelligent fault resolution.