Storing system information in a graph database?

Rich_Morin · February 27, 2023, 2:46am

tl;dr - I’d like to populate Neo4j with system information

Overview

I’d like to make both static and dynamic system information available as a labeled property graph, using a graph database such as Neo4j. This would start with sets of source code (e.g., applications), building graphs of atoms, functions, modules, etc. It would then fold in dynamic system information, tracking messages, nodes, processes, VM activity, etc. And a pony…

The basic idea is that anything and everything should be up for grabs, from atoms through functions to modules, nodes, processes, and more. If it looks plausibly useful, track and record it for later exploration. (It’s all good…)

Use Cases

Once the data is harvested and saved, various clients could access and process the result. For example, front ends could present both summarized and specific system information, providing dynamic and historical context. As in:

Which processes have recently used this function?
Which nodes are currently running this app, module, etc?
What’s the status and recent history of this process?
…

I’m not at all sure about how this data might be employed, but it seems likely to be useful. FWIW, I plan to start by using Cypher and assorted Neo4j Tools for initial exploration and tuning. After that, setting up some Livebooks and a LiveView web interface might be worth considering. Finally, if the database proves useful, one or more presentation front ends might opt to take advantage of it.

Status, etc.

At this point, I’m casting about for convenient (and preferably, low impact) ways to harvest app and system information. Fortunately, some broad-spectrum data collection facilities already exist, including ElixirLS and recon (other suggestions welcome!).

In this context, I’d like to know what other options exist for getting useful system snapshots. Could anyone point me to relevant APIs, data structures, etc? Basically, I’d like to know what data is readily available and what things seem most likely to be useful. As usual, I welcome advice, comments, reactions, etc.

-r

Rich_Morin · February 27, 2023, 8:46am

tl;dr - some possibly useful background information

It occurs to me that some readers may not be familiar with Labeled Property Graphs in general or Neo4j in particular. More to the point, it may not be obvious why these could be a good fit for storing and accessing system information. So, here’s a (somewhat high level and opinionated) rundown…

A labeled-property graph model is represented by a set of nodes, relationships, properties, and labels. Both nodes of data and their relationships are named and can store properties represented by key–value pairs. Nodes can be labelled to be grouped. The edges representing the relationships have two qualities: they always have a start node and an end node, and are directed; making the graph a directed graph. Relationships can also have properties. This is useful in providing additional metadata and semantics to relationships of the nodes. Direct storage of relationships allows a constant-time traversal.

– Graph_database, in Wikipedia

Careful Reader will notice that this overloads the term node. That is, a Neo4j node can describe an Erlang node, but it could just as easily describe a function, module, process, etc. To reduce confusion, I’ll use the term “entity” below for the Neo4j meaning. However, note that this is not the only linguistic clash between Neo4j and Erlang: other issues include name syntax, etc. So, YMMV…

Stored Information

We’d like to store several kinds of information:

generic entity definitions (e.g., what is a function?)
possible relationships (e.g., modules can define functions)
specific entities (e.g., module Foo, function Foo.bar/1)
specific relationships (e.g., Foo.bar/2 calls Foo.bar/1)

Some of this information is generic; other information is specific to a given code base and/or set of running processes. So, we should gather each type of information when it either becomes available or changes.
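
To make the four kinds of information above a bit more concrete, here is a minimal in-memory sketch in Python. It is not Neo4j and not a proposed schema: the class names, the DEFINES and CALLS relationship types, and the `allowed` set are all hypothetical, chosen only to mirror the Foo examples in the list.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A graph node ("entity" in this post's terminology)."""
    key: str                 # unique name, e.g. "Foo.bar/1"
    labels: frozenset        # generic kinds, e.g. {"Function"}
    props: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed edge between two entities."""
    start: str
    type: str
    end: str
    props: dict = field(default_factory=dict)


class PropertyGraph:
    def __init__(self, allowed):
        # Possible relationships, as (start label, type, end label) triples.
        self.allowed = set(allowed)
        self.entities = {}
        self.relationships = []

    def add_entity(self, key, labels, **props):
        self.entities[key] = Entity(key, frozenset(labels), props)
        return self.entities[key]

    def relate(self, start, rel_type, end, **props):
        a, b = self.entities[start], self.entities[end]
        # Reject relationships that the generic definitions do not permit.
        if not any((la, rel_type, lb) in self.allowed
                   for la in a.labels for lb in b.labels):
            raise ValueError(f"{rel_type} not allowed from {start} to {end}")
        rel = Relationship(start, rel_type, end, props)
        self.relationships.append(rel)
        return rel

    def outgoing(self, start, rel_type):
        return [r.end for r in self.relationships
                if r.start == start and r.type == rel_type]


# Generic information: which relationships are possible at all.
graph = PropertyGraph({("Module", "DEFINES", "Function"),
                       ("Function", "CALLS", "Function")})

# Specific information: entities and relationships from one code base.
graph.add_entity("Foo", {"Module"})
graph.add_entity("Foo.bar/1", {"Function"}, arity=1)
graph.add_entity("Foo.bar/2", {"Function"}, arity=2)
graph.relate("Foo", "DEFINES", "Foo.bar/1")
graph.relate("Foo", "DEFINES", "Foo.bar/2")
graph.relate("Foo.bar/2", "CALLS", "Foo.bar/1")
```

The generic part (the `allowed` set) changes rarely, while the specific part would be refreshed whenever the code base or the set of running processes changes.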

Access, Data Structures, and Performance

The Cypher query language isn’t the only way to access Neo4j, but it’s definitely the Golden Path. Cypher syntax looks a bit like a cross between SQL and DOT (graph description) notation, with a bit of pattern matching thrown in. So, a typical query might look something like this:

MATCH (nicole:Actor {name: 'Nicole Kidman'})-[:ACTED_IN]->(movie:Movie)
WHERE movie.year < $yearParameter
RETURN movie

Neo4j has indexes that make it reasonably fast at finding specified entities and relationships, but the real speedup comes when we start traversing chains of relationships. Neo4j stores each relationship alongside the entities it connects, so following one is closer to dereferencing a pointer than to doing an index lookup (and it is fastest when the relevant data is cached in memory). So, it’s real fast at finding related entities, even when multiple relationships are involved.

This means that answering a multi-step “join” like “list all processes running on node name@host that have recently called function Foo.bar/1 via Foo.bar/2” can be expected to happen quite quickly. So, asking a Livebook to generate and display a Mermaid diagram of this sort of information isn’t a crazy idea.
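
As a rough illustration of why this kind of multi-step question is cheap once relationships are stored with their endpoints, here is a small Python stand-in. It says nothing about Neo4j's internals; the RUNS_ON, CALLED, and CALLS relationship types, the `at` timestamp property, and the process identifiers are all assumptions made up for the example.

```python
from collections import defaultdict


class Graph:
    """Adjacency lists keyed by (entity, relationship type), in both directions."""

    def __init__(self):
        self.out = defaultdict(list)   # (start, type) -> [(end, props)]
        self.inc = defaultdict(list)   # (end, type)   -> [(start, props)]

    def relate(self, start, rel_type, end, **props):
        self.out[(start, rel_type)].append((end, props))
        self.inc[(end, rel_type)].append((start, props))


def recent_indirect_callers(g, node, target, via, since):
    """Processes on `node` that called `via` at or after `since`,
    where `via` is known to call `target`."""
    # Step 1: check that `via` really calls `target`.
    if all(end != target for end, _ in g.out[(via, "CALLS")]):
        return []
    # Step 2: follow RUNS_ON backwards from the node to its processes.
    on_node = {proc for proc, _ in g.inc[(node, "RUNS_ON")]}
    # Step 3: follow CALLED backwards from `via`, keeping recent calls only.
    recent = {proc for proc, props in g.inc[(via, "CALLED")]
              if props["at"] >= since}
    return sorted(on_node & recent)


g = Graph()
g.relate("Foo.bar/2", "CALLS", "Foo.bar/1")
g.relate("<0.101.0>", "RUNS_ON", "name@host")
g.relate("<0.102.0>", "RUNS_ON", "name@host")
g.relate("<0.103.0>", "RUNS_ON", "other@host")
g.relate("<0.101.0>", "CALLED", "Foo.bar/2", at=1700)
g.relate("<0.102.0>", "CALLED", "Foo.bar/2", at=900)
g.relate("<0.103.0>", "CALLED", "Foo.bar/2", at=1800)

callers = recent_indirect_callers(
    g, "name@host", "Foo.bar/1", "Foo.bar/2", since=1000)
```

Each step only touches the relationships attached to one entity, so the cost depends on how many relationships those entities have rather than on the size of the whole graph.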

A Possible Use Case

Serendipity for the win… I just stumbled on a possible use case, described by Peer Stritzinger at Code BEAM V EU 21: Jumping gen_servers! A new way of building (…) applications.

In this talk, Peer describes a robotic assembly line which uses a mesh of Erlang nodes to control all of the stages and types of actions involved. In order to migrate processes to the most relevant nodes, he needs a source of information on the entire system.

Because he doesn’t want a single point of failure, a single Neo4j database would not suffice. However, multiple database clones might work just fine. And, even if he wants to keep critical information “in the mesh”, he could use the Neo4j database(s) as a convenient data source.

HTH, Rich

peerst · February 27, 2023, 11:20am

Thanks for the citation notification Rich.

For our use case we need a more lightweight solution than a replicated Neo4j instance. We have two graphs to manage: one is the application data flow graph, which can come from the user via an IDE we are building. The other is the network and node property graph. The former is relatively static and could be a use case for Neo4j; however, we need to have it (at least partially) replicated on all nodes to allow decentralised decision making.

The network and node property graph can be much more dynamic, and we need fast updates to it so that we can move computation around new obstacles quickly. For that, a classical Link State Protocol (LSP) is implemented between Erlang nodes (very simplified: each node floods the net with its info plus the surrounding topology on each local change). Here again we need the dataset on all nodes, both to implement the LSP itself and for local decision making. Classically, the Link State Packets are stored as is and graph algorithms are implemented directly on this dataset. Alternatively, with little extra effort, one can store them in a digraph (https://www.erlang.org/doc/man/digraph.html), which is a data structure added to OTP for exactly such use cases.
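
For readers who have not met link-state flooding before, the very simplified description above can be sketched roughly as follows. This is a toy Python simulation, not the implementation discussed in this post: the `Node` class, the per-origin sequence numbers (borrowed from how classical link-state protocols discard stale packets), and the four-node mesh are all assumptions for illustration.

```python
from collections import deque


class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = []   # directly connected Node objects
        self.lsdb = {}        # origin name -> (sequence number, neighbor names)

    def accept(self, origin, seq, links):
        """Store a packet if it is newer than what we hold; report whether it was."""
        known = self.lsdb.get(origin)
        if known is not None and known[0] >= seq:
            return False      # stale or duplicate: keep what we have
        self.lsdb[origin] = (seq, tuple(links))
        return True


def connect(a, b):
    a.neighbors.append(b)
    b.neighbors.append(a)


def flood(origin, seq):
    """Flood the origin's current neighbor list through the mesh.
    Returns the number of packet deliveries that took place."""
    links = sorted(n.name for n in origin.neighbors)
    origin.accept(origin.name, seq, links)
    queue = deque((origin, n) for n in origin.neighbors)
    deliveries = 0
    while queue:
        sender, receiver = queue.popleft()
        deliveries += 1
        # Only packets that were new to the receiver get forwarded,
        # and never back to the node they came from.
        if receiver.accept(origin.name, seq, links):
            queue.extend((receiver, n) for n in receiver.neighbors
                         if n is not sender)
    return deliveries


a, b, c, d = (Node(name) for name in "abcd")
connect(a, b)
connect(b, c)
connect(c, d)
connect(a, c)
flood(a, seq=1)
```

After the flood, every node holds the same record for `a`, which is what lets each of them make decisions locally.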

A digraph can store sizeable datasets since it is backed by ETS tables. It has a nice set of graph algorithms pre-implemented, and it’s not hard to add others due to its efficient underlying data structure. It’s a very useful and often neglected part of OTP.
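
As a language-neutral illustration of running a graph algorithm directly on stored link-state records, here is a small Python sketch. It is only loosely analogous to what one would do with digraph in Erlang (whose `get_short_path/3` also traverses breadth-first); the record layout, the `blocked` parameter, and the five-node topology are hypothetical.

```python
from collections import deque


def build_graph(lsdb):
    """Turn {origin: [neighbor, ...]} link-state records into adjacency sets."""
    graph = {}
    for origin, links in lsdb.items():
        graph.setdefault(origin, set()).update(links)
        for neighbor in links:
            graph.setdefault(neighbor, set())
    return graph


def short_path(graph, start, goal, blocked=frozenset()):
    """Breadth-first search for a path with the fewest hops, avoiding
    `blocked` nodes. Returns a list of node names, or None if no path exists."""
    if start in blocked or goal in blocked:
        return None
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        # Sorted only to make the result deterministic.
        for nxt in sorted(graph.get(current, ())):
            if nxt not in previous and nxt not in blocked:
                previous[nxt] = current
                queue.append(nxt)
    return None


# Hypothetical link-state records as each node would hold them.
lsdb = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "e"],
    "d": ["b", "e"],
    "e": ["c", "d"],
}
graph = build_graph(lsdb)
direct = short_path(graph, "a", "d")
detour = short_path(graph, "a", "d", blocked={"b"})
```

Because every node holds the same records, each one can compute the detour on its own when `b` becomes an obstacle.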