Graphing Elixir (ideas on improving observability)

tl;dr - Some Partly-Baked Ideas on improving observability…

The Elixir ecosystem has many tools for supporting observability: mostly, they collect and/or display data about running systems. However, almost none of these tools provide information about the relationships between entities (e.g., applications, nodes, processes) and events (e.g., function calls, message handling, process life cycles) in these systems.

This is rather unfortunate, because (a) data without context can be difficult to interpret and (b) some extremely large and complex systems are being built. So, I’d like to start a discussion of possible ways to improve the situation.

Let’s assume that our high-level goal is to help developers explore and understand the systems they’re working on. Graph-based collection, organization, and presentation of information can serve this goal both directly (e.g., via web-based text and diagrams) and indirectly (e.g., via LLM-based tools).

Erlang’s Observer

Observer is the only tool I am aware of that directly supports collection and presentation of graph-based information about Elixir programs (other suggestions welcome!). Although it has limited capabilities and scope, Observer is clearly a popular and valuable tool; thus, it’s a good starting point for this discussion.

Observer’s Applications tab displays diagrams of application supervisor trees (showing supervisor-worker links) and process trees (showing parent-child links). Each node can display a few details such as its name, ID, and state. Portions of these diagrams can also be selectively collapsed, for brevity. All told, this tab can be very useful for system exploration and understanding.

Limitations

As noted above, however, the tab has a number of limitations. Here are some obvious ones:

  • accessibility - doesn’t play nicely with screen readers
  • connectivity - doesn’t support data import or export
  • extensibility - no way to add/change functionality
  • flexibility - few ways to modify the output format
  • generality - only displays a few types of information
  • organization - no tools for organizing, searching, etc.

Speculation

In an ideal world, all of these (and other!) limitations would be addressed. However, let’s start with speculation about near-term possibilities, based largely on existing standards and technology.

Accessibility

Observer’s GUI is implemented via the wxWidgets library, so it isn’t accessible to blind users (e.g., via a screen reader). An LLM might be able to ingest the displayed content, but this would require some heavy lifting: image analysis, optical character recognition, etc.

Fortunately, there are various ways to create accessible and navigable renditions of graph-based information. The most obvious approach is to use HTML et al. For example, graphs can be represented using descriptive text, adjacency and/or edge lists, tables, and/or trees. All of these can be displayed using Semantic HTML, which provides useful hints for screen readers.
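As a rough sketch of the adjacency-list idea (in Python purely for brevity; the module names and the `render_tree` helper are made up for illustration and don't reflect any existing tool), an edge list can be turned into an adjacency list and then into nested Semantic HTML lists that a screen reader can walk:

```python
from collections import defaultdict
from html import escape

# Hypothetical supervisor-to-child edges; the names are illustrative only.
EDGES = [
    ("MyApp.Supervisor", "MyApp.Repo"),
    ("MyApp.Supervisor", "MyApp.Endpoint"),
    ("MyApp.Endpoint", "MyApp.PubSub"),
]


def adjacency(edges):
    """Turn an edge list into a parent -> [children] adjacency list."""
    adj = defaultdict(list)
    for parent, child in edges:
        adj[parent].append(child)
    return dict(adj)


def render_tree(adj, root):
    """Render the tree under `root` as nested <ul>/<li> elements.

    Assumes the graph is a tree (no cycles), as a supervision tree is.
    """
    children = adj.get(root, [])
    inner = "".join(render_tree(adj, child) for child in children)
    nested = f"<ul>{inner}</ul>" if children else ""
    return f"<li>{escape(root)}{nested}</li>"


adj = adjacency(EDGES)
page = (
    '<nav aria-label="Supervision tree"><ul>'
    + render_tree(adj, "MyApp.Supervisor")
    + "</ul></nav>"
)
print(page)
```

Nested lists are only one option; the same adjacency list could just as easily feed a table or a block of descriptive text.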

Conveniently, the textual content and semantic markup which help screen readers to function can also be harvested and interpreted by LLMs. (Indeed, web pages are already a major source of LLM training data.) This offers the possibility of using an LLM chatbot to access system information, data, etc. Both blind and sighted users could benefit from this approach.

It should also be possible to generate accessible diagrams (e.g., via the DOT graph description language and Accessible SVG). Of course, getting an LLM to ingest these diagrams could be tricky…
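For what it's worth, emitting DOT from a list of labeled edges takes very little code. Here is a minimal sketch (Python again, with invented node names and relation labels; `to_dot` is a hypothetical helper, not part of Graphviz):

```python
# Hypothetical labeled edges: (source, destination, relation).
EDGES = [
    ("MyApp.Supervisor", "MyApp.Repo", "supervises"),
    ("MyApp.Supervisor", "MyApp.Endpoint", "supervises"),
    ("MyApp.Endpoint", "MyApp.Repo", "calls"),
]


def quote_id(text):
    """Wrap text in double quotes, escaping embedded quotes as DOT expects."""
    return '"' + text.replace('"', '\\"') + '"'


def to_dot(edges, name="system"):
    """Return a DOT digraph with one labeled edge statement per tuple."""
    lines = [f"digraph {name} {{"]
    for src, dst, label in edges:
        lines.append(f"  {quote_id(src)} -> {quote_id(dst)} [label={quote_id(label)}];")
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(EDGES)
print(dot)
```

The resulting text can be handed to a Graphviz command such as `dot -Tsvg` to produce a diagram. Making that SVG properly accessible would take further work.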

Connectivity

System information can be collected from various sources, using existing logging and tracing facilities (e.g., Lager, OpenTelemetry, :trace). Although these tools aren’t specifically designed to handle information on attributes and relationships, a structured logging approach (e.g., using JSON, JSON-LD, or Turtle) could be used to encode information on nodes, links, etc.
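To make the structured logging idea a bit more concrete, here is a sketch of what one JSON log record describing a link might look like. The field names (`kind`, `source`, `target`, `relation`, `attrs`) and the identifier format are my own assumptions, not an existing schema, and Python is used only to keep the example short:

```python
import json
from datetime import datetime, timezone


def link_event(source, target, relation, **attrs):
    """Build one log record describing a directed link between two entities."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "kind": "link",          # as opposed to, say, a "node" record
        "source": source,
        "target": target,
        "relation": relation,
        "attrs": attrs,          # free-form attributes of the link
    }


# Hypothetical identifiers for two processes.
record = link_event("pid:0.123.0", "pid:0.456.0", "sends_to", message_tag="ping")
line = json.dumps(record, sort_keys=True)  # one record per log line
print(line)
```

A downstream consumer could read such lines back with any JSON parser and load them into whatever store is handling the graph.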

Extensibility, etc.

Assuming that the collected information is encoded as message data, Elixir (or other) processes can be used to produce any desired functionality. This (waves hands a bit…) pretty much covers the issues of extensibility, flexibility, and generality.

Organization

Although OpenTelemetry et al. support time-based storage and retrieval of data, they aren’t well suited to graph-based queries. A relational database such as PostgreSQL can be used for this, but it may not be convenient, let alone performant.
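To illustrate the convenience issue, here is a sketch of a "find all descendants" query against an edge table, using a recursive common table expression. It uses SQLite (via Python's standard library) as a stand-in so the example is self-contained; PostgreSQL also supports `WITH RECURSIVE`. The table layout and names are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [
        ("sup", "worker_a"),
        ("sup", "sub_sup"),
        ("sub_sup", "worker_b"),
        ("other", "worker_c"),
    ],
)

# Everything reachable from a starting node, at any depth.
# UNION (rather than UNION ALL) drops duplicates, so cycles terminate.
QUERY = """
WITH RECURSIVE reachable(name) AS (
    SELECT dst FROM edges WHERE src = ?
    UNION
    SELECT e.dst FROM edges AS e JOIN reachable AS r ON e.src = r.name
)
SELECT name FROM reachable ORDER BY name
"""

descendants = [row[0] for row in conn.execute(QUERY, ("sup",))]
print(descendants)
```

This works, but every traversal pattern needs its own hand-written recursion, which is the sort of thing a graph query language expresses directly.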

What we really want (IMHO) is a graph database. Lacking a BEAM-based option, my preference would be to use open source offerings such as Neo4j and/or ArangoDB. Neo4j is implemented in Java and ArangoDB in C++:

  • Neo4j is a popular and well-supported graph database. Because it uses index-free adjacency (each node record refers directly to its relationships), link traversal amounts to following pointers rather than consulting an index. Inverted indexes also accelerate searching. In short, Neo4j’s query performance is generally very good for traversal-heavy workloads.

  • ArangoDB is a multi-model database which supports graph-based storage and queries. It uses JSON as its base data format, easing interoperability with other programs, but it can also import graph information from Neo4j.

Both of these projects are actively working on ways to take advantage of LLMs, etc. It’s still early days, but there doesn’t seem to be any reason that an LLM could not make good use of information from a graph database. Details at 11…

Seeding the database

Although I’ve concentrated thus far on dynamic information, immutable and/or (largely) static data could also be used to “seed” the database with useful context. This could allow queries to traverse complex sets of relationships. For example, here is a (suitably Sci-Fi) prompt for an LLM chatbot:

Who is sending unhandled messages to process Foo? Include information on the senders (e.g., function calling trees and definitions), as well as the sending processes.

To support this sort of inquiry, foundational definitions (e.g., “functions are defined in modules”) could be harvested from the online Elixir documentation. Web sites such as Hex.pm and HexDocs.pm could provide useful information on libraries. Finally, application and library source code could be processed to obtain compile-time information.
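Here is a toy sketch of the traversal behind that sort of question, with dynamic facts (messages) and static facts (where functions are defined) living in one set of subject-relation-object triples. All of the data and relation names are invented, and a real system would presumably run a graph database query rather than Python list scans:

```python
# Hypothetical facts, stored as (subject, relation, object) triples.
TRIPLES = [
    ("msg1", "sent_to", "Foo"),
    ("msg1", "status", "unhandled"),
    ("msg1", "sent_by_process", "Bar"),
    ("msg1", "sent_by_function", "Bar.Worker.notify/1"),
    ("msg2", "sent_to", "Foo"),
    ("msg2", "status", "handled"),
    ("msg2", "sent_by_process", "Baz"),
    ("msg2", "sent_by_function", "Baz.ping/0"),
    # Static "seed" facts, e.g. harvested from source code.
    ("Bar.Worker.notify/1", "defined_in", "Bar.Worker"),
    ("Baz.ping/0", "defined_in", "Baz"),
]


def objects(subject, relation):
    """Follow a relation forward from a subject."""
    return [o for s, r, o in TRIPLES if s == subject and r == relation]


def subjects(relation, obj):
    """Follow a relation backward from an object."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]


def unhandled_senders(process):
    """List who sent unhandled messages to `process`, with code context."""
    report = []
    for msg in subjects("sent_to", process):
        if "unhandled" not in objects(msg, "status"):
            continue
        for func in objects(msg, "sent_by_function"):
            report.append({
                "message": msg,
                "process": objects(msg, "sent_by_process"),
                "function": func,
                "module": objects(func, "defined_in"),
            })
    return report


print(unhandled_senders("Foo"))
```

The point is that the answer crosses from runtime data into compile-time data in a single walk, which is what seeding the database makes possible.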

Final Thoughts

In summary, I’m suggesting that ways be found to harvest, organize, and present graph-based information on Elixir (etc.) systems, using mostly existing tooling and standards. Does anyone else find this notion attractive? (Ducks…)

-r

You mention the lack of a graph database on the BEAM. Just to note that a few days ago on the forum someone posted a NIF wrapper for the Rust client API of KuzuDB.

Admittedly KuzuDB itself is written in C#. But it might be a good alternative to Neo4j since it also uses a directed property graph model, is (truly) open source, is embeddable, and has adopted Cypher as its query language.

I haven’t yet played with KuzuDB but, judging by the developers’ recent videos on YouTube, Rust is being treated as a first-class client API, and they’ve recently added in-memory graph databases.

Thanks for the pointer! KuzuDB (aka Kùzu) looks very interesting. I particularly like the possibility of embedding it as a NIF, but its emphasis on performance and interoperability is also attractive.

That said, I have a couple of clarifications/corrections to offer:

  • KuzuDB is primarily built in C++20, with assistance from Python (>= 3.9) and a few other languages.

  • Neo4j offers several licensing options: free open source, free commercial (for individuals, startups and academics) and paid commercial. Neo4j Community Edition (CE) is licensed under the GNU General Public License (GPL) v3.

-r

I think the 5.8% of the codebase that is Python is the Python API code in the tools directory. Likewise for the 1.8% that is Rust. And maybe I’m misinterpreting the implications of it being an embedded database from the following:

What is an Embedded Database: “An embedded database is an in-process database management system that’s tightly integrated with the application layer. The term “in-process” is important because the database compute runs within the same underlying process as the application (which could be written in any language, like Python, R, JavaScript, C++).”

I’ve been using Neo4j for 15 years, and it changed its open source licensing for v3.4 and v3.5 in 2018. Key features like hot backups and the recently-added Change Data Capture feed are Enterprise only. It’s worth a read of the Neo4j, Inc. v. Purethink, LLC lawsuit over a fork of Neo4j to get a sense of otherwise secretive enterprise licensing costs, and in the final paragraph, what is interpreted as permissible under its open source licensing.

It appears that the main (C++) portion of KuzuDB can be embedded in (i.e., linked into) another C/C++ application. Using a NIF allows this to be done in the context of the BEAM. That is, a copy of KuzuDB would be linked into a BEAM node (i.e., a running instance of the Erlang runtime system).

I suspect that KuzuDB’s processing and memory needs might be quite substantial. Also, some interface code would be needed to use any graph database in the sort of facility we’re discussing. So, I’d probably want to bundle the KuzuDB NIF and interface code into their own node.

My suspicion regarding the Neo4j licensing snafu is that it wouldn’t be a problem for this (relatively small and undemanding) use case. Still, the fact that Neo4j and KuzuDB both use Cypher might allow some flexibility.

In any event, we should probably climb back out of these implementation details and find out if anyone wants to comment on the general notion…

-r

FWIW, the thread Future of Logger in Elixir touches on various related topics, including Logger, LoggerBackends, :logger, structured logging, and telemetry. My takeaway from this is that there are lots of ways to harvest dynamic information from Elixir systems.

I’d also like to expand on the notion of using web technology as a way to present graph-based information. Let’s assume that we have a graph database containing all sorts of information on a running system. It would be pretty simple for a Phoenix subsystem to dynamically generate a page for each entity in the database.

Relationships could be mapped into HTML links, allowing easy navigation. Finer details (e.g., attributes) could be displayed using lists, tables, etc. And, as noted above, all of this could be done using Semantic HTML.
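A sketch of the page-per-entity idea follows. Python stands in for what would be a Phoenix template, and the entity data, the `/entities/...` URL scheme, and the `entity_page` helper are all hypothetical:

```python
from html import escape
from urllib.parse import quote

# Hypothetical entities (with attributes) and typed links between them.
ENTITIES = {
    "sup": {"type": "supervisor", "strategy": "one_for_one"},
    "worker_a": {"type": "worker", "state": "running"},
}
LINKS = [("sup", "supervises", "worker_a")]


def entity_page(entity_id):
    """Render one entity: attributes as a <dl>, relationships as links."""
    attrs = ENTITIES[entity_id]
    rows = "".join(
        f"<dt>{escape(key)}</dt><dd>{escape(str(value))}</dd>"
        for key, value in attrs.items()
    )
    outgoing = "".join(
        f'<li>{escape(rel)}: '
        f'<a href="/entities/{quote(dst, safe="")}">{escape(dst)}</a></li>'
        for src, rel, dst in LINKS
        if src == entity_id
    )
    return (
        f"<article><h1>{escape(entity_id)}</h1>"
        f"<dl>{rows}</dl>"
        f'<nav aria-label="Relationships"><ul>{outgoing}</ul></nav>'
        "</article>"
    )


print(entity_page("sup"))
```

Incoming links could be listed the same way, so that every relationship can be followed in both directions.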

Because most users (both human and LLM) already know how to navigate and peruse web pages, this sort of presentation could be immediately useful. And, as web and LLM technologies continue to develop, this approach could take advantage of any improvements.

I think this is a poor form of representation. In my mind, the only way to represent runtime systems is the way it is done currently: with charts that show telemetry over time, or via a simulation. I’ve been a big fan of Cisco Packet Tracer since university; they managed to represent some infinitely complex networking concepts in a way that even a kid could understand. We could learn a lot from that approach.

Graphical representations such as line plots can work very well for summarizing numerical information (e.g., comparing dynamic quantities over time). So, I have no problem with including them in dashboards, etc. However, they also have some limitations.

As a sighted individual, I find graphical representations to be very convenient. Unfortunately, they aren’t very accessible for blind users, LLMs, etc. So, it’s nice to have alternatives that everyone can use.

Another limitation has to do with representing relationships between entities in a (possibly directed) graph. That’s why Observer’s Applications tab uses network diagrams to show connectivity. And, because large diagrams can be hard to interpret, it allows subtrees to be collapsed.

Getting back to the question of using sets of web pages to represent large directed graphs, please consider Wikipedia. It has about seven million content pages (and about sixty million pages altogether). Each of these pages tends to contain dozens of links. Nonetheless, many users find Wikipedia easy to explore.

That said, I’d like to see more use of network diagrams to show the local neighborhoods of Wikipedia (and some other) web pages. For example, a diagram could show all of the first and second-order neighbors of a page, including all their interconnecting links. But I digress…
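For the curious, computing such a neighborhood is straightforward. This sketch (made-up page names; links are treated as undirected when finding neighbors) collects everything within two hops of a starting page, then keeps every link whose endpoints both fall inside that set:

```python
from collections import defaultdict

# Hypothetical links between pages.
LINKS = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E"), ("E", "F")]


def neighborhood(links, start, max_hops=2):
    """Return (nodes within max_hops of start, links among those nodes)."""
    neighbors = defaultdict(set)
    for src, dst in links:
        neighbors[src].add(dst)
        neighbors[dst].add(src)  # ignore direction when finding neighbors

    seen = {start}
    frontier = {start}
    for _ in range(max_hops):
        # Expand one hop outward, skipping nodes already visited.
        frontier = {n for node in frontier for n in neighbors[node]} - seen
        seen |= frontier

    # Keep all interconnecting links, not just those on a path from start.
    kept = [(src, dst) for src, dst in links if src in seen and dst in seen]
    return seen, kept


nodes, edges = neighborhood(LINKS, "A")
print(sorted(nodes), edges)
```

The kept links could then be passed to a diagramming tool to draw the local picture.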

Indeed, I think we are discussing different things. Making this information LLM-friendly is the least of my concerns.

My interest is in delivering a solution oriented exclusively around human psychology and the senses we use for learning, be that sound, images or text. IMO we currently lean too heavily on textual representations in the software development industry, and this greatly limits the way we can perceive, absorb and understand information.

I’d certainly agree that text-based representation is dominant and that other possibilities should be tried out. FWIW, I’ve used DOT extensively for creating network diagrams (including animations :-}) and wish Mermaid was as powerful. I also admire and have played around with D3.

Bret Victor has done some very intriguing presentations on data exploration, mostly using highly interactive computer graphics. He is also interested in letting folks use physical manipulation as a tool for reasoning. For example, he talks about how Drs. Watson and Crick used physical models while trying to understand the structure of DNA.

Have you looked at Phoenix LiveDashboard? It’s modular, web-based, live, and more.

As I wrote: “The Elixir ecosystem has many tools for supporting observability: mostly, they collect and/or display data about running systems.” Phoenix LiveDashboard is certainly one of these tools and I’d expect it to play a large role in any graph-aware upgrade to the ecosystem.