tl;dr - Some Partly-Baked Ideas on improving observability…
The Elixir ecosystem has many tools for supporting observability: mostly, they collect and/or display data about running systems. However, almost none of these tools provide information about the relationships between entities (e.g., applications, nodes, processes) and events (e.g., function calls, message handling, process life cycles) in these systems.
This is rather unfortunate, because (a) data without context can be difficult to interpret and (b) some extremely large and complex systems are being built. So, I’d like to start a discussion of possible ways to improve the situation.
Let’s assume that our high-level goal is to help developers explore and understand the systems they’re working on. Graph-based collection, organization, and presentation of information can serve this goal both directly (e.g., via web-based text and diagrams) and indirectly (e.g., via LLM-based tools).
Erlang’s Observer
Observer is the only tool I am aware of that directly supports collection and presentation of graph-based information about Elixir programs (other suggestions welcome!). Although it has limited capabilities and scope, Observer is clearly a popular and valuable tool; thus, it’s a good starting point for this discussion.
Observer’s Applications tab displays diagrams of application supervisor trees (showing supervisor-worker links) and process trees (showing parent-child links). Each node can display a few details such as its name, ID, and state. Portions of these diagrams can also be selectively collapsed, for brevity. All told, this tab can be very useful for system exploration and understanding.
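For reference, the raw material behind such a diagram is already available programmatically. Here is a minimal sketch (not how Observer itself does it) that walks a supervision tree with `Supervisor.which_children/1` and returns parent-child edges. The `TreeSketch` module and the tiny demo tree are hypothetical, made up for illustration.

```elixir
defmodule TreeSketch do
  # Hypothetical helper: returns {parent_pid, child_pid, type} tuples
  # for the supervision tree rooted at the supervisor pid `sup`.
  def edges(sup) when is_pid(sup) do
    sup
    |> Supervisor.which_children()
    |> Enum.flat_map(fn
      {_id, pid, :supervisor, _mods} when is_pid(pid) ->
        [{sup, pid, :supervisor} | edges(pid)]

      {_id, pid, :worker, _mods} when is_pid(pid) ->
        [{sup, pid, :worker}]

      # Children that are not running (:undefined, :restarting) have no pid.
      _other ->
        []
    end)
  end
end

# Demo tree (made up): one worker plus a nested supervisor with one worker.
inner = %{
  id: :inner,
  start: {Supervisor, :start_link, [[{Agent, fn -> 0 end}], [strategy: :one_for_one]]},
  type: :supervisor
}

{:ok, sup} = Supervisor.start_link([{Agent, fn -> 0 end}, inner], strategy: :one_for_one)

edges = TreeSketch.edges(sup)
IO.inspect(edges, label: "edges")

Supervisor.stop(sup)
```

An edge list like this is a convenient common denominator: it can be rendered as text, turned into a diagram, or loaded into a database.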
Limitations
As noted above, however, the tab has a number of limitations. Here are some obvious ones:
- accessibility - doesn’t play nicely with screen readers
- connectivity - doesn’t support data import or export
- extensibility - no way to add/change functionality
- flexibility - few ways to modify the output format
- generality - only displays a few types of information
- organization - no tools for organizing, searching, etc.
Speculation
In an ideal world, all of these (and other!) limitations would be addressed. However, let’s start with speculation about near-term possibilities, based largely on existing standards and technology.
Accessibility
Observer’s GUI is implemented via the wxWidgets library. So, it isn’t accessible to blind users (e.g., using a screen reader). An LLM might be able to ingest the displayed content, but this would require some heavy lifting: image analysis, optical character recognition, etc.
Fortunately, there are various ways to create accessible and navigable renditions of graph-based information. The most obvious approach is to use HTML et al. For example, graphs can be represented using descriptive text, adjacency and/or edge lists, tables, and/or trees. All of these can be displayed using Semantic HTML, which provides useful hints for screen readers.
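As a rough illustration of the tree option, here is a sketch that renders a tree as nested `<ul>`/`<li>` elements, which screen readers can announce as nested lists. The `HtmlTreeSketch` module, the `{label, children}` tuple shape, and the `MyApp.*` names are all assumptions made up for this example.

```elixir
defmodule HtmlTreeSketch do
  # Hypothetical helper: renders a {label, children} tree as nested lists.
  def render({label, children}) do
    "<ul>" <> item({label, children}) <> "</ul>"
  end

  defp item({label, []}), do: "<li>" <> escape(label) <> "</li>"

  defp item({label, children}) do
    "<li>" <> escape(label) <> "<ul>" <> Enum.map_join(children, &item/1) <> "</ul></li>"
  end

  # Minimal escaping so that labels cannot break the markup.
  defp escape(text) do
    text
    |> String.replace("&", "&amp;")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
  end
end

# Made-up tree, loosely resembling an application's supervision tree.
tree =
  {"MyApp.Supervisor",
   [
     {"MyApp.Repo", []},
     {"MyApp.Endpoint", [{"MyApp.PubSub", []}]}
   ]}

html = HtmlTreeSketch.render(tree)
IO.puts(html)
```

A real page would presumably add headings, links, and ARIA attributes as needed; the point is only that the structure of the graph survives in the markup.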
Conveniently, the textual content and semantic markup which help screen readers to function can also be harvested and interpreted by LLMs. (Indeed, web pages are already a major source of LLM training data.) This offers the possibility of using an LLM chatbot to access system information, data, etc. Both blind and sighted users could benefit from this approach.
It should also be possible to generate accessible diagrams (e.g., via the DOT graph description language and Accessible SVG). Of course, getting an LLM to ingest these diagrams could be tricky…
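Generating the DOT side of this is straightforward. The sketch below turns an edge list into a DOT `digraph`; the `DotSketch` module and the vertex names are hypothetical, and the code assumes labels contain no double quotes. Rendering to SVG (e.g., with Graphviz’s `dot -Tsvg`) and adding accessibility metadata would be separate steps, not shown here.

```elixir
defmodule DotSketch do
  # Hypothetical helper: turns {from, to} edges into a DOT digraph.
  # Assumes labels contain no double quotes (no escaping is done).
  def to_dot(edges, name \\ "processes") do
    body = Enum.map_join(edges, "\n", fn {from, to} -> ~s(  "#{from}" -> "#{to}";) end)
    "digraph #{name} {\n" <> body <> "\n}\n"
  end
end

dot = DotSketch.to_dot([{"sup", "a"}, {"sup", "b"}])
IO.puts(dot)
```

Because DOT is plain text, the same edge list that feeds a diagram can also be handed to a text-based consumer directly.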
Connectivity
System information can be collected from various sources, using existing logging and tracing facilities (e.g., Lager, OpenTelemetry, :trace). Although these tools aren’t specifically designed to handle information on attributes and relationships, a structured logging approach (e.g., using JSON, JSON-LD, or Turtle) could be used to encode information on nodes, links, etc.
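To make this a bit more concrete, here is a small sketch that uses `:erlang.trace/3` to capture a single message send and reshape the trace event into an "edge" record. The field names (`type`, `kind`, `from`, `to`, `message`) are my own invention, not any standard schema, and a real collector would need a dedicated tracer process rather than the calling process.

```elixir
# A throwaway process that answers one ping and then exits.
target =
  spawn(fn ->
    receive do
      {:ping, from} -> send(from, :pong)
    end
  end)

# Ask the VM to report every message sent by `target` to this process.
:erlang.trace(target, true, [:send])
send(target, {:ping, self()})

receive do
  :pong -> :ok
end

# Reshape the trace event into a map with a hypothetical edge schema.
edge =
  receive do
    {:trace, ^target, :send, message, to} ->
      %{
        type: "edge",
        kind: "send",
        from: inspect(target),
        to: inspect(to),
        message: inspect(message)
      }
  after
    1_000 -> nil
  end

IO.inspect(edge, label: "edge")
```

A map like this could then be serialized by any JSON encoder (recent Elixir releases ship a `JSON` module) and written out as one structured log line per edge.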
Extensibility, etc.
Assuming that the collected information is encoded as message data, Elixir (or other) processes can be used to produce any desired functionality. This (waves hands a bit…) pretty much covers the issues of extensibility, flexibility, and generality.
Organization
Although OpenTelemetry et al. support time-based storage and retrieval of data, they aren’t well suited to supporting graph-based queries. A relational database such as PostgreSQL can be used for this (e.g., via recursive queries), but it may not be convenient, let alone performant.
What we really want (IMHO) is a graph database. Lacking a BEAM-based option, my preference would be to use open source offerings such as Neo4j and/or ArangoDB. Neo4j is implemented in Java, while ArangoDB is written largely in C++:
- Neo4j is a popular and well-supported graph database. Because each node is stored with direct references to its relationships (index-free adjacency), link traversal amounts to following a pointer rather than performing an index lookup. Indexes also accelerate searching for the starting nodes of a query. In short, Neo4j is designed for fast traversal queries.
- ArangoDB is a multi-model database which supports graph-based storage and queries. It uses JSON as its base data format, easing interoperability with other programs, but it can also import graph information from Neo4j.
Both of these projects are actively working on ways to take advantage of LLMs, etc. It’s still early days, but there doesn’t seem to be any reason that an LLM could not make good use of information from a graph database. Details at 11…
Seeding the database
Although I’ve concentrated thus far on dynamic information, immutable and/or (largely) static data could also be used to “seed” the database with useful context. This could allow queries to traverse complex sets of relationships. For example, here is a (suitably Sci-Fi) prompt for an LLM chatbot:
Who is sending unhandled messages to process Foo? Include information on the senders (e.g., function calling trees and definitions), as well as the sending processes.
To support this sort of inquiry, foundational definitions (e.g., “functions are defined in modules”) could be harvested from the online Elixir documentation. Web sites such as Hex.pm and HexDocs.pm could provide useful information on libraries. Finally, application and library source code could be processed to obtain compile-time information.
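To give a feel for the kind of traversal behind the sample prompt above, here is a toy sketch using Erlang’s in-memory `:digraph` module. It is only a stand-in for a real graph database, and the vertices, the edge labels (`:sends_to`, `:runs_in`), and the `Jobs.notify/1` function are all made up for illustration.

```elixir
graph = :digraph.new()

# Vertices: a few processes plus one function (all hypothetical).
vertices = [:foo, :worker_a, :worker_b, :logger, {:function, "Jobs.notify/1"}]
Enum.each(vertices, &:digraph.add_vertex(graph, &1))

# Edges carry a label naming the relationship.
edges = [
  {:worker_a, :foo, :sends_to},
  {:worker_b, :foo, :sends_to},
  {:foo, :logger, :sends_to},
  {{:function, "Jobs.notify/1"}, :worker_a, :runs_in}
]

Enum.each(edges, fn {from, to, label} -> :digraph.add_edge(graph, from, to, label) end)

# "Who is sending messages to foo?": incoming edges labelled :sends_to.
senders =
  graph
  |> :digraph.in_edges(:foo)
  |> Enum.map(&:digraph.edge(graph, &1))
  |> Enum.flat_map(fn
    {_edge, from, _to, :sends_to} -> [from]
    _other -> []
  end)
  |> Enum.sort()

# Everything upstream of foo, via any relationship (foo itself excluded).
upstream = Enum.sort(:digraph_utils.reaching([:foo], graph) -- [:foo])

IO.inspect(senders, label: "senders")
IO.inspect(upstream, label: "upstream")

:digraph.delete(graph)
```

The second query is where seeded static data would pay off: the function vertex only shows up because a (hypothetical) compile-time fact links it to a sending process.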
Final Thoughts
In summary, I’m suggesting that ways be found to harvest, organize, and present graph-based information on Elixir (etc.) systems, using mostly existing tooling and standards. Does anyone else find this notion attractive? (Ducks…)
-r