Preferred data representation(s) for property graphs?

I have a (very WIP) project that will be making some graph-structured data generally available. I’d really like it to be accessible and convenient for AI systems (e.g., LLMs) and their developers. So, I’d like to know whether there are any preferred data representation(s) for property graphs, including pitfalls, etc.

Background

Internally, the data is maintained in an Elixirish version of GraphQL. This (WIP) approach retains GraphQL’s ability to let clients dynamically request subsets of a published data set, but in a BEAM-friendly manner (Actors, messages, terms).

So, the native data structures are dynamic (i.e., run-time friendly) Elixir terms: nested maps, lists, and scalars, but few atoms or structs. A GraphQL front end is an obvious addition, because the semantics are extremely similar. However, this is sort of a “raw” interface.

The next-level structure is a property graph (i.e., entities and relationships, both with attached attributes). Backed by a property graph database (think Gremlex or Neo4j), this layer provides access, navigation, organization, processing, and storage.
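For concreteness, here is a minimal sketch of what a property graph might look like as plain, atom-light Elixir terms. All of the names and keys below are illustrative, not from the project:

```elixir
# Hypothetical shape: nodes and relationships, each with attached attributes,
# held in nested maps and lists with string keys (no structs, few atoms).
graph = %{
  "nodes" => [
    %{"id" => "n1", "labels" => ["Package"], "props" => %{"name" => "nx"}},
    %{"id" => "n2", "labels" => ["Package"], "props" => %{"name" => "explorer"}}
  ],
  "edges" => [
    %{"from" => "n2", "to" => "n1", "type" => "DEPENDS_ON", "props" => %{}}
  ]
}

# Dynamic access works directly on such terms, no structs required:
get_in(graph, ["edges", Access.all(), "type"])
# => ["DEPENDS_ON"]
```

Because everything is string-keyed maps and lists, the same terms can be built at run time from arbitrary client requests.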

Although a property graph can be expressed in terms of maps and such, the underlying data structures can also be queried and manipulated through graph query languages such as Cypher.

In summary, there will be a way to generate almost any desired concrete data representation. Which brings us back to the original question: what data format(s) would AI code and coders find the most palatable? (ducks)

-r

I think it would be nice if we could have an easy way to convert to/from Explorer DataFrames and Nx.Tensors. Those are the two main ML data structures we have in the Nx ecosystem and having a way to leverage them would help bridge the gap from your library to the rest of the ecosystem.

This doesn’t mean they should necessarily be the main output format, but conversion functions would be nice.

AFAICT, the Explorer module handles both series and dataframe representations; that is, single lists and sets of parallel lists:

We have two ways to represent data with Explorer:

  • using a series, that is similar to a list, but is guaranteed to contain items of one data type only - or one dtype for short. Notice that nil values are permitted in series of any dtype.

  • using a dataframe, that is just a way to represent one or more series together, and work with them as a whole. The only restriction is that all the series share the same size.
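A quick sketch of those two representations, assuming `{:explorer, "~> 0.8"}` (the column names here are made up):

```elixir
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.{Series, DataFrame}

# A series: items of one dtype only; nil is permitted in any dtype.
s = Series.from_list([1, nil, 3])

# A dataframe: one or more equal-length series, worked with as a whole.
df = DataFrame.new(%{id: ["n1", "n2"], name: ["nx", "explorer"]})

Series.size(s)       # => 3
DataFrame.shape(df)  # => {2, 2}
```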

Tensors, in contrast, seem like arrays with some added metadata:

Nx.Tensor is a generic container for multidimensional data structures. It contains the tensor type, shape, and names.
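For reference, a minimal example of that metadata in action (the axis names are arbitrary):

```elixir
Mix.install([{:nx, "~> 0.7"}])

# A tensor: multidimensional data plus its type, shape, and (optional) axis names.
t = Nx.tensor([[1, 2, 3], [4, 5, 6]], names: [:rows, :cols])

Nx.shape(t)  # => {2, 3}
Nx.names(t)  # => [:rows, :cols]
```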

Unfortunately, neither of these representations is much like the trees of maps that GraphQL and my own hack use. So, I’m not at all sure how to:

  • capture the tree’s hierarchical structure
  • specify the desired data extraction, etc.

Clues and suggestions would be most helpful; ELI5…

-r

If you’re dealing with nested data, maybe just the leaves could be converted to Explorer or Nx representations. Explorer seems a bit more versatile in this sense. Assuming a nesting of maps, it would be as easy as some combination of get_in(data, [key1, Access.all(), key2, ...]) |> Explorer.DataFrame.concat_rows or something along those lines to un-nest and obtain something easily usable.

That is, your data structure would still carry the hierarchical information, and as soon as the user is ready they can extract the hierarchy leaves into something compact for processing.
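To make that concrete, here is a sketch over invented data (the "packages" key and fields are hypothetical): pull one parallel leaf list per column out of the tree with get_in/Access.all(), then zip them into a flat DataFrame.

```elixir
Mix.install([{:explorer, "~> 0.8"}])

data = %{
  "packages" => [
    %{"name" => "nx", "downloads" => 100},
    %{"name" => "explorer", "downloads" => 50}
  ]
}

# Extract the leaves, one access path per column...
names = get_in(data, ["packages", Access.all(), "name"])
downloads = get_in(data, ["packages", Access.all(), "downloads"])

# ...and assemble them into something flat and easily processed.
df = Explorer.DataFrame.new(%{name: names, downloads: downloads})
Explorer.DataFrame.shape(df)  # => {2, 2}
```

The hierarchy stays intact in the original term; the DataFrame is just a flat view extracted on demand.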

That sounds quite promising for the GraphQL-ish API. However, it may not be relevant to the task of getting from a graph database to the outside world. Specifically, Neo4j (or whatever) will have its own ways to export data, so the problem may come down to selecting from the available offerings. That said, it strikes me that we may be missing something.

Naive speculation…

LLMs are already doing a lot of handling of semi-structured data. I suspect that most of this is in the form of HTML, with the rest being in PDF et al. So, they have all sorts of markup (generally semantic) to digest.

Also, most web sites contain internal links that form a directed graph. Some of these (e.g., Hex.pm, HexDocs.pm, Wikidata/Wikipedia, large commercial sites) have substantial and well-manicured graphs.

These graphs contain a lot of relationship information, which I presume LLMs can exploit to at least some degree (clues welcome…). For example, when they digest Wikipedia (and presumably Wikidata), how much advantage do they take of the links?

Anyway, I wonder whether automagically generated HTML, via LiveView, might not be easily digestible by both humans and LLMs. Nestled in these pages, or perhaps available for “download”, could be all sorts of tensors, etc. Does this make sense?

-r