Trees of data in Elixir

What is the idomatic way to process tree data structures in Elixir?

I have a database table that holds a ton of rows which define a hierarchy by an attribute called parent_id.

Solution 1: Huge Map
A Tree module that transforms the rows into a map. For each row, it contains the mapping froim the id attribute of each row to something like %{item: row, children: [children_ids...], parent: parent_id}. Also, a list of root_ids is stored, for rows without a parent_id. So, the structure returned looks like:

%{ nodes: %{
     1 => %{item: %{...}, children: [2, 3], parent: nil},
     2 => %{item: %{...}, children: [], parent: 1},  
     3 => %{item: %{...}, children: [4], parent: 1},
     ...,
   roots: [1]}

The module has functions to get the root nodes and to fetch nodes by their id from the map. The same way, it fetches the children for a given node.

Solution 2: Using Agents
Instead of a huge map, I tried to use an Agent for every node in the tree. Its a process holding quite the same data as above, but it is easier to set up the structure from the initial list of rows.

 %{ roots: [#PID<0.122.0>] }

and an Agents with their state like

%{item: %{}, children: [PID#<0.123.0>, PID#<0.124.0>], parent: nil} # PID#<0.122.0>
%{item: %{}, children: [], parent: PID#<0.122.0>} # PID#<0.123.0>

I can even map a function over a tree to transform all the nodes while keeping the structure. But here I can shoot in my foot: If I just map a function f that transforms the agents item state, I get the behaviour of a mutable data structure.
So I did not transform the state of the agent, but created a new one with the transformed item and the parent and children properly set up.

3 Likes

You could use the erlangā€™s digraph module, which uses an ETS table as the underlying storage. An acyclic digraph is the same as a ā€œtreeā€.

Perhaps I should mention that I want to transform the data in the tree along its structure, like this:

Tree.map(tree, fn item, %{children: children_tree_sizes} -> 
    Enum.sum(children_tree_sizes) + 1
end)

to transform the tree node data to hold their tree sizes in this example. For my task, it is essential to map over a tree with acces to both the current data of a node and the transformed data of its children (bottum-up calculations).

Is it a directed acyclic graph or a tree? That is, does any node ever have more than one parent? If itā€™s a true tree then my recommendation is to just have a big nested structure of maps, it ends up working pretty well.

If itā€™s actually a DAG then something like :digraph is likely to work best. Going the agent route would be a great example of something that http://theerlangelist.com/article/spawn_or_not argues against.

3 Likes

Yes, it is a tree.

What SaÅ”a Jurić describes is someting that came to my mind lately. Elixir is functional by design, and using agents I could easily reinvent OOP by encapsulating state, only with objects come to life as a process.

What I think about now is, if I use a single big nested map, is there a way to use parallelism? Using the whole map, I have to use Enum.reduce to transform the children of a node. If I use agents instead, I can transform the children by using Enum.map in parallel, for each one has its own data seperated. I just have to keep in mind to return a new agent for each and donā€™t reuse the existing one.

To be completely honest Iā€™ve tried this approach myself way back when, in my Erlang days. Thatā€™s why Iā€™m confident itā€™s not a good approach :smiley:

Itā€™s hard to say, because the problem is vaguely described, but I think concurrent processing should be possible with a flat huge map (though canā€™t say if itā€™s worth it for your case). However, if you need to process from bottom to top, it might be better to organize the map as child to parent relationships and the list of leaves.

You also suggest that the amount of data is huge. That might lead to a large active heap, and cause some larger GC pauses in the owner process, as I briefly discussed here. If thatā€™s the case, :digraph might be a good fit, because itā€™s based on ETS. Or otherwise, your hand-rolled ETS table. This might also simplify concurrent processing.

I definitely donā€™t advise agents (or processes). Itā€™s going to be 2kb overhead per node (and you suggest there are a lot of them). Itā€™s also going to involve a lot of extra message passing and scheduler overhead. With some careful work, you should be able to achieve whatever you want without agents. A large map or a nested data structure would be functional solutions, while ETS is an optimization for the cases when you have a frequently changing large active working set.

4 Likes

Actually, on second thought Iā€™m not convinced about this part of my answer, so please disregard this sentence :slight_smile:

Iā€™m not sure if everyone else is over-thinking it or if Iā€™m just lazy, but I would in no way try to build up a graph there for a few reasons:

First, no cycles are possible in a baked structure in an immutable language, this is a ā€˜goodā€™ thing.

Second, you are pulling them from a database and everything has an id key as well as a parent_id key, what happens if you have a cycle (database will not catch that)?

Iā€™d just grab the things, stuff them into a map with the key being the id then anytime I need the parent I just do all_of_them[this_thing.id] as I need. It is fast, no duplicating of memory, no giant trees in memory, cycles are represented (your job to break out of cycles if so), etcā€¦ etcā€¦

2 Likes

The way your database is set up, there is nothing preventing cycles from happening. Also, when you want to query all nodes back up the tree, or the subtree starting at a certain node, you will require N database queries (I think? Maybe there is some archaic black magic to follow all the parent IDs in SQL directly, but it isnā€™t going to be pretty).

What I have seen other libraries, most notably the Ruby gem ā€˜Ancestryā€™, do, is to store a list of IDs of the path back up the tree. This way, cycles are prevented, and you can also select a subtree at once.

As for how to represent them in memory: Hex.PM contains Arbor, zipper_tree, Garph, and quite possibly some more packages that can help with this (disclaimer: I have not inspected them in detail). The basic idea of a ā€˜rose treeā€™ like the one you are attempting to build, is to have a node, which has a list of children (each of these being a node again). In Haskell parlance:

data Tree a = Node a [Tree a]
In Elixir, this would be something like

defmodule Tree do
 defstruct val: nil, children: []
end

Traversing these kind of trees is simple: Given a node, do something with the value, and then call it recursively for all the children (this is pre-order traversal, post-order traversal would be to first go to the children, and then the current node). An implementation of this can be seen in Macro.prewalk and Macro.postwalk: Yes, quoted elixir code is also a tree: an Abstract Syntax Tree (Although it is of course not wrapped in a stuct).

3 Likes