Which data structure would you use to build a suffix tree in Elixir?

I am interested in implementing suffix trees in Elixir, with the primary goal of building a data structure that will allow fast autocomplete/autosuggest in LiveView. I see that there is an implementation of tries on Hex, but I think it’s important that the user not be required to match the string from the beginning. Given the string “Add New Account,” I would like the user to receive the suggestion for any substring (for example, “new,” or “acc”).

I have been reviewing this excellent article on using Ukkonen’s method to construct a suffix tree. There are some code examples on Rosetta Code, and I also found this article and this book chapter helpful.

At first, I thought it would be as simple as nesting linked lists. I had the idea that the algorithm could simply traverse the top level of the list until the beginning of the entered text was matched, and then move on to traversing the first nested list until further matches, and so on. It even appeared that the usual leaf marker, $, could be eliminated, because we could simply build up the matched string and return it when we arrived at [head | []].

However, further investigation has made me aware that I dramatically underestimated the complexity of constructing such a tree. Even using tuples to provide additional information ({:node, "substring"} or {:leaf, "substring"}), does not address the fact that Ukkonen’s method of construction requires links between nodes of the tree (meaning, jumping between nodes, rather than traversing a list in order).

Perhaps it isn’t possible to use existing Elixir data structures to create such a suffix tree, as the tree itself is a data structure with unique properties. In any event, I am interested in putting some time into this, but I really need to better frame the problem and the approach first.

The ultimate goal would be a Hex package with an API along these lines:

defmodule SuffixTree do
  @doc """
  Takes a list of strings, sorts them, and constructs a suffix tree.
  """
  def build(list, opts \\ []) do
    #...
  end

  @doc """
  Takes a user-entered substring and returns a list of matches from the given suffix tree.
  The number and sorting of matches are determined by opts.
  """
  def match(tree, substring, opts \\ [])
    #...
  end
end

My impression at this stage is that the building of the tree could be greatly simplified if it were done asynchronously and you didn’t care how long it took. The purpose of all the links between nodes in Ukkonen’s method appears to be a reduction in the time it takes to build the tree, from O(m2) to O(m) for a string of length m. Perhaps a full implementation of Ukkonen isn’t needed for smaller sets of strings.

How would you approach this? Am I missing some prior art that has already solved this in Elixir? Should I just hit the ElasticSearch docs and forget about this? Thanks in advance for any feedback.

2 Likes

The core idea seems like :digraph from the Erlang stdlib could help, but that’s not likely to get comparable performance to the optimized algorithm.

Consider borrowing a technique from :digraph, however: it uses ETS tables to store vertexes / edges / etc mutably with random access. I used a similar-but-vastly-simpler approach for this Advent of Code problem - my solution uses an ETS table containing three-element tuples {marble, previous_marble, next_marble} to simulate a doubly-linked circular list.

1 Like

Thanks! I will checkout :digraph. Bit by bit, I am getting closer to the heart of this. I found an implementation by Danny Yoo in Racket (the actual code is here), which is starting to look a bit more complete (despite my ignorance of Racket itself). In the references, there is a book by Dan Gusfeld that appears to be the best available reference on the matter.

Gusfeld’s explanation of the algorithm is probably the most detailed one out there, and Yoo’s implementation is enough to convince me that this should be possible in Elixir. I need to take the time to fully grasp Yoo’s code, but it seems like the apparent branching architecture of the tree can be simulated by using labels that indicate the parent and children of each node. Walking the tree is then possible by referencing these labels.

There’s also a pure elixir digraph implementation - https://github.com/bitwalker/libgraph - performant and doesn’t use ETS

1 Like

Awesome! Thanks for this.

did you manage to write suffix tree construction in elixir and do you mind sharing? :slight_smile:

Oh gosh, blast from the past.

The short answer is no, I never finished it. There are 2 main reasons:

  1. I migrated to using Rust primarily.
  2. The available implementations in Rust made it clear to me that performance is much better for suffix arrays, and most modern implementations use an optimization like that, rather than copying Ukkonen. For example, Andrew Gallant’s implementation, and specifically the articles referenced in his notes.

You can see the work I had done on the dev branch of this repo. It’s a long way from finished, but I had laid out a strategy (see ukkonen.md and the code comments).

4 Likes