I am interested in implementing suffix trees in Elixir, with the primary goal of building a data structure that will allow fast autocomplete/autosuggest in LiveView. I see that there is an implementation of tries on Hex, but I think it’s important that the user not be required to match the string from the beginning. Given the string “Add New Account,” I would like the user to receive the suggestion for any substring (for example, “new,” or “acc”).
I have been reviewing this excellent article on using Ukkonen’s method to construct a suffix tree. There are some code examples on Rosetta Code, and I also found this article and this book chapter helpful.
At first, I thought it would be as simple as nesting linked lists. I had the idea that the algorithm could simply traverse the top level of the list until the beginning of the entered text was matched, and then move on to traversing the first nested list until further matches, and so on. It even appeared that the usual leaf marker, $
, could be eliminated, because we could simply build up the matched string and return it when we arrived at [head | []]
.
However, further investigation has made me aware that I dramatically underestimated the complexity of constructing such a tree. Even using tuples to provide additional information ({:node, "substring"}
or {:leaf, "substring"}
), does not address the fact that Ukkonen’s method of construction requires links between nodes of the tree (meaning, jumping between nodes, rather than traversing a list in order).
Perhaps it isn’t possible to use existing Elixir data structures to create such a suffix tree, as the tree itself is a data structure with unique properties. In any event, I am interested in putting some time into this, but I really need to better frame the problem and the approach first.
The ultimate goal would be a Hex package with an API along these lines:
defmodule SuffixTree do
@doc """
Takes a list of strings, sorts them, and constructs a suffix tree.
"""
def build(list, opts \\ []) do
#...
end
@doc """
Takes a user-entered substring and returns a list of matches from the given suffix tree.
The number and sorting of matches are determined by opts.
"""
def match(tree, substring, opts \\ [])
#...
end
end
My impression at this stage is that the building of the tree could be greatly simplified if it were done asynchronously and you didn’t care how long it took. The purpose of all the links between nodes in Ukkonen’s method appears to be a reduction in the time it takes to build the tree, from O(m2) to O(m) for a string of length m
. Perhaps a full implementation of Ukkonen isn’t needed for smaller sets of strings.
How would you approach this? Am I missing some prior art that has already solved this in Elixir? Should I just hit the ElasticSearch docs and forget about this? Thanks in advance for any feedback.