Call for criticism: a minimal Trie library

dimitarvp · February 24, 2020, 1:02am

Hey community,
Lately I’ve noticed that my very small Trie library that I used in two hobby projects for live prefix searching and autocompletion is sitting in GitHub gathering dust.

I’d like to see if it can be brought into a state where even a few people would use it.

To that end, I’d like to ask anyone who is willing to clone the repo, run the tests, run mix dialyzer (there’s an error there that I couldn’t decipher and fix yet so advice is welcome) and check the docs, to give me their worst criticism.

A few things I’d be interested to hear:

Is this library giving you a good way to use a quick prefix search (or autocompletion) after you feed the data structure a list of words? (via Trie.put_words)
Do you think my implementation deviates from the official meaning of the Trie data structure?
Are the comments / docs good enough? Are they hard to read? Do they even make sense?
Do you feel the test coverage is good?
Do you have any code criticism? Bad variable names? Too short / too long functions? Bad algorithms? (That was like my 3rd or 4th Elixir coding exercise but please, do not be gentle; I need honest feedback.) Anything at all that you dislike? Please share it.
Do you think me insisting on integrating this data structure with Access is too much? I mean, you can work pretty easily with it without having to use Access and the Kernel functions that make use of it. Just interested in your opinion if that support is adding code weight without much benefit to it.

I don’t claim anything about the coding quality or the scope of this miniature project. This was one of my very first Elixir projects. I’d definitely approach specs and guards much more strictly now and I’m very likely to have chosen different function names but outside of that I am generally satisfied by the little thing. I still use it in those two hobby projects and haven’t had problems with it.

But please, do your worst and rip it apart if you decide to spare the time (the only file is 373 lines long as of the time of writing this post). I’ll be grateful.

Thanks for reading.

ityonemo · February 24, 2020, 1:58am

On a first glance, it’s really nice. My only nitpick is that your t type should probably be __MODULE__ and probably opaque, to reflect your comment about not counting on internals, and correspondingly move it out of @moduledoc and into code comments.

A property test would also make me more confident though it is a well known data structure that it’s easy to verify you’ve done correctly.

axelson · February 24, 2020, 8:22am

I don’t have time to look at the code right now but I was wondering if you could give an example or two of when you would want to use this library instead of the built-in data structures.

aenglisc · February 24, 2020, 9:45am

I don’t really like that public and private functions are intermixed. I personally prefer to have all the public functions and the documentation at the top of the module with the implementation being separate. In fact I like making separate modules for documentation and implementation using delegates, but that might be my personal kind of madness

Update: I ran dialyzer and made a pr that seems to fix the error.

Qqwy · February 24, 2020, 9:50am

One tip I have is to move some of the shorter testing functions to doctests, which will help to improve the documentation. Besides this kind of data structure being very testable with property-based testing, as @ityonemo already mentioned, I think that some more code examples in the documentation would be helpful.

As for the code: There are some places where I would prefer a function call written on one line rather than split over many lines, but I believe the Elixir formatter has some heuristics that do not agree with my opinion here.
One of the things I have a question about is why in (almost?) every place you pattern-match on the struct, you also pattern-match on that children is a map (%__MODULE__{children: %{} = children} ). I think it would be better to make your type opague and just leave those matches out. In many cases you can also just use trie.children rather than pattern-matching on the struct’s fields in the function head at all; that is only really useful if you have different clauses you want to do depending on the contents of children.

I think integrating it with Access is not too much, and probably a good idea, because it means that it is easier to integrate it with existing programs that use the Access protocol.

dimitarvp · February 24, 2020, 10:31am

Imagine loading all tags in your DB in the Trie:

Tag
|> Repo.all(select: [:name])
|> Trie.put_words()

And then when people want to put tag(s) on an object you have an autocomplete controller action that does this:

Trie.search(params["tag"])

…and then put the results in a response (in my case JSON).

That was my original idea and that’s what I am using my own library for.

dimitarvp · February 24, 2020, 10:38am

PR accepted. Thanks!

dimitarvp · February 24, 2020, 10:39am

Agreed, I’ll add a bit more real-world examples similar to what I pasted to you above.

dimitarvp · February 24, 2020, 10:42am

I understand. I have seen at least 4 schools of thought on that topic and I personally optimise for reading the code top-to-bottom: as you are scrolling and inspecting the code, you stumble upon a public function with almost no body that calls a private function; you then proceed to find it just a few coding lines below it.

Your remark is valid and in my case I haven’t done it your way because I don’t feel the code is very big as to warrant such a separation.

dimitarvp · February 24, 2020, 10:43am

@ityonemo / @Qqwy I see no reason not to make property tests indeed. Save for the fact that it usually takes me a few hours.

dimitarvp · February 24, 2020, 10:57am

Good idea. Noted in the GitHub issues.

At certain point I gave up and just wrote code however I wanted and then run mix format. Even if I don’t like some of its choices – like you – I prefer to have the coding style standardised. Bonus points for having standard git diffs as well.

Agreed. This was one of my first Elixir projects and looking at the code now I definitely went overboard with pattern-matching paranoia. Also noted in the GitHub issues.

Agreed.

Going to fix that even if I don’t feel it’s a major problem. Noted in GitHub issues.

That’s what I thought back then and needed the sanity check. Thank you.

lucaong · February 24, 2020, 11:50am

Nice project I am fond of algorithmic code, and really like to delve into these sort of projects.

I have a suggestion based on similar work I did for one of my projects: I maintain an open source client-side full text search engine (JavaScript, not Elixir), that uses a radix tree (similar enough to a trie) under the hood as the index data structure.

I like to keep the API of the data structure very generic, and compose use-case specific code on top of it. In this case, what you could do, is implement the trie to basically expose the same API as a Map (limited to string keys), plus a few functions to expose the prefix search features. You already implement the Access behavior, so you’re on a good track with it.

The prefix search function could then return another trie (the sub-trie at that prefix) instead of a list of words. If the trie implements the Enumerable protocol, it would be easy enough to turn it into a list of entries.

The advantage would be that your Trie could be used for any use-case where it is a good match. Essentially, any case in which one needs a map of strings to values with efficient prefix search.

In my project, where I essentially apply what I just wrote, the basic data structure code is documented here. It’s JS, but it illustrates my point above.

dimitarvp · February 25, 2020, 10:53pm

Thank you! I think it’s pretty small and humble, you flatter me too much.

Interesting feedback. I can see the value in having more user-oriented API being in another module. But I am 50/50 torn on it due to the project being very small at the moment. Would you still do it even with it being ~400 lines of code (minus the tests)? Perhaps it’s worth it to do some function renames so those that only return sub-tries – or in general don’t return strings – are apparent, while those that are meant for consumption (like search) are given a special small section in the docs to point at them? Hm. Not sure.

Furthermore, I have API that returns sub-tries – the Access-enabling functions. The search function internally uses get which returns a sub-trie as well (and then proceeds to extract all textual matches from it). Or you have something else in mind, on top of these?

lucaong · February 26, 2020, 10:06am

Someone once told me: the best compliment you can receive for your library is “this is too small”

I think I would have one small module implementing the Trie with a Map-like interface, and the Access behavior and Enumerable protocol for it. Then, another module wrapping it and exposing the prefix search feature (or the other user-facing features). This way, each module might even get smaller or at least simpler. This is more or less what I did for the B+tree implementation internally powering cubdb.

That said, this is my personal preference, but you’re the master of your library, you choose the goals, and you should do what you think is best

I am saying that because open-source is great, but it can also be lots of under-appreciated work, so I learned that one should always nurture and protect the part of their project that gives them pleasure and motivation. Often times, that means keeping autonomy, and while hearing feedback, ultimately do what you think is best, given your goals.