I have a list of 100 million strings, each string is up to 20 characters long.
I need to check if an arbitrary string (or a small list of strings) exists in this list.
How to do it quickly and efficiently?
Does it make sense to hash all strings and check hashes?
Either hash, but if the set of strings is known beforehand then you can use finite state transducers or tries. Alternatively you can try to find perfect hash function, but it can be hard and expensive. I think that in this case trie would be the easiest solution.
Depending on your source of the list, though, there might be ways that are more efficient (eg. if from DB, ask the DB, if file from disk, stop loading of the file once the item is found, etc).
If though you have to query the list multiple times and have to load it into memory completely, it might make it faster to convert the heystack list to a
MapSet first and query the mapset for membership of the needle.
- Partition the big list for speed. (for ex. by first 1 or 2 letters).
- Bloom filters (can have false positives): https://hex.pm/packages?search=bloom&sort=recent_downloads
Do you also need to find strings that are similar but not exactly equal?
Just wondering how you get a list of 100 million strings. What is the context?
An addition to what I have said already.
If your list is known at compile time, you could create a multihead function from it using some metaprogramming.
This makes use of pattern matching, which again gets optimised into a prefix tree.
~w(a b c Foo Bar) |> Enum.each(fn word -> def in_list?(unquote(word)), do: true end) def in_list?(_), do: false
The API can be seen in the benchmark.
I was writing something similar but with wider API.
For a list this large, a suffix tree has a lot of advantages. I haven’t yet had time to implement it in Elixir, but it’s on my list, and I have been gathering info to work on it. Dan Gusfeld’s Algorithms on Strings, Trees, and Sequences describes the construction in detail, based on methods devised by Ukkonen. It’s a nontrivial thing to implement, but if this is going to be a recurring theme in your work it would be worth it. There are a number of implementations in other languages, but none yet in Elixir.