Trouble making word-finding algorithm more efficient

I’m writing a program that, given a list of letters, returns all words from a word list that can be made from these letters. So far I’ve managed to come up with this:

defmodule WordGame.FindWords do
  @wordlist "priv/static/sowpods.txt"

  @doc """
  Returns a list of words that can be made out of `letters`.
  """
  def run(letters) when is_list(letters) do
    @wordlist
    |> File.stream!()
    |> Stream.map(&String.replace(&1, "\n", ""))
    |> Stream.filter(&possible?(&1, letters))
    |> Enum.to_list()
  end

  @doc """
  Returns `true` or `false` depending on whether `word`
  can be made out of `letters`.
  """
  def possible?(word, letters) when is_bitstring(word) do
    word
    |> String.split("", trim: true)
    |> do_possible?(letters)
  end

  defp do_possible?([], _letters), do: true

  defp do_possible?([letter | others], letters) do
    if letter in letters do
      do_possible?(others, List.delete(letters, letter))
    else
      false
    end
  end
end
It streams the word list, removes the newline character from each line, then filters each word based on the result of possible?/2. This works, but is quite slow, taking around 2.5 seconds for a list of 20 letters; a Python version I wrote earlier takes 1.3 seconds for the same result. Benchfella lists possible?/2 as taking 2.11 µs/op, so I don’t think that’s the bottleneck. I tried calling possible?/2 with Task.async inside the filter function, but that increased the time to about 8 seconds :confused:.

Is there any way to make it more efficient?


First, there is a better way to find the bottleneck. Benchfella is for benchmarking a single function and for discovering regressions when you are trying to squeeze the last bit of performance out of a known bottleneck.

To identify a bottleneck in the first place, there are profiling tools available; one of them is integrated into `mix` and can be invoked with `mix profile.fprof`.

But one thing I see right now: a Set (`MapSet`) might turn out to be faster than a List. Also, try not to split on the empty string; use `String.codepoints/1` or `String.graphemes/1` instead, not because of speed but because of meaning.
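One way to sketch that suggestion (module and function names here are illustrative, not from the original code): since the letter pool can contain duplicates, a plain `MapSet` would lose the counts, so a frequency map may be the closer fit to the original semantics. `Enum.frequencies/1` needs Elixir ≥ 1.10.

```elixir
defmodule LetterCounts do
  # Build a map of grapheme => count once per lookup,
  # e.g. ["a", "l", "e", "m"] becomes %{"a" => 1, "e" => 1, "l" => 1, "m" => 1}.
  def counts(letters) when is_list(letters) do
    Enum.frequencies(letters)
  end

  # A word is possible when no grapheme occurs more often in the word
  # than it does in the letter pool.
  def possible?(word, counts) when is_binary(word) do
    word
    |> String.graphemes()
    |> Enum.frequencies()
    |> Enum.all?(fn {letter, n} -> Map.get(counts, letter, 0) >= n end)
  end
end
```

For example, with `counts = LetterCounts.counts(["a", "l", "e", "m"])`, `LetterCounts.possible?("ale", counts)` is `true`, while `"all"` fails because the pool only has one `"l"`.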


Did you try – instead, to see the difference in the resulting list?

I’ve been playing around with a program with a similar intent. One thing that may help, though I can’t prove it yet, is to divide the word list by letter. In my case, I created a process per (first) letter to make lookups easier. I think my startup time is no better than yours, but hopefully subsequent lookups are fast.
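A single-process sketch of that idea (the process-per-letter part is omitted for brevity; the variable names are hypothetical): group the word list by first grapheme once, then only scan the buckets whose key actually appears in the letter pool, since a word starting with "z" can never be built without a "z".

```elixir
words = ["ale", "lame", "zoo"]

# Group words by their first grapheme so lookups can skip whole buckets.
buckets = Enum.group_by(words, &String.first/1)
# => %{"a" => ["ale"], "l" => ["lame"], "z" => ["zoo"]}

letters = ["a", "l", "e", "m"]

candidates =
  buckets
  |> Map.take(letters)   # keep only buckets we could possibly match
  |> Map.values()
  |> List.flatten()
# "zoo" is never even considered
```

Each surviving candidate would still need the full possible?/2 check; the grouping only prunes the search space.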


I am not sure whether this will be faster in this specific case, but one thing you could do is sort both the letters and each word in the word list.

As letters cannot be re-used, two words that consist of exactly the same letters are guaranteed to have the same sorted letter list.

Your current algorithm loops through all letters in the wordlist word for each letter in your input word. This has time complexity O(nm) (where n is the number of letters in your wordlist word and m is the number of letters in your input word). In practice, assuming that the input word and the word you are comparing it with have about the same length, this is thus O(n²) (quadratic time complexity).

Sorting your input word is something you only need to do once. Then, sorting each of the wordlist words takes O(n log n) time, but with an O(1) comparison afterwards (same lists means same pointers – immutable languages are so nice!).
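A minimal sketch of that exact-anagram comparison (assuming, as this reply does, that all letters in the pool must be used; names are illustrative):

```elixir
# Normalize once per word: a sorted list of graphemes.
normalize = fn word -> word |> String.graphemes() |> Enum.sort() end

# Sort the input pool once up front.
letters = Enum.sort(["m", "a", "l", "e"])

# Two words use exactly the same letters iff their normal forms are equal.
normalize.("lame") == letters   # exact anagram of the pool
normalize.("ale")  == letters   # false: "m" is unused
```

The per-word normalization would naturally be done once when loading the word list, not on every lookup.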

You might be able to improve efficiency even further by comparing binary lengths between the input word and wordlist word (because they will never match if these differ).


Sorting may or may not work: the problem states that the “word” can be built from the “letters”, but we are not told whether we need to use all the letters.

“ale” can be made from [a, l, e, m] and is valid by one of the descriptions but not by the other.

You are right. In that case, you can sort and check whether the one string is a prefix of the other using String.starts_with?/2. However, to use that function you will need to transform the input as follows: string |> list of graphemes |> sort |> convert to binary, which might mean that it will be a bit slower than when you just need to check for equality.
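The transform described reads roughly like this in code (a literal sketch of the reply's pipeline; note that a prefix check covers the “ale” example, but is not a general subsequence test):

```elixir
# string |> list of graphemes |> sort |> convert back to binary
normalize = fn word ->
  word |> String.graphemes() |> Enum.sort() |> Enum.join()
end

pool = normalize.("mela")   # "aelm"
word = normalize.("ale")    # "ael"

String.starts_with?(pool, word)   # true in this example
```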