Trouble making word-finding algorithm more efficient

I’m writing a program that, given a list of letters, returns all words from a word list that can be made from these letters. So far I’ve managed to come up with this:

defmodule WordGame.FindWords do
  @wordlist "priv/static/sowpods.txt"

  @doc """
  Returns a list of words that can be made out of `letters`.
  """
  def run(letters) when is_list(letters) do
    @wordlist
    |> File.stream!
    |> Stream.map(&String.replace(&1, "\n", ""))
    |> Stream.filter(&possible?(&1, letters))
    |> Enum.to_list
  end

  @doc """
  Returns `true` or `false` depending on whether `word`
  can be made out of `letters`.
  """
  def possible?(word, letters) when is_bitstring(word) do
    word
    |> String.split("", trim: true)
    |> do_possible?(letters)
  end

  defp do_possible?([], _letters), do: true
  defp do_possible?([letter|others], letters) do
    if letter in letters do
      do_possible?(others, List.delete(letters, letter))
    else
      false
    end
  end
end

It streams the word list, removes the newline character, then filters the word based on the result of possible?/2. This works, but is quite slow, taking around 2.5 seconds for a list of 20 letters; a Python version I wrote earlier takes 1.3 seconds for the same result. Benchfella lists possible?/2 as taking 2.11 µs/op, so I don’t think that’s the bottleneck. I tried calling possible?/2 with Task.async inside the filter function, but that increased the time to about 8 seconds :confused:.

Is there any way to make it more efficient?

1 Like

At least there is a better way to find the bottleneck. bechfella is to benchmark a single function and to discover regression when trying to squeeze the last out of a known bottleneck.

To identify a bottleneck there are profiling tools available, one of them is integrated into mix and can be invoked by mix profile.fprof.

But a thing I see right now… A Set might eventually be faster than a List. Also try to not split at the empty string, use String.codepoints or String.graphemes, not because of speed but because of meaning.

2 Likes

did you try – instead to see the diff in the list?

I’ve been playing around with a program with similar intent. One thing that may help, but I can’t prove it yet, is to divide the word list by letter. In my case, I created a process per (first) letter, to make look ups easier. I think my startup time is no better than yours, but hopefully subsequent lookups are fast.

1 Like

I am not sure if this will be faster in this specific case, but one thing you could do, is to sort both the letters and each word in the word list.

As letters cannot be re-used, two sorted lists of letters are guaranteed to be the same.

Your current alrogithm loops through all letters in the wordlist word for each letter in your input word. This has time complexity O(nm) (where n is the number of letters in your wordlist word and m is the number of letters in your input word). In practice, assuming that the input word and the word you are comparing it with have about the same length, this is thus O(n²) (Quadratic time complexity)

Sorting your input word is something you only need to do once. Then, sorting each of the wordlist words takes O(n log n) time, but with an O(1) comparison afterwards (same lists means same pointers – immutable languages are so nice!).

You might be able to improve efficiency even further by comparing binary lengths between the input word and wordlist word (because they will never match if these differ).

1 Like

Sorting may work or may not, the problem states that the “word” can be built from the “letters”, we are not told if we need to use all letters.

“ale” can be made from [a, l, e, m] and is valid by one of the descriptions and not by the other,

You are right. In that case, you can sort and check if the string is a substring of the other using String.starts_with/2. However, to use that function you will need to transform the input as follows: string |> list of graphemes |> sort |> convert to binary, which might mean that it will be a bit slower than when you just need to check for equality.