Refactoring code for efficiency

Hello everyone! I am working through @pragdave's course on building a hangman app and would like to ask for some refactoring advice. I am about halfway through the course and am now working on the part where the computer plays the hangman game.

I have the following code to retrieve the most commonly used letter in a list of words. It works as expected, but it does not read/feel very elegant and I suspect it’s overcomplicated, probably because of my OOP background. Could anyone offer some advice on how to clean this up?

 @spec get_most_common_letter(possible_words) :: {:letter, :occurrence}
  def get_most_common_letter(possible_words) do
    Enum.reduce(possible_words, %{}, fn word, acc ->
      get_word_most_common_letters(word, acc)
    end)
    |> Map.to_list()
    |> List.keysort(1, :desc)
    |> Enum.at(0)
  end

  defp get_word_most_common_letters(word, counts) do
    chars = String.codepoints(word)

    Enum.reduce(chars, counts, fn char, acc ->
      Map.update(acc, char, 1, &(&1 + 1))
    end)
  end

Hi @absowoot!

Here are some suggestions:

  1. When you see this sort of “nested reduce” pattern, you can sometimes use Elixir’s for-comprehensions instead. This lets you model nested loops of a sort, and can even be used to reduce:
defp letter_frequencies(words) do
  for word <- words,
      letter <- String.codepoints(word),
      reduce: %{} do
    freqs -> Map.update(freqs, letter, 1, &(&1 + 1))
  end
end
  2. Converting a map to a list explicitly is usually unnecessary – in this case, it’s so that you can use List.keysort/3, but maps are already Enumerable, so you can look to the Enum module instead. In this case, Enum.max_by/2:
{letter, _frequency} =
  possible_words
  |> letter_frequencies()
  |> Enum.max_by(fn {_letter, frequency} -> frequency end)

When you enumerate a map, the elements are {key, value} tuples, so you can use max_by to specify the value you want compared when calculating the max. In this case, we’re destructuring the tuple and returning the frequency.

  3. Finally, the correct spec for that function would be:
@spec get_most_common_letter([String.t(), ...]) :: String.codepoint()
#                                         ^^^ list must be non-empty

Thank you for your response Zach, it was super helpful! The for comprehension does make it a bit easier to read/understand and I didn’t realize you could use it for nesting. Previously, I was converting the map to a list because I could not find the right function to sort the map, but it appears I didn’t look hard enough.

Now onto excluding previous guesses and refining the list of words on every iteration :+1:

At the risk of turning this into a FizzBuzz thread, here’s a simple way to do it with Enum.flat_map/2:

words
|> Enum.flat_map(&String.graphemes/1)
|> Enum.frequencies()
|> Enum.max_by(fn {_k, v} -> v end)

The result is the same as your version.


Unfortunately, none of the above solutions looks correct and fast to me.

Here is my version:

defmodule Example do
  # function head
  def sample(words, data \\ {%{}, [], 0})

  # no more words -> return most commonly used chars
  def sample([], {_counts, chars, _count}) do
    # if you call `Enum.reverse/1` on chars then you would keep a proper order of chars
    # if interested you can also call `Enum.sort/1` to keep a sorted order of chars
    Enum.map(chars, &<<&1::utf8>>)
  end

  # call this function recursively,
  # first for each char in the word
  # and then for all the remaining words
  def sample([word | words], {counts, chars, count}) do
    word |> sample({counts, chars, count}) |> then(&sample(words, &1))
  end

  # no more chars in word -> return data for the 2nd recursive call in the clause above
  def sample("", data), do: data

  def sample(<<char::utf8, rest::binary>>, {counts, chars, count}) do
    {chars, count} =
      cond do
        # when no chars yet, or char is already one of the most commonly used so far
        count == 0 or char in chars ->
          # set char as the only most commonly used char and increase count by 1
          {[char], count + 1}

        # for any other char: if the most commonly used chars so far have count 1,
        # or this char's count so far is exactly one less than the current max
        count == 1 or match?(%{^char => char_count} when char_count + 1 == count, counts) ->
          # simply add char to the list of most commonly used chars so far and keep count
          {[char | chars], count}

        true ->
          # otherwise there is no need to change the list of most commonly used chars or the count
          {chars, count}
      end

    # yet another recursive call, this time for the remaining chars in the word
    # no matter which case matched, we increment the character counts map
    # and assign the chars and count determined by the specific case above
    sample(rest, {Map.update(counts, char, 1, &(&1 + 1)), chars, count})
  end
end

and here is why my version is better:

  1. First of all, we have a correct result, i.e. a list of characters.
    For example, for the expression ~w"This is an example" the correct result is ["e", "a", "s", "i"] (see the usage example below).
  2. My code iterates the list of words, and each word's string, only once.
  3. The result is updated on the fly, so there is no need to fetch it from the map of character counts.
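
With the Example module above compiled, a quick check in iex gives:

iex> Example.sample(~w"This is an example")
["e", "a", "s", "i"]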

Helpful resources:

  1. Elixir official page |> Getting Started |> Recursion
  2. Elixir documentation |> Pages |> Patterns and Guards
  3. Kernel.match?/2
  4. Kernel.then/2
  5. left in right operator
  6. Map.update/4

Interesting! After some quick research: basically a grapheme is the correct choice here vs a codepoint, because letters in other languages can be made up of more than one codepoint, right?
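
For instance, if I understand correctly, an "é" written as a plain "e" plus a combining accent is two codepoints but a single grapheme:

iex> s = "e\u0301"         # "é" in decomposed form: "e" plus a combining acute accent
"é"
iex> String.codepoints(s)  # two codepoints…
["e", "́"]
iex> String.graphemes(s)   # …but a single grapheme
["é"]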

Enum.flat_map flattens out all of the words into a lonnnng list of letters, then Enum.frequencies groups the letters together.

If I have a list of 5000 words, could this cause a memory issue?
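
Would swapping in Stream.flat_map/2, so the letters are produced lazily instead of building that whole intermediate list, be a reasonable way around it? Something like:

words
|> Stream.flat_map(&String.graphemes/1)
|> Enum.frequencies()
|> Enum.max_by(fn {_k, v} -> v end)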

Your code does indeed return the desired result and thanks for commenting in the code of how it works! It certainly has some bits that I am not familiar with so I will need to continue reviewing docs and tutorials. How long did this take you to put together? I would imagine after years of working in a (functional) language, this pattern of recursion is common and now muscle memory?

Could you explain that a bit further? Is it that each word is broken down into a list of letters → then that list is looped through to get a count of each character? I’m sure it wouldn’t matter in such a small application but I could see that being an issue at scale.

Is there a way to compare each solution’s memory usage, time to execute, etc. so I can get a better understanding?

Yes! Great opportunity to check out Benchee! GitHub - bencheeorg/benchee: Easy and extensible benchmarking in Elixir providing you with lots of statistics!
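
A minimal script might look something like this (the word list here is just a stand-in, and it assumes the recursive version is compiled as Example.sample/1; adjust to whatever you have locally):

# bench.exs: run with `mix run bench.exs` after adding :benchee to your deps
words = Enum.map(1..5000, fn _ -> "example" end)

Benchee.run(
  %{
    "sample" => fn -> Example.sample(words) end,
    "flat_map" => fn ->
      words
      |> Enum.flat_map(&String.graphemes/1)
      |> Enum.frequencies()
      |> Enum.max_by(fn {_k, v} -> v end)
    end
  },
  time: 5,
  memory_time: 2
)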


Very cool! I’m enjoying Elixir more and more.

For anyone curious, here are the stats:
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 8
Available memory: 16 GB

“sample” = Eiji approach
“flat_map” = Adamu approach

list length = 1505 words (6 letter words)

Name               ips        average  deviation         median         99th %
sample          1.49 K        0.67 ms     ±4.24%        0.67 ms        0.76 ms
flat_map        0.39 K        2.57 ms    ±29.07%        2.47 ms        3.91 ms

Comparison: 
sample          1.49 K
flat_map        0.39 K - 3.81x slower +1.89 ms

list length = 8879 (any length)

Name               ips        average  deviation         median         99th %
sample          214.04        4.67 ms     ±4.05%        4.63 ms        5.23 ms
flat_map         53.21       18.79 ms    ±11.81%       18.34 ms       29.04 ms

Comparison: 
sample          214.04
flat_map         53.21 - 4.02x slower +14.12 ms

Generally in programming there are many ways to solve the same thing, and it’s not a big deal to learn that; the rule is older than Elixir, and than me as well. It’s just practice. I can imagine that the [head | tail] notation or <<char::utf8>>, for example, is a new thing for many people, even those coming from other languages.

In general you should be able to write many things using:

  1. Recursion like already said
  2. for loop like: for <<char::utf8 <- "example">>, do: <<char::utf8>>
  3. Enum functions, especially reduce, but also List.foldl and List.foldr (list left-side and right-side reducers); see the sketch after this list
  4. Stream.unfold/2 like:
"example"                                               
|> Stream.unfold(fn                                     
  "" -> nil                                             
  <<char::utf8, rest::binary>> -> {<<char::utf8>>, rest}
end)                                                    
|> Enum.to_list()
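
For point 3, here is a rough equivalent of the Stream.unfold/2 example using List.foldl/3 over the charlist (just a sketch):

"example"
|> String.to_charlist()
# build the list of single-character strings in reverse order
|> List.foldl([], fn char, acc -> [<<char::utf8>> | acc] end)
|> Enum.reverse()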

I have around 7.5 years in production, including 5.5 years only in Elixir, so I can’t say that muscle memory plays no part here, but don’t think of it as some kind of ordeal. I try to get through literally every single day without stress, and it’s not really hard if you have good food and some money savings. The rest is just a matter of time. Headaches and other symptoms are definitely not a must-have. Well … at least not for me. :smiling_imp:

Ok, so our input is for example: ["This", "is", "example"].

  1. First we iterate the “main” list, pattern matching with the [head | tail] notation, where in this case head is "This" and tail is ["is", "example"].

  2. Inside this clause we first call the function recursively for head.

  3. In a similar way we iterate the “nested” string, pattern matching with <<char::utf8, rest::binary>>, where in this case char is 84 (the codepoint of the T character) and rest is "his".

  4. Inside this clause we first process char and then call recursively for rest.

  5. Once the recursion ends we return the information up to point 2, so the word "This" is iterated only once.

  6. Using that new data, we call recursively with the new words, which are the tail of the last pattern match (in point 1), so the list of words is also iterated only once.

  7. Once the list of words is exhausted, we return all of the stored codepoints as a list of characters.

As you can see, a word is not transformed into a list; it’s just iterated character by character using pattern matching.
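
You can see the pattern match from point 3 directly in iex:

iex> <<char::utf8, rest::binary>> = "This"
"This"
iex> char
84
iex> rest
"his"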


This suggests you don’t want one optimized algorithm to get the most common letter (which could actually be a tie between multiple letters, a case I don’t think you’ve considered). I think you want to maintain state of the overall frequency map, then iteratively exclude/refine by subtracting from some keys in that map. As a second operation, you want to find the most common letters to form a guess from the current frequency map.

One hybrid approach to generating the frequency map from a list of possible_words is to use a for comprehension with a bitstring generator:

frequencies = for word <- possible_words, <<char::utf8 <- word>>, reduce: %{} do
  acc -> Map.update(acc, <<char::utf8>>, 1, & &1 + 1)
end

Then if you want to remove certain guesses from that, you could calculate their frequencies (with this algorithm or Enum.frequencies/1), and a suitable merge function as the third argument to Map.merge/3 could subtract the counts. For a large corpus of words, this is more efficient than recalculating from scratch.
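
Something like this, where removed_words is a stand-in for whichever words you are about to drop from possible_words:

# count the letters in the words being removed
removed_frequencies =
  removed_words
  |> Enum.flat_map(&String.graphemes/1)
  |> Enum.frequencies()

# subtract those counts from the running frequency map and
# drop any letter whose count falls to zero (Map.reject/2 needs Elixir 1.13+)
frequencies =
  frequencies
  |> Map.merge(removed_frequencies, fn _letter, total, removed -> total - removed end)
  |> Map.reject(fn {_letter, count} -> count <= 0 end)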

You could get more sophisticated in your guess logic by considering ties for the most common frequency letter and randomly choosing one of them. Going even further, you could look at the top n frequency letters and do some weighted choice favoring the highest frequency, but still sometimes choosing a slightly lower one.

A more useful data structure for that sort of analysis is an inversion of the frequency map. One way of generating that is

commons =
  frequencies
  |> Enum.group_by(&elem(&1, 1), &elem(&1, 0))
  |> Enum.sort(:desc)

which will be a proplist mapping each integer frequency to a list of letters (the sort operation converts the map into a list of key/value tuples, which preserves the ordering). Then in the simple case, you could

guess =
  commons
  |> List.first()
  |> elem(1)
  |> Enum.random()
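
With the earlier ~w"This is an example" corpus, that works out to roughly the following (the order of letters within each group is not guaranteed):

commons
#=> [{2, ["a", "e", "i", "s"]}, {1, ["T", "h", "l", "m", "n", "p", "x"]}]

guess
#=> "i"   # any one of "a", "e", "i", "s", picked at random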

I’ll leave the weighted choice logic as an exercise for the reader :grin:

You make a valid point; I did not consider that there could be a tie between multiple letters.

Originally my plan was to:

  • Create possible_words list based on how many letters are in the word
  • Guess most frequent letter in the possible_words list
    • If guess is good, remove words from possible_words that do NOT contain that letter
    • If guess is bad, remove words from possible_words that do contain that letter
  • Iterate through the revised possible_words list to find the most frequent letter (and exclude the previously guessed letter)

You’re suggesting that instead, I should keep the frequencies map, determine which words should be removed, and then subtract those words’ letters from the map, and that would be the more efficient approach?
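
In code, I was picturing my refinement step roughly like this (good_guess? is just a placeholder for whether the guess was in the word):

possible_words =
  if good_guess? do
    # keep only words that contain the guessed letter
    Enum.filter(possible_words, &String.contains?(&1, guess))
  else
    # drop every word that contains the failed letter
    Enum.reject(possible_words, &String.contains?(&1, guess))
  end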

Aha, the point I missed is that you’d want to remove many words from the possible list – all the ones that contain a failing letter. In that case, the efficiency is less clear. If you have a starting body of 1000 or 100,000 words, the algorithms’ performance will diverge. For smaller sets of words, your algorithm is probably going to be fine.

Also, given your algorithm, the resolution of tied most-frequent letters is less important. Since this seems to be an exercise you are using for educational purposes, I recommend starting simple and getting something that works accurately, then tackling some more advanced optimizations/algorithms if you wish.


Thanks everyone for your responses and insight! It’s really great to receive assistance for a small beginner project like this.