Avoiding repeating and hardcoding keys across module definitions

I have a large text file that I need to read line-by-line and collect various statistics about its contents, e.g. the number of certain phrases, words, characters, etc. Then I need to turn those statistics into a neat human-readable string.

My current code roughly looks like this:

defmodule MyStruct do
  @moduledoc false
  defstruct total: 0, foo: 0, bar: 0, baz: 0
end
defmodule Parser do
  @moduledoc false
  defp translations do
    %{
      "Example String 1" => %{translation: :foo},
      "Example String 2" => %{translation: :bar}
      # Some complex strings that optionally may need to be further parsed with regexps, etc., hence maps as values
      # ...
    }
  end

  def parse_file do
    "large_text_file.txt"
    |> File.stream!()
    |> Enum.reduce(%MyStruct{}, fn line, acc -> maybe_update_acc(line, acc) end)
  end

  # A nil key means the line matched no known string, so leave the acc unchanged
  defp increment(map, nil), do: map
  defp increment(map, key), do: Map.update(map, key, 1, &(&1 + 1))

  defp find_tl(string) do
    Enum.find_value(translations(), fn {k, v} -> if String.contains?(string, k), do: v.translation end)
  end

  defp maybe_update_acc(line, acc) do
    tl = find_tl(line)

    acc
    |> increment(:total)
    |> increment(tl)
    # Irrelevant logic for regexps and other conditions to update the acc further
    # ...
  end
end
defmodule Stringifier do
  @moduledoc false
  defp translations do
    # Is a list because the result needs to be ordered in a certain way
    [
      total: "Everything, duh",
      foo: "Those fabulous foos",
      bar: "Them beautiful bars"
      # ...
    ]
  end

  def translate(%MyStruct{} = struct) do
    Enum.map(translations(), fn {key, string} ->
      # Structs don't implement the Access behaviour, so struct[key] would raise; use Map.get/2
      struct_value = Map.get(struct, key)
      if struct_value, do: translate_key(key, string, struct_value)
    end)
  end

  def translate_key(key, string, value) do
    case key do
      # Keys that need unique formatting or other logic
      # ...

      key -> "#{string} - #{value}"
    end
  end
end

This approach has some serious flaws. The most obvious one is that every time I need to add a new string to collect data about, or remove or change an existing one, I need to edit my code in three places: the struct definition, the parser, and the stringifier. This is bad. If I forget one, in the best-case scenario my application crashes; in the worst case I get a completely wrong result.

However, I can’t come up with a better way to do things. Maybe I could make a single keyword list like this, [foo: %{term: "Example String 1", string: "Those fabulous foos"}], which would give me the required ordering and would allow me to unify the translations from the Parser and Stringifier modules.
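Spelled out, that unified list might look something like this (a rough sketch of the idea; a nil :term would mark a “meta” key that has nothing to match in the file):

translations = [
  total: %{term: nil, string: "Everything, duh"},
  foo: %{term: "Example String 1", string: "Those fabulous foos"},
  bar: %{term: "Example String 2", string: "Them beautiful bars"}
]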

At first glance, this probably won’t put too much strain on my find_tl/1 function, but because I also have “meta” keys (like :total) that don’t resolve to any terms from the file, it will actually increase the complexity. Usually I don’t mind such overhead if it means better readability, but the text file can have billions of lines, so I need all the speed I can get.

Some “meta” keys need to be in the front, some in the back, some in the middle, so simply prepending/appending them in the Stringifier module is not an option.

Besides, this would mean that I’d still have to edit the struct. Though I guess that’s probably not that important, because I don’t do any struct-related compile-time checks.


I don’t see the struct adding any value in this case; a plain map would serve the same purpose but save editing one place when adding a new value to collect.

A good way to spot this is that both reading and writing of the counts use plain map operations (Map.get and Map.update) rather than compile-time references like .some_field.
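Concretely, the only change needed in Parser is seeding the reduce with a bare map (a sketch; the rest stays as-is):

  def parse_file do
    "large_text_file.txt"
    |> File.stream!()
    # %{total: 0} replaces %MyStruct{}; increment/2 and Map.get work the same on a plain map
    |> Enum.reduce(%{total: 0}, fn line, acc -> maybe_update_acc(line, acc) end)
  end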

I’d recommend a list of two-element tuples for Parser’s translations:

  @moduledoc false
  defp translations do
    [
      {"Example String 1", :foo],
      {"Example String 2", :bar],
      # Some complex strings that optionally may need to be further parsed with regexps, etc., hence bigger tuples
      # {~r/some weird thing/, extra_args, :baz}
    ]
  end

This ensures that functions like Enum.find_value(translations(), ...) always check for matches in the same order, so that overlapping definitions like:

      {"Example String", :short_foo],
      {"Example String 1", :foo],

always give the same result.
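With that shape, find_tl/1 only has to destructure tuples instead of map values, e.g. (a sketch that covers just the two-element tuples):

  defp find_tl(string) do
    Enum.find_value(translations(), fn
      {term, key} -> if String.contains?(string, term), do: key
      # clauses for the bigger regex tuples would go here, e.g.
      # {regex, _extra_args, key} -> if Regex.match?(regex, string), do: key
    end)
  end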


Stringifier has a similar-but-different list, just like in your already-posted code, which is slightly easier to write since the first element of every tuple is an atom.

Again the list ensures that the result iterates over the keys in the correct (& consistent) order.


One place to NOT use a list is for keeping the counts! For instance:

    # CAUTION: PERFORMANCE HAZARD BELOW
    "large_text_file.txt"
    |> File.stream!()
    |> Enum.reduce([total: 0], fn line, acc ->
      tl = find_tl(line)
      acc
      |> Keyword.update(:total, 1, &(&1 + 1))
      |> Keyword.update(tl, 1, &(&1 + 1))
    end)

Unlike in the previous cases, the key order of the result isn’t something we care about - and maintaining it means that Keyword.update is slower than Map.update by a factor roughly the size of acc. This kind of random access is a good indicator that a map is the right choice.
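For contrast, the map-based version of the same loop (a sketch, reusing increment/2 from the first post so a nil tl is skipped):

    "large_text_file.txt"
    |> File.stream!()
    |> Enum.reduce(%{total: 0}, fn line, acc ->
      # Map.update/4 doesn't scan the whole accumulator the way Keyword.update/4 does
      acc
      |> increment(:total)
      |> increment(find_tl(line))
    end)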


This still leaves two places that need editing when adding a new value to be counted, but IMO that’s not unreasonable:

  • one place says “this is how to collect this count”
  • the other place says “this is how to display this count”

Fusing the two together into one list would force one of the orderings discussed above to match the other; either the display order would need to exactly follow the matching order, or the other way around.


I don’t see the struct adding any value in this case; a plain map would serve the same purpose but save editing one place when adding a new value to collect.

True. In this particular case, I only use a struct because it, in my opinion, improves code readability by making it easier to understand what’s being passed around. Other than that, it serves pretty much no function in my code. Once again, no compile-time references, and I don’t need to implement any protocols or whatever. I’m not sure whether defining a struct just for the sake of readability is a code smell, so I’ll probably just remove it altogether.

This ensures that functions like Enum.find_value(translations(), ...) always check for matches in the same order, so that overlapping definitions like:

Sorry, I should’ve probably mentioned that such overlapping strings aren’t possible in the file, so it’s not a concern and there’s no reason to use a list of tuples. But I’ll take note of your suggestion in case things change, or for other projects, so thank you anyway.

This still leaves two places that need editing when adding a new value to be counted, but IMO that’s not unreasonable

Yeah, I guess that’s perfectly reasonable. I was so obsessed with attempting to make all those translations, uh, “monolithic”, that I completely forgot that I’m doing two completely different things in two completely different parts of my application. Thank you for reminding me of that and for essentially clearing my doubts on whether such repetitions are a bad practice or not.

Seems like the correct way to tackle this problem would be to simply write a test that checks that all translations from the parser are present in the stringifier, to catch typos and the like.
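Something like this, assuming both translations/0 functions are promoted from defp to def so the test can reach them:

defmodule TranslationsTest do
  use ExUnit.Case, async: true

  test "every key the parser collects has a display string" do
    # Assumes Parser.translations/0 and Stringifier.translations/0 are public
    parser_keys =
      Parser.translations()
      |> Map.values()
      |> Enum.map(& &1.translation)

    stringifier_keys = Keyword.keys(Stringifier.translations())

    for key <- parser_keys do
      assert key in stringifier_keys, "#{key} has no entry in Stringifier"
    end
  end
end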


I’ve just run into a very similar problem. My approach was to use structs to represent the data parsed out of the file, so the output of the initial parse of the text file is a list of structs representing the meaning extracted from the text. Then, if you want a more standardized final product, you can reduce this list into a final struct; the logic will be much cleaner because you can pattern match on structs, and you can even organize further logic into the struct modules.
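A rough sketch of what I mean by that final reduce, with placeholder struct names:

defmodule FooMatch do
  defstruct [:name]
end

defmodule Stats do
  defstruct total: 0, foos: 0

  # Pattern matching on the parsed struct keeps the merge logic declarative
  def merge(%FooMatch{}, %Stats{} = stats),
    do: %Stats{stats | total: stats.total + 1, foos: stats.foos + 1}

  def merge(_other, %Stats{} = stats),
    do: %Stats{stats | total: stats.total + 1}
end

# parsed_structs is the list produced by the initial parse
final = Enum.reduce(parsed_structs, %Stats{}, &Stats.merge/2)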

You can also combine named regex captures with new/1 functions in your struct modules that accept maps with keys corresponding to your regex captures, to cut down on repetition in your code. For example:

defmodule ParseMod do
  @patterns [{~r/(?<some_name>\w+)/, StructMod}, ...]

  def parse(file_name) do
    file_name
    |> File.stream!()
    |> Stream.flat_map(fn line ->
      Enum.reduce_while(@patterns, [], fn {pattern, mod}, [] ->
        # Regex.named_captures/2 returns nil when there's no match
        if match = Regex.named_captures(pattern, line) do
          {:halt, [apply(mod, :new, [match])]}
        else
          {:cont, []}
        end
      end)
    end)
  end
end

Now when you have more data to extract, you add a pattern entry and create a corresponding module with the appropriate new/1 function, and things stay nice and tidy.
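For completeness, a matching StructMod could be as small as this (the module and field names are only illustrative):

defmodule StructMod do
  defstruct [:some_name]

  # Regex.named_captures/2 returns a map with string keys,
  # so convert them to the struct's atom fields here
  def new(%{"some_name" => some_name}), do: %__MODULE__{some_name: some_name}
end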