Beyond code that gets it done: code readability. Any advice?

Besides learning Elixir and ‘getting things done’ with it, I hope to also learn how to write code that is highly readable – and therefore easy to grasp. I still have a lot to learn in that regard, I’m sure. I try not to rush off to write a solution for the next problem before cleaning up some of the mess I left behind on the last one. The code might work completely, but it’s never fun to come back to my own code a while later and feel tempted to rewrite it rather than untangle what I wrote in the first place.

I have noticed that I often find Elixir code more readable than JavaScript code (being familiar with both) – the pipeline pattern in particular. Nevertheless, I am interested in any personal advice you have about writing readable code.

To give you an example: I wrote this function today. I didn’t think it would turn out this long and involved. I want to clean it up further (I’ve made a tiny start already).

I love pipelines that consist of highly descriptive code. In this case, however, that would possibly mean writing a lot of helper functions.

I have also seen people put comments behind pipe elements to provide description. Often, though, that results in a lot of info crowded into one place.
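By that I mean something like this (just a sketch, using names from my own code):

  responses
  |> Helpers.frequencies(variable)                          # %{23 => 4, 35 => 2, ...}
  |> Enum.filter(fn {age, _count} -> age < end_bucket end)  # drop ages past the last bucket

Anyway, here is the function: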

  alias AppWeb.Component.Helpers

  defp histogram_frequencies(responses, variable, bucket_size, start_value, bucket_count) do
    end_bucket = start_value + bucket_count * bucket_size
    max_value =
      responses
      |> Enum.map(fn %{^variable => value} -> value end)
      |> Enum.max()

    bucket_number_list = Enum.to_list(1..bucket_count)
    zero_list = Enum.map(bucket_number_list, fn b -> {b, 0} end)
    sums =
      responses
      |> Helpers.frequencies(variable)
      |> Enum.filter(fn {age, _count} -> age < end_bucket end)
      |> Enum.map(fn {age, count} -> {trunc(age/bucket_size), count} end)
      |> Kernel.++(zero_list)
      |> Enum.group_by(fn {key, _value} -> key end)
      |> Enum.map(fn {_key, value} -> Enum.map(value, fn {_x, e} -> e end) end)
      |> Enum.map(fn p -> Enum.sum(p) end)

    lower_limits_buckets = Helpers.lower_limits_buckets(bucket_size, start_value, bucket_count)
    upper_limits_buckets = Helpers.upper_limits_buckets(bucket_size, start_value, bucket_count)
    buckets =
      [lower_limits_buckets, upper_limits_buckets]
      |> Enum.zip_with(fn [x, y] -> "#{x}-#{y}" end)

    # If any values greater than range of last bucket,
    # put them into the last bucket
    # and change that bucket's name accordingly (e.g. "60-70" becomes "60+").
    case max_value < end_bucket do
      true ->
        Enum.zip(buckets, sums)

      false ->
        last_bucket = "#{start_value + bucket_count * bucket_size}+"
        last_sum =
          responses
          |> Helpers.frequencies(variable)
          |> Enum.filter(fn {key, _value} -> key >= end_bucket end)
          |> Enum.map(fn {_age, count} -> count end)
          |> Enum.sum()

        sums = sums ++ [last_sum]
        buckets = buckets ++ [last_bucket]

        Enum.zip(buckets, sums)
    end
  end

Take those pipelines and make them their own defp functions; the case should probably be an “if”.
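For example (just a sketch; overflow_count is a name I made up):

  defp overflow_count(responses, variable, end_bucket) do
    responses
    |> Helpers.frequencies(variable)
    |> Enum.filter(fn {age, _count} -> age >= end_bucket end)
    |> Enum.map(fn {_age, count} -> count end)
    |> Enum.sum()
  end

Then the case collapses to:

  if max_value < end_bucket do
    Enum.zip(buckets, sums)
  else
    last_sum = overflow_count(responses, variable, end_bucket)
    Enum.zip(buckets ++ ["#{end_bucket}+"], sums ++ [last_sum])
  end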

There are general Elixir coding guidelines too: fetch vs fetch! vs get have specific meanings, and is_ function names should be avoided unless they are guards.
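For example, the Map module follows the get/fetch/fetch! convention:

  Map.get(map, :key)    # the value, or nil / a given default – never raises
  Map.fetch(map, :key)  # {:ok, value} or :error
  Map.fetch!(map, :key) # the value, or raises KeyError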

Avoid variables called “value”, lol. I have a linter for that to get me out of the habit.

4 Likes

Thanks. I’ll look into those guidelines.

And now that I come to think of it, I guess my concern with readability also extends to efficient and performant code, since that too can potentially be optimized after getting an initial working solution.

Regarding efficiency and performance, you could improve it by doing some of these successive Enum operations in one pass, which avoids building intermediate lists and walking them twice.

filter |> map could be re-implemented with a comprehension:

  |> Enum.filter(fn {key, _value} -> key >= end_bucket end)
  |> Enum.map(fn {_age, count} -> count end)

could be

for {age, count} <- age_frequencies, age >= end_bucket, do: count

map |> map should typically be avoided, since you can do it directly in one pass:

  |> Enum.map(fn {_key, value} -> Enum.map(value, fn {_x, e} -> e end) end)
  |> Enum.map(fn p -> Enum.sum(p) end)

could be

  |> Enum.map(fn {_key, value} -> Enum.map(value, fn {_x, e} -> e end) |> Enum.sum() end)

and even map |> sum could be replaced by Enum.reduce/3 here (although this one might be slightly less readable):

  |> Enum.map(fn {_key, value} -> Enum.reduce(value, 0, fn {_x, e}, acc -> e + acc end) end)

Credo has some checks like MapMap, FilterFilter, MapJoin… to help detect some of these patterns.
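Assuming a current Credo, those are Credo.Check.Refactor.MapMap, Credo.Check.Refactor.FilterFilter and Credo.Check.Refactor.MapJoin, which you can enable or tune in .credo.exs:

  {Credo.Check.Refactor.MapMap, []},
  {Credo.Check.Refactor.FilterFilter, []},
  {Credo.Check.Refactor.MapJoin, []},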

I didn’t mention Stream, since it also comes with some overhead and would probably not improve performance here unless you are working with large lists.
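(Using it would mean swapping Stream in for the intermediate Enum steps, roughly:

  responses
  |> Helpers.frequencies(variable)
  |> Stream.filter(fn {age, _count} -> age >= end_bucket end)
  |> Stream.map(fn {_age, count} -> count end)
  |> Enum.sum()

so that only the final Enum.sum/1 actually walks the collection.)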

2 Likes

Functional question: what is trunc(age/bucket_size) intended to calculate? It could return 0 for age < bucket_size, which conflicts with the definition of bucket_number_list as starting at 1.


A general rule I find useful: if you see multiple functions with similar prefixes / suffixes, consider if there’s a data structure hiding in the code.

A similar outcome from a different thing: if you see multiple arguments that are always handled together, consider if there’s a data structure hiding in the code.

For instance, a struct called Buckets:

defmodule Buckets do
  defstruct [:size, :start, :count]

  def index_of(x, b) do
    (x - b.start + b.size) / b.size
    |> floor()
    |> max(0)
    |> min(b.count + 1)
  end

  def label_for(0, b), do: "under #{upper_bound(0, b)}"
  def label_for(index, %{count: count} = b) when index == count + 1, do: "#{lower_bound(index, b)}+"
  def label_for(index, b) do
    "#{lower_bound(index, b)}-#{upper_bound(index, b)}"
  end

  def lower_bound(0, _), do: nil
  def lower_bound(index, b), do: b.start + (index - 1) * b.size

  def upper_bound(index, %{count: count}) when index == count + 1, do: nil
  def upper_bound(index, b), do: b.start + index * b.size

  def trim_ends(map, b) do
    map
    |> trim_at(0)
    |> trim_at(b.count + 1)
  end

  defp trim_at(map, index) do
    if Map.get(map, index) == 0 do
      Map.delete(map, index)
    else
      map
    end
  end

  def zero_pad(map, b) do
    1..b.count
    |> Enum.reduce(map, fn index, acc ->
      Map.put_new(acc, index, 0)
    end)
  end
end

Then the main function can use these higher-level concepts:

defp histogram_frequencies(responses, variable, bucket_size, start_value, bucket_count) do
  # could also be passed in as an argument
  buckets = %Buckets{size: bucket_size, start: start_value, count: bucket_count}

  responses
  |> Enum.frequencies_by(fn %{^variable => value} -> Buckets.index_of(value, buckets) end)
  |> Buckets.trim_ends(buckets)
  |> Buckets.zero_pad(buckets)
end

There are a lot of advantages to this approach:

  • the small functions in Buckets are easier to read / test / debug
  • naming gets easier - inside Buckets, just saying count is sufficient (versus bucket_count)
  • errors like swapping start_value and bucket_size - which will run, since both are numbers, but produce nonsense output - are avoided by passing around a whole struct

One other side-effect of this approach: for large lists in responses where at least one value lands in the “over the limit” bucket, this method will be about twice as fast because it only constructs the frequencies map once!

4 Likes

In general I think any pure function has the potential to be understandable.
You just have to load the structure of the incoming data into your brain and understand what’s happening in each step. So this is the main point for me: make it possible for the reader to follow.

Imagine someone telling a story in plain English. If you lose track of the (grammatical) subject, you can’t understand the story. In a (pure) function, likewise, the subject is some data coming in, transformed with the help of some other data structures.

There are two basic techniques already mentioned to make this easier:

While a defp adds some mental load, it also has the potential to reduce it. It is almost always worth it when a clear transformation of data can be refactored into a defp whose name neatly describes it.
If I don’t want a defp, I sometimes just add a comment with an example of the current shape of the data.
Sometimes there is so much going on that a new module is needed. Then definitely add a @spec, which greatly helps with understanding what’s going in and out.
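For instance (with a made-up shape, just to illustrate):

  # responses: [%{age: 23, ...}, ...] -> [{"18-27", 4}, {"28-37", 2}, ...]
  @spec histogram_frequencies([map()], atom(), pos_integer(), integer(), pos_integer()) ::
          [{String.t(), non_neg_integer()}]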

Reducing the possible structures of data flowing through your functions by using structs (and, as a first step, understanding what’s there) immensely reduces the mental load of reading a function.


The rest, I think, is “just” naming things (hard) and using the tools correctly.

1 Like

This is honestly a huge help. Thank you.

My intention was to sum frequencies by bucket (e.g. all respondents between ages 18-27, 28-37, 38-47, etc.). trunc(age/bucket_size) separates ages into different buckets. But all in all it takes a lot of steps to sum by bucket in my code (and I should also have subtracted ‘start_value’ from ‘age’ – otherwise the lower end doesn’t get trimmed properly. Luckily, nobody depends on this code :sweat_smile:).
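To make that concrete, with bucket_size 10 and start_value 18 (so the first bucket is 18-27):

  trunc(23 / 10)        #=> 2, the wrong bucket
  trunc((23 - 18) / 10) #=> 0, the first bucket, as intended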

Love this. Hope to be able to put it into practice more.

I think it has cleaned up nicely already. I must say: this has been a very instructive exercise.

defmodule SurveyWeb.Structs.Bins do
  defstruct [:size, :start, :count]

  def index_of(value, bins) when value >= bins.start + bins.size * bins.count do
    bins.count - 1
  end

  def index_of(value, bins) do
    (value - bins.start) / bins.size
    |> trunc()
  end

  def index_to_label(index, bins) when index == bins.count - 1 do
    lower_limit = bins.start + bins.size * index
    "#{lower_limit}+"
  end

  def index_to_label(index, bins) do
    lower_limit = bins.start + bins.size * index
    upper_limit = lower_limit + bins.size - 1
    "#{lower_limit}-#{upper_limit}"
  end
end

defmodule SurveyWeb.Components.Histogram do
  use SurveyWeb, :live_component

  alias SurveyWeb.Components.Helpers
  alias SurveyWeb.Structs.Bins

  defp histogram_frequencies(responses, target_variable, bin_size, start_value, bin_count) do
    bins = %Bins{size: bin_size, start: start_value, count: bin_count}

    responses
    |> Enum.frequencies_by(fn %{^target_variable => value} -> Bins.index_of(value, bins) end)
    |> add_empty_bins(bins)
    |> Enum.map(fn {index, count} -> {Bins.index_to_label(index, bins), count} end)
  end

  defp add_empty_bins(frequencies, bins) do
    missing_indexes = Enum.to_list(0..bins.count - 1) -- Map.keys(frequencies)

    # Put a zero count into the map for every bin index that has no responses.
    Enum.reduce(missing_indexes, frequencies, fn index, acc ->
      Map.put(acc, index, 0)
    end)
  end

  def render(assigns) do
    ~H"""
    <span class={"#{@class} histogram"}>
      <%= for {category, count} <-
        histogram_frequencies(@responses, @target_variable, @bin_size, @start_value, @bin_count) do %>
        <div class="bar" style={"height: #{count/Enum.count(@responses)*100}%"}>
          <div class="count">
            <%= count %>
            <%= "(#{Helpers.decimal_responses_percentage(@responses, count, 1)})" %>
          </div>
          <div class="category">
            <%= category %>
          </div>
        </div>
      <% end %>
    </span>
    """
  end
end

1 Like

Just a minor note: by convention, the functions in a module for a struct should take the struct as the first argument, if that makes sense. Or generally, not just for structs, whenever the module is a noun (List, String, Registry, etc.).
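That’s the shape the standard library follows, e.g.:

  String.pad_leading(string, count)
  Map.put(map, key, value)
  Registry.lookup(registry, key)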

2 Likes

If I make bins the first argument in all functions in SurveyWeb.Structs.Bins, I will not always be able to pass the piped argument as the first argument in histogram_frequencies/5.

For example, I might get:

|> (&Bins.add_empty_bins(bins, &1)).()

Instead of:

|> Bins.add_empty_bins(bins)
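(I realize Kernel.then/2, available since Elixir 1.12, would make that first form read a bit better – |> then(&Bins.add_empty_bins(bins, &1)) – but the piped argument still isn’t in first position.)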

I could make bins the argument of the pipeline in histogram_frequencies/5, instead of ‘responses’. My original idea was not to do that, though, because my app revolves around the concept ‘survey’ – and therefore responses seemed a logical choice for the role of protagonist. For example: “This pipeline turns survey responses into a data-object that a histogram function can process.” instead of “This pipeline turns this bins struct into a data-object that a histogram function can process.”

Any recommendations as to which is the better approach, generally speaking? Pick the most logical protagonist for your pipeline, regardless of whether it will be the first argument passed to all segments of the pipeline? Or will I end up regretting |> (&Bins.add_empty_bins(bins, &1)).()?

-without reading the whole thread-

When those are private functions, I would take the pragmatic ordering that makes the flow in the public function as clear as possible. After all

the code (convention, ed.) is more what you’d call “guidelines” than actual rules. - Captain Barbossa

Public functions, however, should be predictable, taking the struct as the first argument.

edit: Barbossa notified me the guidelines only apply when you are a pirate. Sorry for the confusion.

2 Likes