String concatenation of multiple pieces of string

Related question: Implications of String concatenation vs. IOList for ANSI color codes

I’m wondering about the performance difference between

ANSI.red <> "red" <> ANSI.reset <> ANSI.green <> "green" <> ANSI.reset

and

:erlang.list_to_binary([ANSI.red, "red", ANSI.reset, ANSI.green, "green", ANSI.reset])

I guess the one with :erlang.list_to_binary/1 is faster, but I might be wrong. Any suggestions are welcome.

1 Like

IO lists offer some amazing performance benefits, but if you’re new to Elixir, first focus on “make it work, make it work right, make it work fast”, in that order. Actually, calling :erlang.list_to_binary/1 explicitly is a bit of a red flag. The true speed comes from passing IO lists into the final functions that write to the device, whether that is IO.puts, a socket, etc.
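
For example (a quick sketch of my own, not from the original post), the same pieces can be handed to IO.puts/1 as an IO list, and no intermediate binary is ever built in your code:

alias IO.ANSI

# IO.puts/1 accepts chardata, so the list itself is enough;
# the joining happens at the device, not in our process.
IO.puts([ANSI.red, "red", ANSI.reset, ANSI.green, "green", ANSI.reset])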

To focus on your explicit question: I’m not entirely sure. I’d expect the second one to be faster, but you’d want to benchmark it to be certain. The first one creates a new binary at each <> operator, while I would hope the second one creates a single binary and writes each piece of the IO list into it consecutively.
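
As a quick sanity check (my own iex session, not part of the original reply), both expressions evaluate to the same binary; only how it gets built differs:

iex> alias IO.ANSI
iex> a = ANSI.red <> "red" <> ANSI.reset <> ANSI.green <> "green" <> ANSI.reset
iex> b = :erlang.list_to_binary([ANSI.red, "red", ANSI.reset, ANSI.green, "green", ANSI.reset])
iex> a == b
true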

3 Likes

Keeping to Elixir functions where possible is a reasonable principle. If putting multiple binaries into one list does not gain much performance, using the <> operator certainly looks easier to maintain. Maybe I’m too influenced by Erlang idioms.

I was thinking about the difference when forming a large string (or binary) for output, such as the body for Plug.Conn.send_resp/3, built up from a list of partial strings.

@gregvaughn is absolutely right that correctness is the first priority. After that if you’re not satisfied with performance you need to measure, measure, measure.

Benchmarking

Thanks to your question, I took the opportunity to try the bmark benchmarking tool. Here’s a quick benchmark I wrote based on the directions in the project’s README and your suggested expressions.

defmodule StringContenation do
  use Bmark

  alias IO.ANSI
  
  bmark :binary_concat_operator, runs: 1000 do
    ANSI.red <> "red" <> ANSI.reset <> ANSI.green <> "green" <> ANSI.reset
    |> IO.puts
  end

  bmark :erlang_list_to_binary, runs: 1000 do
    :erlang.list_to_binary([ANSI.red, "red", ANSI.reset, ANSI.green, "green", ANSI.reset])
    |> IO.puts
  end
end

I’m piping to IO.puts here so we can compare these approaches against IO lists. By default bmark does 20 runs per benchmark block, but I’ve increased them to 1000 to get more statistically significant results.

Next I run mix bmark. As expected I see a bunch of redgreen lines printed and colored appropriately.

Results

The benchmark timing data are in the $PROJ_ROOT/results/ directory. To compare the two implementations I run

mix bmark.cmp results/stringcontenation.binary_concat_operator.results \
              results/stringcontenation.erlang_list_to_binary.results

And the results!

results/stringcontenation.binary_concat_operator.results: results/stringcontenation.erlang_list_to_binary.results:
35 50
27 37
74 73

65 53
49 63
103 65

76.677 → 142.257 (+85.53%) with p < 0.025
t = 2.2981012392583193, 1998 degrees of freedom

In this particular example, the :erlang.list_to_binary version, on average, took 1.8553 times as long as the <> version. Don’t interpret this to mean you should always use <>; the inputs in these examples are rather small. The README for the bmark project has a bit more detail on how to interpret these results.

Moar data

Now let’s try benchmarking a couple more approaches:

  bmark :io_list_all_known_data, runs: 1000 do
    [ANSI.red, "red", ANSI.reset, ANSI.green, "green", ANSI.reset]
    |> IO.puts
  end

  bmark :io_list_append_to_end, runs: 1000 do
    # Simulates converting an incoming stream of unknown data
    # into an IO list and printing it.
    [[[[[[ANSI.red], "red"], ANSI.reset], ANSI.green], "green"], ANSI.reset]
    |> IO.puts
  end

Let’s compare each of these new approaches against the <> results:

Left: :binary_concat_operator, Right: :io_list_all_known_data

65.195 → 68.618 (+5.25%) with p < 1
t = 0.48428845990199687, 1998 degrees of freedom

Interestingly, <> is still faster, although we have little confidence in that result because the p-value is as high as it can get!

Left: :binary_concat_operator, Right: :io_list_append_to_end

65.195 → 125.968 (+93.22%) with p < 0.025
t = 2.1281669445218134, 1998 degrees of freedom

<> crushes the deeply nested IO list, and we can be pretty confident about that for these particular inputs on my test machine. However, that may not be true for concatenating sequences of strings in general. To check that you’d need to do a lot more benchmarks with different kinds of inputs and ideally on several different kinds of machines.

Exercises for the reader

  1. Determine how long IO lists have to get before they beat <>, if at all (see the sketch after this list for one starting point).
  2. Does size of the binaries within the lists or the nesting depth affect the results, and if so how?
  3. Are any of these expressions reduced at compile-time? We’re benchmarking run-time speeds, so any work done by the compiler to optimize our inputs may explain surprising results.
  4. Do you suspect Elixir, Erlang, or any part of your cache hierarchy is giving you unrealistic results? How can you tell? Could you somehow reduce the influence of caching?
  5. How well do your benchmarks correspond to the way your code will be exercised in production? Do you have any data about production response times?
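
Here’s a rough starting point for the first exercise (my own sketch using :timer.tc rather than bmark; the piece size and counts are arbitrary):

defmodule ConcatVsIolist do
  @piece "0123456789"

  def run do
    for n <- [10, 100, 1_000, 10_000] do
      pieces = List.duplicate(@piece, n)

      # Build one binary by repeated <> ...
      {concat_us, _} =
        :timer.tc(fn -> Enum.reduce(pieces, "", fn p, acc -> acc <> p end) end)

      # ... versus converting the whole list at once.
      {iolist_us, _} =
        :timer.tc(fn -> IO.iodata_to_binary(pieces) end)

      IO.puts("n=#{n}  <>: #{concat_us}us  iolist: #{iolist_us}us")
    end
  end
end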
3 Likes

Aren’t you timing the performance of IO.puts in those calls? It has to package up the data, send it to the group leader as a message, and then continue on, and that will probably be the most costly part of the benchmark by far (due to data copying because of small binaries).
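
Roughly speaking, the message that goes to the group leader looks like this (a simplified sketch of the standard Erlang I/O protocol, not the actual IO.puts source):

ref = make_ref()
gl = Process.group_leader()

# The whole chardata travels inside the message; small binaries get
# copied into it, which is where that cost comes from.
request = {:put_chars, :unicode, [IO.ANSI.green, "green", IO.ANSI.reset, ?\n]}
send(gl, {:io_request, self(), ref, request})

receive do
  {:io_reply, ^ref, reply} -> reply  #=> :ok
end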

However yes, IOLists are more efficient with larger sets of binaries and bits of other ‘stuff’ thrown in. :slight_smile:

At smaller scale it is not.

Yes, but I’m doing the same thing for all of them so statistically significant results are still meaningful.

Is that the least expensive way to guarantee an IO list gets serialized to a binary in the same way it does for IO.puts?

I’m still learning Elixir, and I’m not sure where the IO list flattening occurs. Is it always done by List.flatten or can it even be deferred to Erlang/BEAM through another path?
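
For what it’s worth, a quick iex check of my own (not from the thread): IO.iodata_to_binary/1, which delegates to :erlang.iolist_to_binary/1, accepts arbitrarily nested lists, so no explicit List.flatten is needed before handing the data off:

iex> nested = [[["He", ?l], "lo"], [" ", ["wor", ?l, ?d]]]
iex> IO.iodata_to_binary(nested)
"Hello world"
iex> IO.puts(nested)
Hello world
:ok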

One thing’s for sure: benchmarking is much more insightful when you have a good understanding of the internals you’re exercising.

Not in this case, the different data structures will be sent around differently. :slight_smile:

Here’s what happens when an IOList is output to IO; I’ll use Phoenix as an example:

You have a template; it gets compiled into a render function that takes arguments and outputs an IOList. A simple example might be:
Template:

Hello <%= @name %>!

Output:

def render(assigns), do: ["Hello ", assigns.name, "!"]

And with lots of loops and repeated strings you can see how things get re-used. Say you have a function that returns a chunk of HTML, like an input element, and a form that contains lots of these: the binaries for those fragments only exist once but are pointed to by the IOList many times, potentially saving a lot of memory and keeping the data in cache better. This is when an IOList can be useful in-program especially; it shines even more when outputting to a network socket.
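
To make the sharing concrete, here’s a tiny sketch of my own (the module attributes stand in for repeated template fragments):

defmodule FormExample do
  # These fragments exist once; every field's IOList just points at them.
  @open "<input type=\"text\" name=\""
  @close "\">"

  def render(fields) do
    Enum.map(fields, fn name -> [@open, name, @close] end)
  end
end

IO.puts(FormExample.render(["first_name", "last_name", "email"]))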

Say you made a template that was one giant 10k binary; it will be serialized out to the socket decently fast and linearly. However, the OS has a call that can take a ‘list’ of memory locations and lengths to output, and that is what an IOList uses when output on an IO socket. So when you output a template with a lot of repeated parts, the kernel writes the first segment in the list, then the second, then the third (which could just be the part pointed to by the first again), then a fourth (which could be just ‘part’ of another memory section), and so on. It holds less in RAM, less in the CPU cache, and it can all be done in-kernel without a lot of context switches.
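
At the Elixir level that just means handing the IO list to the socket function untouched. A minimal sketch of my own (assuming a plain TCP connection; the vectored write happens below this API):

{:ok, socket} = :gen_tcp.connect(~c"example.com", 80, [:binary, active: false])

# :gen_tcp.send/2 accepts iodata, so the nested list goes out as-is;
# the pieces are never joined in our process.
request = [
  "GET / HTTP/1.1\r\n",
  ["Host: ", "example.com", "\r\n"],
  "Connection: close\r\n\r\n"
]

:ok = :gen_tcp.send(socket, request)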

Thus the test you did above, with no repeated binaries, no large binaries that could be put on the ‘global’ (shared) binary heap, and no repeated segments, is basically a worst case for IOLists, so your benchmark will show IOLists at their worst. But for the case of, say, an HTML template, with lots of repeated elements and huge parts kept in the shared binary heap, IOLists ‘should’ perform not just better but potentially a lot better (hardware depending, but even at worst they will not be slow by any stretch). :slight_smile:

3 Likes

Good point. I need to poke around the Elixir and Erlang source to understand how IO works with various terms and what situations cause a term to be serialized to a binary.

Thanks for the explanation. The benefits of IO lists aren’t lost on me. They turn what could be a quadratic time operation, O(n^2), into a linear time operation, O(n). It’s good to use them for the same reason you’d want to use Java’s StringBuilder instead of its string concatenation operator, +. IO lists may be even better because, as you mentioned with network sockets, they can be streamed out in contiguous chunks without ever needing to create a new binary from two existing binaries using <>. You also make a great point about the cache hierarchy working better with many small binaries than a single behemoth binary.
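
As a small illustration of that builder-style approach (my own sketch; whether the VM’s binary-append optimization rescues the <> version depends on the exact code):

pieces = List.duplicate("chunk", 10_000)

# Builder-style: collect pieces, convert (or write) once at the end,
# so each byte is copied at most once (the StringBuilder analogue).
built = IO.iodata_to_binary(pieces)

# Concatenation-style: may repeatedly copy the growing accumulator.
concat = Enum.reduce(pieces, "", fn p, acc -> acc <> p end)

built == concat  #=> true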

The benchmarks should not be taken seriously, although they could be a useful starting point for anyone wanting to delve deeper.

2 Likes

Exactly! ^.^

As @OvermindDL1 writes, an IOList is “flattened” efficiently by the Erlang VM, so programmers do not have to take care of flattening the list themselves.

I later discovered that Plug.Conn.send_resp/3 takes the HTML body argument as an iolist, so I didn’t have to use :erlang.list_to_binary/1 anyway. I’ve also rewritten another short expression using <>. So now I’ve got rid of the :erlang function :smile:
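
Concretely, that means the partial strings can go straight through. A minimal sketch (conn and name here are just placeholders):

body = ["<html><body>Hello, ", name, "!</body></html>"]

# Plug.Conn.send_resp/3 accepts iodata for the body, so no
# :erlang.list_to_binary/1 is needed beforehand.
conn = Plug.Conn.send_resp(conn, 200, body)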

1 Like

Here’s one write-up of how IO lists can be helpful without flattening all the way down at the OS level: https://www.bignerdranch.com/blog/elixir-and-io-lists-part-2-io-lists-in-phoenix/

4 Likes