Is there a way to compare iolists for equality without converting them to binary?

idi527 · November 26, 2018, 5:38pm

Or is a custom function needed?

Something like (currently incorrect, doesn’t account for nested lists):

defmodule Comparer do
  @moduledoc """
  Equality comparison for iolists consisting of binaries.
  """

  @spec equal?(iolist, iolist) :: boolean
  def equal?(iolist1, iolist2)

  def equal?(iolist1, iolist2) do
    equal?(iolist1, iolist2, [])
  end

  defp equal?(
         [<<char, rest_bin1::bytes>> | rest_iolist1],
         [<<char, rest_bin2::bytes>> | rest_iolist2],
         []
       ) do
    equal?([<<rest_bin1::bytes>> | rest_iolist1], [<<rest_bin2::bytes>> | rest_iolist2], [])
  end

  defp equal?(
         [<<>> | rest_iolist1],
         [<<>> | rest_iolist2],
         []
       ) do
    equal?(rest_iolist1, rest_iolist2, [])
  end

  defp equal?(
         [<<>> | rest_iolist1],
         iolist2,
         []
       ) do
    equal?(rest_iolist1, iolist2, [])
  end

  defp equal?(
         iolist1,
         [<<>> | rest_iolist2],
         []
       ) do
    equal?(iolist1, rest_iolist2, [])
  end

  defp equal?(
         [<<char, rest_bin1::bytes>> | rest_iolist1],
         rest2,
         [<<char, rest_bin3::bytes>> | rest_buffer2]
       ) do
    equal?([<<rest_bin1::bytes>> | rest_iolist1], rest2, [<<rest_bin3::bytes>> | rest_buffer2])
  end

  defp equal?(
         [<<>> | rest_iolist1],
         rest2,
         [<<>> | rest_buffer2]
       ) do
    equal?(rest_iolist1, rest2, rest_buffer2)
  end

  defp equal?(
         [<<>> | rest_iolist1],
         rest2,
         buffer2
       ) do
    equal?(rest_iolist1, rest2, buffer2)
  end

  defp equal?([], [], []) do
    true
  end

  defp equal?(_, _, _) do
    false
  end
end

NobbZ · November 26, 2018, 6:31pm

Equality is easy, just use ==/2. But I fear it’s equivalency you want to check. And to be honest, I think flushing them into binary and compare equality on those should be the easiest thing.

idi527 · November 26, 2018, 6:37pm

Just benchmarked my incomplete implementation from above against IO.iodata_to_binary followed by ==, and my implementation was 40 times slower.

iolist1 = ["SELECT ", "a,", "b,", "c", " FROM somewhere", " WHERE a = 5"]
iolist2 = ["SELECT ", "a,", "b,", "c", " FROM somewhere WHERE a = 5"]

Benchee.run(%{
  "Comparer.equal?/2" => fn -> Comparer.equal?(iolist1, iolist2) end,
  "to binary and then compare" => fn ->
    IO.iodata_to_binary(iolist1) == IO.iodata_to_binary(iolist2)
  end
})

Benchmarking Comparer.equal?/2...
Benchmarking to binary and then compare...

Name                                 ips        average  deviation         median         99th %
to binary and then compare        1.40 M        0.71 μs ±10280.07%           1 μs           1 μs
Comparer.equal?/2               0.0342 M       29.28 μs   ±107.85%          23 μs         112 μs

Comparison:
to binary and then compare        1.40 M
Comparer.equal?/2               0.0342 M - 40.97x slower

hauleth · November 26, 2018, 8:34pm

@NobbZ already told you that equality != equivalence, but let see why you cannot compare iolists and why your comparator is terribly wrong.

See what Typespecs say about iolist type:

iolist() maybe_improper_list(byte() | binary() | iolist(), binary() | [])

But what is improper list?

So all of us get used to it that list in Erlang is head and tail or an empty list (yes empty list is special case of list). So when we do [a | b] = list then is_list(b) == true. Simple. But what when it isn’t? Nothing in Erlang prevents you from doing [1 | 2]. Yes this is proper syntax, yes it will compile, yes it is used. This is called improper list and your Comparer will fail on that.

Second thing is that iolist can be:

infinitely nested, so [[[[[["a"]]]]]] is also proper iolist that is equivalent to ["a"]
can contain raw bytes, so ["a", 97] is equivalent to ["aa"]
can contain Erlang strings, which are lists ['a'] == [[97]] == [[?a]] which is equivalent to ["a"]

So some hard examples for you to compare:

["foo", 'bar' | ?z] vs [["foo" | "bar"], "z"]
['a' | "ą"] vs ["aą"]
["aa", ?a, 'aa'] vs ["a", ?a, 'aa' | "a"]

idi527 · November 26, 2018, 8:42pm

Thanks for the detailed reply! There are certainly improvements possible but it seems like they are unnecessary since I’m unlikely to match IO.iodata_to_binary/1 in terms of performance. In my original question I just wondered if there is a BIF or NIF that I could use to compare the iolists for equality / equivalence (I still hope there is), the Comparer was used to better illustrate what I need form it and check if there is any performance gain to be had.

P.S. I only needed to compare what my implementation compares (no charlists, no integers / bytes). I had a more involved implementation that allowed for improper and arbitrary nested lists, but it also was quite slow, so I gave up. I have also not checked if any of the binary optimizations are used during compilation like maintaining the match context (I strongly doubt), so there’s that as well. If I had to actually implement it right now, I’d probably write a NIF.

hauleth · November 26, 2018, 9:28pm

If you write that NIF then you will see that it will still be slower unless you use writev(2) which will make it fast again. And when you use that you will just implement exactly what erlang:iolist_to_binary/1 is doing. You can read more about it there and there.

idi527 · November 26, 2018, 9:30pm

I don’t need to write the iolists anywhere, just to compare them which would possibly just amount to following pointers. But thanks again.

Also, couldn’t find any calls to writev in https://github.com/erlang/otp/blob/master/erts/emulator/beam/binary.c, only in the socket/file handling libs, so it doesn’t seem like erlang:iolist_to_binary/1 is using writev …

rvirding · November 26, 2018, 9:50pm

The reason converting the iolists to binaries and then doing and equals test is that these are both implemented in the VM in C while your code is in Elixir. No, will not go faster if you were to write it in Erlang. [*]

You can probably make your code more efficient when comparing binaries by directly comparing the longest leading chunk in one go instead of doing it byte by byte. And it is fun code to write as well. Another fun function is one that takes an iolist and returns the Nth byte.

[*] Of course the Erlang would be more beautiful.

idi527 · November 26, 2018, 9:52pm

I don’t actually want that, I wonder if there is a NIF or a BIF I could use to compare iolists. I’ve tried to search through erlang/otp repo, and couldn’t find anything, even though it seems like a common use case to me.

rvirding · November 26, 2018, 9:54pm

None that I know of. The thing to be aware of is that there is no real iolist data type, it is just a nested recursive structure which happens to be interpreted in a special way in some cases. The main one being that on output the BEAM will automagically flatten it and output the bytes. This means that in many case you don’t have to pay the price of flattening it.

That should be all now.