Nice work! It’s quite thorough and could even help users of other languages understand how Unicode, UTF-8, and binary data work.
Nitpicks:
The binary-size(n) modifier will match n characters in a binary
This should probably read: will match n bytes
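The distinction matters as soon as multi-byte characters show up. A quick sketch of my own (not from the article) showing that the count is in bytes:

```elixir
# "é" is two bytes in UTF-8 (0xC3 0xA9), so binary-size(2) grabs "h"
# plus only the FIRST byte of "é" — a raw byte slice, not two characters.
<<first::binary-size(2), rest::binary>> = "héllo"
first  # => <<104, 195>> (not a valid UTF-8 string on its own)
rest   # => <<169, 108, 108, 111>>
```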
so why does Elixir refer to its strings as “binaries”?
This section somewhat walks back the things taught before it. Elixir doesn’t call its strings binaries; strings just happen to be binaries formed in a certain way (i.e. UTF-8-encoded data).
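To illustrate that relationship with a small sketch of my own: a string passes the binary checks, and its byte count differs from its character count as soon as the encoding uses multi-byte sequences:

```elixir
s = "hełło"
is_binary(s)      # => true — the string IS a binary
byte_size(s)      # => 7 ("ł" takes two bytes in UTF-8)
String.length(s)  # => 5 — characters, decoded from the UTF-8 bytes
```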
For me the simplest reason binaries are called binaries is that they are blobs of raw data. Compare the term BLOB (binary large object) used in e.g. databases, the “binary format” in FTP, or the binary-vs-text-file distinction in many filesystem tools – it’s not a foreign term in this context. A binary is untagged data: without knowing beforehand what it represents, you cannot use reflection or inspection to determine what it really is. Maybe this doesn’t match what the language authors had in mind, but I don’t see why it should be more complicated than that.
I was not there when binaries were introduced into Erlang, but what I heard is that they were first used to store compiled Erlang code (for JAM at that time). That was necessary in order to implement erl_boot_server that allows a diskless client to fetch code to be loaded from another node.
The idea of using binaries to store text strings came much later. Originally, all binaries were reference-counted, which is practical for storing a relatively small number of large binaries, but not for storing many small ones. OTP R7 introduced the bit syntax and heap binaries. (See How binaries are implemented for more information.)
I think we called it a binary just to set it apart from all other Erlang terms. It is just a collection, or array, of raw bytes to which we add meaning. If the number of bits is not divisible by 8, then it is a bitstring. Again, we are the ones who add the meaning, not Erlang/Elixir/BEAM.
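The binary/bitstring split is easy to see with the built-in predicates (a sketch of my own, not from the post above):

```elixir
bits = <<1::4>>      # 4 bits — not divisible by 8
is_bitstring(bits)   # => true
is_binary(bits)      # => false

bytes = <<1, 2, 3>>  # 24 bits — divisible by 8
is_binary(bytes)     # => true
is_bitstring(bytes)  # => true — every binary is also a bitstring
```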
Remember that in Erlang we mostly represent strings as lists of integers. Again, we put the meaning into those integers, whether they are ASCII/Latin-1/Unicode code points. We NEVER make lists of UTF-8/16/32-encoded characters – what would be the point? There was a good talk on this at CodeBeamSF by Marc Sugiyama, “Unicode, Charsets, Strings, and Binaries”. I don’t know when it will be online.
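The contrast is visible from Elixir too (my own sketch): a charlist holds code points as plain integers, while a string holds the UTF-8 encoding of those code points as bytes:

```elixir
# Erlang-style string: a list of integer code points, not UTF-8 bytes.
~c"hełło" == [104, 101, 322, 322, 111]  # => true ("ł" is code point 322)

# The Elixir string stores the UTF-8 *encoding* instead:
"hełło" == <<104, 101, 197, 130, 197, 130, 111>>  # => true
```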