Where did the name "binaries" come from? And how does this relate to Base2?

fireproofsocks · February 28, 2020, 10:06pm

I’ve been diving deep into charlists, encodings, code points, etc… thank you to the community who have humored me in the forums and in Slack.

I have a question about Elixir’s use of the term “binary”. It seems like the term was bogarted… if I showed you a series of 1’s and 0’s, you’d think “Ah, that’s binary.” As in, that data is being represented in a binary format.

But nope, in Elixir, a binary is for all intents and purposes synonymous with a string in other languages. But in Elixir, we don’t take “binary” to mean a series of 1’s and 0’s, but rather, a binary means “a sequence of bytes” (i.e. yes, a sequence of 1’s and 0’s, but specifically, groups of 8 of them).

This has the weird side-effect of making us refer to ACTUAL series of 1’s and 0’s as “Base2” encoding, which isn’t supported in the core, but is available in a little-known package: https://hex.pm/packages/base2

Can someone enlighten me on this?

Thanks!

ityonemo · February 28, 2020, 10:14pm

which isn’t supported in the core

iex(1)> 0b0101
5

If you mean direct-to-“string” encoding, I’d say that’s because as a general rule, elixir’s core is restricted to “do you need it to build elixir”? Using direct string binary encoding is pretty rare in the wild.
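For what it's worth, here is a rough sketch of the direct-to-string direction using only the standard library, with no extra package. The variable name `bits` is just illustrative, and this is one possible approach rather than what the base2 package does internally.

```elixir
# Render an integer as base-2 digits, and parse such a string back.
digits = Integer.to_string(5, 2)
number = String.to_integer("101", 2)

# Render every bit of a binary as "0"/"1" characters
# using a bitstring comprehension that reads one bit at a time.
bits = for <<bit::1 <- "A">>, into: "", do: Integer.to_string(bit)

IO.inspect({digits, number, bits})
# {"101", 5, "01000001"}
```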

As for why we use the term binary, that’s because that’s what erlang chose. And, confusingly, there are bitstrings, which are like binaries, but more like what you are getting at, (or to be specific, binaries are a subset of bitstrings). At some point, you wind up asking questions, like, why do we call “asia” asia? Isn’t asia just turkey?

ityonemo · February 28, 2020, 10:20pm

Also, a subtlety in Elixir: By convention, a String is a binary of UTF-8 codepoints. It’s the oft-overlooked first line in the documentation. For example, a String probably shouldn’t have the byte <<0>> inside of it anywhere. You should (but nothing in runtime will complain if you don’t) use String.t only when <<0>> is not there and multibyte codepoints are to be counted as one character, and you should use binary to refer to a memory-contiguous collection of bytes that might contain a <<0>>, like an encrypted password, or packets coming off of the network, or stuff read directly off of disk.

Correspondingly, String functions may not work as expected under certain conditions or may have unexpected performance regressions if operating on data that is not UTF-8 formatted.

fireproofsocks · February 28, 2020, 10:56pm

a String is a binary of UTF-8 codepoints

I confess, I HATE that line in the documentation because it explains virtually nothing and worse, I think it’s misleading (or at least, brutally confusing). Full disclosure: I’m trying to come up with a PR for that page that will help clarify the confusion around this.

A codepoint, as far as I understand, is a positive integer that corresponds to some character or control code in some “code space” – in our case, the code space is defined as the Unicode characters. So a “binary” in Elixir could be said to be a list of utf8-encoded codepoints (because it’s not actually a list of code points, but rather utf8’s encoding of the code point numbers). The way the sentence currently reads only makes sense if you already have a firm grasp on code points and encodings.

Relatedly, there is the confusingly named (in my opinion) String.codepoints/1 function which returns NOT the integer numbers (i.e. the code points), but instead the individual characters. I would expect String.codepoints("cät") to output [99, 228, 116] and not simply ["c", "ä", "t"]

So String.codepoints("cät") does nothing for us that String.split("cät", "", trim: true) wouldn’t give us already. In other words, there seems to be a lack of agreement as to whether a code point is the NUMBER or the CHARACTER. (Also, there doesn’t appear to be agreement on how to write code point (codepoint?), but that’s less important).

Why did Erlang use the term “binary” when it had other connotations?

NobbZ · February 28, 2020, 11:03pm

bitstrings are everything that can be represented using <<>>
binaries are bitstrings whose length is a multiple of 8 bits
Strings are binaries whose bytes represent valid UTF-8 encoded codepoints.

First 2 terms are from Erlang

axelson · February 28, 2020, 11:20pm

You might realize this already, but that statement isn’t correct as-written since a binary is not a list and a binary can contain any sequence of bytes and does not need to be utf8-encoded.

fireproofsocks · February 29, 2020, 12:12am

I meant “list” here in the general sense. “String” (as in a string of fish) might make sense, but given that we’re trying to explain what exactly strings are, I avoided that term. It’s really hard to come up with a terse definition that is correct, educates well, and does not mislead. Maybe “series” is a better term? “Array”? “Ordered set”?

What’s the real difference between a charlist and a binary? Is it merely that a charlist’s elements MUST be codepoints? Whereas binaries can be any sequence of … bytes? Is THAT correct? Or are binaries any sequence of integers? How does Elixir (or Erlang?) know the size of the integers in a binary?

fireproofsocks · February 29, 2020, 1:08am

Related, I just published this package: https://hex.pm/packages/xray
It was born out of my explorations in binary / code-point land…

benwilson512 · February 29, 2020, 1:43am

Just to make sure we’re on the same page, charlists and binaries are entirely different data structures. Charlists are lists of integers, where each integer is a valid code point. A charlist is an actual list though, in an is_list(list) #=> true sense. Binaries are not lists, even a little bit.

A bitstring is a fundamental data type, denoted with the syntax <<>>. It is a contiguous sequence of bits in memory.

A binary is a byte aligned bitstring, which is to say it is a sequence of whole bytes. They can be any bytes at all.

A string is a binary where all the bytes form a valid UTF8 sequence.
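The three definitions above can be checked with guards and String.valid?/1. This is only a sketch: the module name `BitKind` and its `kind/1` function are made up for illustration and are not part of Elixir.

```elixir
defmodule BitKind do
  # Hypothetical helper: names the narrowest category a term falls into.
  def kind(term) when is_binary(term) do
    if String.valid?(term), do: :string, else: :binary
  end

  def kind(term) when is_bitstring(term), do: :bitstring
  def kind(_term), do: :other
end

IO.inspect(BitKind.kind(<<1::size(3)>>))  # 3 bits, not byte aligned => :bitstring
IO.inspect(BitKind.kind(<<255, 0>>))      # whole bytes, invalid UTF-8 => :binary
IO.inspect(BitKind.kind("c\u00E4t"))      # valid UTF-8 => :string
```

Note that every string also passes is_binary/1 and is_bitstring/1, since each category is a subset of the previous one.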

sribe · February 29, 2020, 1:47am

A charlist is really a list–a linked list; a binary is not, it is a contiguous block of binary data.
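A small sketch of that difference, assuming "cät" is written with the precomposed ä (U+00E4). The variable names are only for illustration.

```elixir
text = "c\u00E4t"                  # a binary holding the UTF-8 bytes of "cät"
chars = String.to_charlist(text)   # a linked list of code point integers

IO.inspect(chars, charlists: :as_lists)    # [99, 228, 116]
IO.inspect(text, binaries: :as_binaries)   # <<99, 195, 164, 116>>

IO.inspect({is_list(chars), is_binary(chars)})  # {true, false}
IO.inspect({is_list(text), is_binary(text)})    # {false, true}

# Three code points, but four bytes, because ä takes two bytes in UTF-8.
IO.inspect({length(chars), byte_size(text)})    # {3, 4}
```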

ityonemo · February 29, 2020, 2:02am

we haven’t even gotten to iolists versus iodata, and what is and isn’t allowed to terminate an improper iolist
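As a quick taste, and only as a sketch: iodata is either a binary or an iolist, and an iolist may be an improper list as long as its tail is a binary. A bare integer in the tail position is rejected. The variable names here are illustrative.

```elixir
# An improper list whose tail is a binary is valid iodata.
iodata = ["he", [?l, ?l] | "o"]
flattened = IO.iodata_to_binary(iodata)
size = IO.iodata_length(iodata)

# An improper list whose tail is a bare integer is not valid iodata.
result =
  try do
    IO.iodata_to_binary(["a" | ?b])
  rescue
    ArgumentError -> :not_iodata
  end

IO.inspect({flattened, size, result})
# {"hello", 5, :not_iodata}
```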

ityonemo · February 29, 2020, 2:07am

Elixir gives you the “i” command out of the box which might help, too:

iex(2)> i "ä"
Term
  "ä"
Data type
  BitString
Byte size
  2
Description
  This is a string: a UTF-8 encoded binary. It's printed surrounded by
  "double quotes" because all UTF-8 encoded code points in it are printable.
Raw representation
  <<195, 164>>
Reference modules
  String, :binary
Implemented protocols
  Collectable, IEx.Info, Inspect, List.Chars, String.Chars

ityonemo · February 29, 2020, 2:17am

Why did Erlang use the term “binary” when it had other connotations?

It’s literally the only data structure in the Erlang type system that lets you look at binary data under the hood exactly as it’s represented in memory (ok, that’s a bit of a lie, but let’s not miss the beautiful forest here for the dark patch of the endianness trees)
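To illustrate the endianness caveat with a minimal sketch: multi-byte integer segments are big-endian unless a modifier says otherwise, so the same two bytes can decode to different numbers. The variable names are illustrative.

```elixir
# Encoding the integer 1 into 16 bits.
big_bytes = <<1::size(16)>>              # default is big-endian
little_bytes = <<1::little-size(16)>>

IO.inspect({big_bytes, little_bytes})
# {<<0, 1>>, <<1, 0>>}

# Decoding the same two bytes both ways.
<<as_big::size(16)>> = <<1, 0>>
<<as_little::little-size(16)>> = <<1, 0>>

IO.inspect({as_big, as_little})
# {256, 1}
```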

al2o3cr · February 29, 2020, 2:40am

These two are not the same - String.split with an empty string splits on graphemes, which are a further unit bigger than codepoints.

The rules for clustering codepoints into graphemes are defined by the Unicode standards, and the code for handling them is generated from canonical text files.

iex(4)> String.codepoints("🇺🇸")
["🇺", "🇸"]
iex(5)> String.split("🇺🇸", "", trim: true)
["🇺🇸"]
iex(6)> "🇺🇸" <> <<0>>
<<240, 159, 135, 186, 240, 159, 135, 184, 0>>

The single displayed character is a grapheme, composed of two codepoints U+1F1FA and U+1F1F8, represented by 8 bytes.

Another way that codepoints and graphemes can diverge is combining characters; for instance, U+0308 is “Combining Diaeresis” which will add ¨ to the preceding character. Example:

iex(9)> s = "ca\u0308t"
"cät"
iex(10)> String.codepoints(s)
["c", "a", "̈", "t"]
iex(11)> String.split(s, "", trim: true)
["c", "ä", "t"]

(note that the combining character prints very oddly when isolated inside ")
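Putting numbers on the example above, as a sketch: the same string has three different "lengths" depending on whether you count bytes, codepoints, or graphemes. The variable name `s` follows the session above; `composed` is illustrative.

```elixir
s = "ca\u0308t"

IO.inspect(byte_size(s))                  # 5 (U+0308 takes two bytes in UTF-8)
IO.inspect(length(String.codepoints(s)))  # 4
IO.inspect(String.length(s))              # 3 (String.length/1 counts graphemes)

# String.graphemes/1 keeps the combining mark attached to its base character.
IO.inspect(String.graphemes(s) == ["c", "a\u0308", "t"])  # true

# NFC normalization composes a + U+0308 into the single code point U+00E4.
composed = String.normalize(s, :nfc)
IO.inspect(composed == "c\u00E4t")        # true
```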

josevalim · February 29, 2020, 10:44pm

FWIW, there is nowhere in the docs with that line. If there were, it would indeed be incorrect.

Other than that, you are right, codepoints are integers. String.codepoints returns codepoints as UTF-8 encoded binaries, i.e. codepoints as strings. The precise definition of this would be “code unit” but we wanted to avoid introducing yet another term. I have updated the docs to make it clear that String.codepoints returns an encoded representation, not integers. Thanks.

ityonemo · March 1, 2020, 12:15am

I apologize, I paraphrased what was running through my head slightly inaccurately (since codepoint numbers don’t exactly go to bytes exactly as one might expect). The first line in the documentation says:

Strings in Elixir are UTF-8 encoded binaries.

Just wanted to make sure that that is, indeed correct.

benwilson512 · March 1, 2020, 3:30am

This is indeed correct.

fireproofsocks · March 3, 2020, 6:00pm

This is a great example – very educational.

How can you match on the \u0308t? The following results in a match error:

<<x::utf8>> = "\u0308t"

But String.valid?/1 and String.printable?/2 both return true.

Are there other unicode characters that cannot be matched? And how can we deal with parsing them?

NobbZ · March 3, 2020, 6:03pm

You are matching on a binary that has a length of a single codepoint, though "\u0308t" has a length of 2 codepoints. The trailing t is just that, a literal t in the string; it’s not part of the \u escape sequence.
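A sketch of two patterns that do match this string: either capture the remainder with a trailing binary segment, or name one utf8 segment per codepoint. The variable names are illustrative.

```elixir
# Take the first codepoint and keep whatever follows as a binary.
<<first::utf8, rest::binary>> = "\u0308t"
IO.inspect({first, rest})
# {776, "t"}   (776 is 0x0308)

# Or match both codepoints explicitly.
<<mark::utf8, letter::utf8>> = "\u0308t"
IO.inspect({mark, letter})
# {776, 116}
```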

fireproofsocks · March 3, 2020, 6:08pm

Do you feel it would make sense to have an option for String.codepoints/1 so it could return integers (or hex representations)? Part of the confusion for me is having this function that doesn’t exactly return what its name would suggest. We say “codepoints are integers”, and then we immediately equate them with UTF-8 encoded binaries – it’s no wonder this confuses so many people.

Something like the following would make a bit more sense to me, I think it would better communicate what’s going on:

iex> String.codepoints("cat", as: :binaries)
["c", "a", "t"]
iex> String.codepoints("cat", as: :integers)
[99, 97, 116]
iex> String.codepoints("cat", as: :hex)
["0063", "0061", "0074"]
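For comparison, the `as:` option above is only a proposal and does not exist. A sketch of how the same three results can be produced with functions that do exist today (the variable names are illustrative):

```elixir
s = "cat"

# Codepoints as UTF-8 encoded strings (what String.codepoints/1 returns).
as_binaries = String.codepoints(s)

# Codepoints as integers: a charlist is exactly this.
as_integers = String.to_charlist(s)

# The same integers via a utf8 comprehension over the binary.
also_integers = for <<cp::utf8 <- s>>, do: cp

# Codepoints as zero-padded, upper-case hex strings.
as_hex =
  for cp <- as_integers do
    cp |> Integer.to_string(16) |> String.pad_leading(4, "0")
  end

IO.inspect(as_binaries)                        # ["c", "a", "t"]
IO.inspect(as_integers, charlists: :as_lists)  # [99, 97, 116]
IO.inspect(as_hex)                             # ["0063", "0061", "0074"]
```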