I’ve been diving deep into charlists, encodings, code points, etc… thank you to the community who have humored me in the forums and in Slack.
I have a question about Elixir’s use of the term “binary”. It seems like the term was bogarted… if I showed you a series of 1’s and 0’s, you’d think “Ah, that’s binary.” As in, that data is being represented in a binary format.
But nope, in Elixir, a binary is for all intents and purposes synonymous with a string in other languages. We don’t take “binary” to mean a series of 1’s and 0’s; rather, a binary means “a sequence of bytes” (i.e. yes, a sequence of 1’s and 0’s, but specifically, groups of 8 of them).
This has the weird side-effect of making us refer to ACTUAL series of 1’s and 0’s as “Base2” encoding, which isn’t supported in the core, but is available in a little-known package: https://hex.pm/packages/base2
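For what it's worth, here is a minimal sketch of getting at the "actual 1's and 0's" without any package, using a bitstring generator from the core language. The sample value "hi" and the variable names are just assumptions for illustration; this is not how the base2 package is implemented.

```elixir
# Assumption: "hi" is an arbitrary sample value.
data = "hi"

# A bitstring generator walks the binary one bit at a time.
bits = for <<bit::1 <- data>>, into: "", do: Integer.to_string(bit)
IO.puts(bits)

# Round trip: parse the digits as one base-2 integer and pack it back
# into a bitstring of the same width.
width = String.length(bits)
decoded = <<String.to_integer(bits, 2)::size(width)>>
IO.inspect(decoded)
```

Because the width is carried along explicitly, leading zero bits survive the round trip.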
If you mean direct-to-“string” encoding, I’d say that’s because as a general rule, Elixir’s core is restricted to “do you need it to build Elixir?” Using direct string binary encoding is pretty rare in the wild.
As for why we use the term binary, that’s because that’s what Erlang chose. And, confusingly, there are bitstrings, which are like binaries, but more like what you are getting at (or to be specific, binaries are a subset of bitstrings). At some point, you wind up asking questions like, why do we call “Asia” Asia? Isn’t Asia just Turkey?
Also, a subtlety in Elixir: By convention, a String is a binary of UTF-8 codepoints. It’s the oft-overlooked first line in the documentation. For example, a String probably shouldn’t have the byte <<0>> inside of it anywhere. You should (but nothing in runtime will complain if you don’t) use String.t only when <<0>> is not there and multibyte codepoints are to be counted as one character, and you should use binary to refer to a memory-contiguous collection of bytes that might contain a <<0>>, like an encrypted password, or packets coming off of the network, or stuff read directly off of disk.
Correspondingly, String functions may not work as expected under certain conditions or may have unexpected performance regressions if operating on data that is not UTF-8 formatted.
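A small sketch of the String-versus-binary distinction, using only core functions. The sample values are assumptions chosen for illustration. Note that the runtime itself treats <<0>> as valid UTF-8, so avoiding it in a String.t is purely a convention.

```elixir
text = "\u00E4"        # "ä" as one precomposed code point
raw = <<0xFF, 0xFE>>   # hypothetical raw bytes, e.g. read off the network

IO.inspect(byte_size(text))       # counts bytes
IO.inspect(String.length(text))   # counts characters (graphemes)
IO.inspect(String.valid?(text))   # true: well-formed UTF-8
IO.inspect(String.valid?(raw))    # false: 0xFF never appears in UTF-8
IO.inspect(String.valid?(<<0>>))  # true: NUL is a valid code point
```

So byte_size/1 is the safe choice for arbitrary binaries, while String.length/1 only gives a meaningful answer for UTF-8 data.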
I confess, I HATE that line in the documentation because it explains virtually nothing and worse, I think it’s misleading (or at least, brutally confusing). Full disclosure: I’m trying to come up with a PR for that page that will help clarify the confusion around this.
A codepoint, as far as I understand, is a positive integer that corresponds to some character or control code in some “code space” – in our case, the code space is defined as the Unicode characters. So a “binary” in Elixir could be said to be a list of utf8-encoded codepoints (because it’s not actually a list of code points, but rather utf8’s encoding of the code point numbers). The way the sentence currently reads only makes sense if you already have a firm grasp on code points and encodings.
Relatedly, there is the confusingly named (in my opinion) String.codepoints/1 function which returns NOT the integer numbers (i.e. the code points), but instead the individual characters. I would expect String.codepoints("cät") to output [99, 228, 116] and not simply ["c", "ä", "t"]
So String.codepoints("cät") does nothing for us that String.split("cät", "", trim: true) wouldn’t give us already. In other words, there seems to be a lack of agreement as to whether a code point is the NUMBER or the CHARACTER. (Also, there doesn’t appear to be agreement on how to write code point (codepoint?), but that’s less important.)
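For anyone who wants the integers today, here is a sketch of two ways to get them with existing core functions, next to what String.codepoints/1 returns. The variable names are assumptions for illustration. The last two lines show one case where code points and characters (graphemes) do not line up, namely a decomposed "é".

```elixir
word = "c\u00E4t"   # "cät", with a precomposed "ä"

as_strings = String.codepoints(word)
as_integers = String.to_charlist(word)
via_generator = for <<cp::utf8 <- word>>, do: cp

IO.inspect(as_strings)
# charlists: :as_lists forces the integers to be shown as numbers
IO.inspect(as_integers, charlists: :as_lists)
IO.inspect(via_generator, charlists: :as_lists)

# "e" followed by a combining acute accent: two code points, one grapheme.
decomposed = "e\u0301"
IO.inspect(length(String.codepoints(decomposed)))
IO.inspect(length(String.graphemes(decomposed)))
```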
Why did Erlang use the term “binary” when it had other connotations?
I meant “list” here in the general sense. “String” (as in a string of fish) might make sense, but given that we’re trying to explain what exactly strings are, I avoided that term. It’s really hard to come up with a terse definition that is correct, educates well, and does not mislead. Maybe “series” is a better term? “Array”? “Ordered set”?
What’s the real difference between a charlist and a binary? Is it merely that a charlist’s elements MUST be codepoints? Whereas binaries can be any sequence of … bytes? Is THAT correct? Or are binaries any sequence of integers? How does Elixir (or Erlang?) know the size of the integers in a binary?
Just to make sure we’re on the same page, charlists and binaries are entirely different data structures. Charlists are lists of integers, where each integer is a valid code point. A charlist is an actual list though, in an is_list(list) #=> true sense. Binaries are not lists, even a little bit.
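A quick sketch of that difference, plus a partial answer to the "how does it know the size of the integers" question: in a binary, each integer segment is 8 bits unless a size is given. The sample values are assumptions for illustration.

```elixir
charlist = ~c"abc"
binary = "abc"

IO.inspect(is_list(charlist))          # true
IO.inspect(is_binary(charlist))        # false
IO.inspect(is_list(binary))            # false
IO.inspect(is_binary(binary))          # true

# Same characters, different containers.
IO.inspect(charlist == [97, 98, 99])
IO.inspect(binary == <<97, 98, 99>>)
IO.inspect(List.to_string(charlist) == binary)
IO.inspect(String.to_charlist(binary) == charlist)

# Integer segments default to 8 bits; larger values are truncated
# unless a wider size is requested.
IO.inspect(<<256>> == <<0>>)
IO.inspect(<<256::16>> == <<1, 0>>)
```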
A bitstring is a fundamental data type, denoted with the syntax <<>>. It is a contiguous sequence of bits in memory.
A binary is a byte aligned bitstring, which is to say it is a sequence of whole bytes. They can be any bytes at all.
A string is a binary where all the bytes form a valid UTF-8 sequence.
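The three definitions above can be checked with guards from the core language. This is only a sketch; the sample values are assumptions picked to land in each category.

```elixir
three_bits = <<5::size(3)>>   # a bitstring that is not byte aligned
bytes = <<255, 0, 128>>       # a binary that is not valid UTF-8
string = "hello"              # a binary that is valid UTF-8

IO.inspect(bit_size(three_bits))      # 3
IO.inspect(is_bitstring(three_bits))  # true
IO.inspect(is_binary(three_bits))     # false: not a whole number of bytes

IO.inspect(is_bitstring(bytes))       # true: every binary is a bitstring
IO.inspect(is_binary(bytes))          # true
IO.inspect(String.valid?(bytes))      # false

IO.inspect(is_binary(string))         # true
IO.inspect(String.valid?(string))     # true
```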
Elixir gives you the “i” command out of the box which might help, too:
iex(2)> i "ä"
This is a string: a UTF-8 encoded binary. It's printed surrounded by
"double quotes" because all UTF-8 encoded code points in it are printable.
Collectable, IEx.Info, Inspect, List.Chars, String.Chars
Why did Erlang use the term “binary” when it had other connotations?
It’s literally the only data structure in the Erlang type system that lets you look at binary data under the hood exactly as it’s represented in memory (ok, that’s a bit of a lie, but let’s not miss the beautiful forest here for the dark patch of the endianness trees).
FWIW, there is nowhere in the docs with that line. If there were, it would indeed be incorrect.
Other than that, you are right, codepoints are integers. String.codepoints returns codepoints as UTF-8 encoded binaries, i.e. codepoints as strings. The precise definition of this would be “code unit” but we wanted to avoid introducing yet another term. I have updated the docs to make it clear that String.codepoints returns an encoded representation, not integers. Thanks.
I apologize, I paraphrased what was running through my head slightly inaccurately (since codepoint numbers don’t map to bytes exactly as one might expect). The first line in the documentation says:
Strings in Elixir are UTF-8 encoded binaries.
Just wanted to make sure that that is, indeed correct.
You are matching on a binary that has a length of a single codepoint, though "\u0308t" has a length of 2 codepoints. The trailing t is meant as just that, a literal t in the string; it’s not part of the \u escape sequence.
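The code being matched is not shown here, so the following is only a sketch under the assumption that the pattern was a single utf8 segment. It shows why such a pattern cannot match a two-codepoint binary, and one way to take the first codepoint and keep the rest.

```elixir
text = "\u0308t"   # combining diaeresis followed by a literal "t"

# Assumed shape of the failing pattern: exactly one code point, nothing else.
whole_match = match?(<<_cp::utf8>>, text)
IO.inspect(whole_match)

# Matching the first code point and binding the remainder succeeds.
<<first::utf8, rest::binary>> = text
IO.inspect(first)
IO.inspect(rest)
```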
Do you feel it would make sense to have an option for String.codepoints/1 so it could return integers (or hex representations)? Part of the confusion for me is having this function that doesn’t exactly return what its name would suggest. We say “codepoints are integers”, and then we immediately equate them with UTF-8 encoded binaries. It’s no wonder this confuses so many people.
Something like the following would make a bit more sense to me, I think it would better communicate what’s going on: