Nice work! It’s quite thorough and could even help users of other languages understand how Unicode, UTF-8, and binary data work.
Nitpicks:
The binary-size(n) modifier will match n characters in a binary
This should probably read: will match n bytes
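The distinction matters as soon as multi-byte characters show up. A quick sketch of my own (not from the article) showing that the count is in bytes:

```elixir
# "é" is two bytes in UTF-8 (0xC3 0xA9), so binary-size(2) grabs "h"
# plus only the FIRST byte of "é" — a raw byte slice, not two characters.
<<first::binary-size(2), rest::binary>> = "héllo"
first  # => <<104, 195>> (not a valid UTF-8 string on its own)
rest   # => <<169, 108, 108, 111>>
```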
so why does Elixir refer to its strings as “binaries”?
This section somewhat walks back the things taught before it. Elixir doesn’t call its strings binaries; strings just happen to be binaries formed in a certain way (i.e. UTF-8-encoded data).
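To illustrate that relationship with a small sketch of my own: a string passes the binary checks, and its byte count differs from its character count as soon as the encoding uses multi-byte sequences:

```elixir
s = "hełło"
is_binary(s)      # => true — the string IS a binary
byte_size(s)      # => 7 ("ł" takes two bytes in UTF-8)
String.length(s)  # => 5 — characters, decoded from the UTF-8 bytes
```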
For me the simplest reason binaries are called binaries is that they are blobs of raw data. Compare the term BLOB (binary large object) used in e.g. databases, the “binary format” in FTP, or the binary-vs-text-file distinction in many filesystem tools – it’s not a foreign term in this context. A binary is untagged data: without knowing beforehand what it represents, you cannot use reflection or inspection to determine what it really is. Maybe this doesn’t match what the language authors had in mind, but I don’t see why it should be more complicated than that.
I was not there when binaries were introduced into Erlang, but what I heard is that they were first used to store compiled Erlang code (for JAM at that time). That was necessary in order to implement erl_boot_server that allows a diskless client to fetch code to be loaded from another node.
The idea of using binaries to store text strings came much later. Originally, all binaries were reference-counted, which is practical for storing a relatively small number of large binaries, but not for storing many small ones. OTP R7 introduced the bit syntax and heap binaries. (See How binaries are implemented for more information.)
I think we called it a binary just to set it apart from all other Erlang terms. It is just a collection, or array, of raw bytes to which we add meaning. If the number of bits is not divisible by 8, then it is a bitstring. Again, we are the ones who add the meaning, not Erlang/Elixir/BEAM.
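The binary/bitstring split is easy to see with the built-in predicates (a sketch of my own, not from the post above):

```elixir
bits = <<1::4>>      # 4 bits — not divisible by 8
is_bitstring(bits)   # => true
is_binary(bits)      # => false

bytes = <<1, 2, 3>>  # 24 bits — divisible by 8
is_binary(bytes)     # => true
is_bitstring(bytes)  # => true — every binary is also a bitstring
```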
Remember that in Erlang we mostly represent strings as lists of integers. Again, we put the meaning into those integers, whether they are ASCII/Latin-1/Unicode code points. We NEVER make lists of UTF-8/16/32-encoded characters – what would be the point? There was a good talk on this at CodeBeamSF by Marc Sugiyama, “Unicode, Charsets, Strings, and Binaries”. I don’t know when it will be online.
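The contrast is visible from Elixir too (my own sketch): a charlist holds code points as plain integers, while a string holds the UTF-8 encoding of those code points as bytes:

```elixir
# Erlang-style string: a list of integer code points, not UTF-8 bytes.
~c"hełło" == [104, 101, 322, 322, 111]  # => true ("ł" is code point 322)

# The Elixir string stores the UTF-8 *encoding* instead:
"hełło" == <<104, 101, 197, 130, 197, 130, 111>>  # => true
```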