Do I have these string definitions right?

JEG2 · July 30, 2016, 4:16pm

Elixir’s terminology related string data gets a little twisty. Do I properly understand these terms?

binary: A <<…>> construct with a size divisible by 8
bitstring: A <<…>> construct with a size not divisible by 8
charlist: A proper or improper list of numeric codepoints, binaries, and/or nested charlists
chardata: A charlist or a binary
iolist: A proper or improper list of numeric bytes, binaries, and/or nested iolists
iodata: An iolist or a binary

I think I’m the least sure about the term charlist. It seems to be defined two ways in Erlang: as a list of codepoints or the definition I used above. If you can share any insight into this duality, you are my new best friend!

Qqwy · July 30, 2016, 7:19pm

I am fairly certain that charlist is supposed to just mean ‘a proper or improper list of numeric codepoints’. iolist should on the other hand be able to contain charlists as well. I think the second Erlang documentation page has confused these definitions.

As far as I know, in Elixir we always refer to ‘a proper or improper list of numeric codepoints’ as a charlist:

iex> i 'foo' 
Term
  'foo'
Data type
  List
Description
  This is a list of integers that is printed as a sequence of characters
  delimited by single quotes because all the integers in it represent valid
  ASCII characters. Conventionally, such lists of integers are referred to as
  "charlists" (more precisely, a charlist is a list of Unicode codepoints,
  and ASCII is a subset of Unicode).
Raw representation
  [102, 111, 111]
Reference modules
  List

So:

charlist: ‘a proper or improper list of numeric codepoints’.
iolist: 'a proper or improper list of charlists, binaries and/or nested iolists.

Most of the other definitions are spot-on, with the exception of binary and bitstring:

binary is anything written like <<..>>, e.g. a chunk of zero or more bits of data.
bitstring is used somewhat arbitrarily inside the Elixir, but any binary that is not a string is a bitstring. See the information in IEx: i <<1::size(1)>>
string is to be meant ‘a bitstring with a size divisible by 8 in which all codepoints and their order conforms to the UTF-8 standard’.

The set of strings is a subset of the set of bitstrings. All bitstrings are binaries and vice-versa. People however often use string to mean chardata, and bitstring to mean string, to explain to newcomers how strings work in Elixir.

Naming things truly is one of the hardest things programmers face.

JEG2 · July 30, 2016, 9:48pm

I don’t think the non-Unicode definition of charlist allows for improper lists. See the definition of string() here and charlist() here. iex doesn’t flag them as such:

iex(1)> i [97 | 98]
Term
  [97 | 98]
Data type
  List
Reference modules
  List

However, Elixir has at least one use of chardata and we can see from that documentation that it’s based on the more complex idea of a charlist.

I don’t think this is right. Dialyzer does not allow <<1::size(1)>> to pass as a binary, for example. Here’s some code:

defmodule TypePlayground do
  @spec test(binary) :: binary
  def test(b) do
    b
  end

  def start do
    test(<<1::size(1)>>)
  end
end

Here’s the complaint from Dialyzer:

$ mix dialyzer
Compiling 1 file (.ex)
Starting Dialyzer
dialyzer --no_check_plt --plt /Users/jeg2/.dialyxir_core_19_1.3.2.plt -Wunmatched_return
s -Werror_handling -Wrace_conditions -Wunderspecs /Users/jeg2/Desktop/type_playground/_b
uild/dev/lib/type_playground/ebin                                                      
  Proceeding with analysis...
type_playground.ex:7: Function start/0 has no local return
type_playground.ex:8: The call 'Elixir.TypePlayground':test(<<_:1>>) breaks the contract
 (binary()) -> binary()                                                                
 done in 0m1.59s
done (warnings were emitted)

sinetris · July 31, 2016, 2:39am

Hi James, mostly correct

Tl;dr all correct but:

bitstring can be divisible by 8
charlist in Erlang is like iolist but for unicode; in Elixir is a list of numeric representation of unicode chars.

Extended version

A bitstring is a sequence of zero or more bits, where the number of bits does not need to be divisible by 8. If the number of bits is divisible by 8, the bitstring is also a binary. (From Erlang bit syntax doc.)

In Erlang a list of codepoints is called a string. The confusion originate from the fact that Erlang strings are similar to char list in Elixir.

'olé' # [111, 108, 233]

and Elixir strings are binary encoded in UTF8.

<<"olé">> === "olé" # true

The Erlang charlist you linked is the unicode:charlist()
An explanetion for charlist with an example:

[8364, <<" is the euro sign">>] |> to_string # "€ is the euro sign"

Erlang iolist is almost the same as Erlang charlist but with the limitation to 8bit. So the previous example is not a valid iolist but this one is:

[111, 108, 233, <<"olé">>] |> to_string # "oléolé"

I hope this clarify something, or at least don’t create more confusion

Duilio

rvirding · July 31, 2016, 11:21am

In Erlang we generally call a list of code points a string. If you check out about a third of the way down in your first reference you will see that the type string() is defined as [char()].

JEG2 · July 31, 2016, 3:01pm

Thanks. This was super helpful!

josevalim · August 1, 2016, 9:08pm

Great answer!

I would just like to add that this is basically why we chose strings to be represented as binaries instead of lists. If you see a list of integers, you don’t know if the integers are meant to represent bytes (iolist/iodata) or unicode codepoints (charlist/chardata). With binaries, because you have the underlying representation, you don’t have this dichotomy: it is what it is.

nathanl · August 2, 2016, 4:00pm

With binaries, because you have the underlying representation, you don’t have this dichotomy: it is what it is.

I don’t think I understand this comment, but I really want to.

As a concrete example: [97, 98] can be interpreted as 'ab', or it can be interpreted as two integers. The same is true for <<97, 98>>, right? It seems like either way, you have data that definitely can’t represent a string (like [0,97] or <<0,97>>, because it would be invalid) and data that could be interpreted as a string or not.

I must be missing something.

josevalim · August 2, 2016, 6:10pm

@nathanl that’s a very good question.

You are correct. When you are processing a binary, you need to know what is the encoding of the data in the binary. If I have functions that expect a UTF-8 encoded binary and I pass a UTF-16 binary, the results would be incorrect. That’s similar to a list. If I have a list, I need to know what the integers represent.

My comment was exclusive to sending/writing the data. Because the data in a binary is already encoded, functions that write to files, sockets, io, etc do not care about the encoding of the binary because the data is already encoded. That’s a contrast to lists where I need to know what the integers in them represent to properly send/write the data.

josevalim · August 2, 2016, 6:12pm

I also would like to add two more things:

<<0, 97>> is a valid string (and by string I mean a UTF-8 encoded binary)
The UTF-8 is self-synchronizing. This means that if you see some bad bytes you can discard them (or keep them or replace them by <?>) and move to process the upcoming bytes as usual

CharlesO · August 3, 2016, 2:45am

"hello" === <<104,101,108,108,111>> this exactly gets sent on the wire

'hello' === [104, 101, 108, 108, 111] what gets sent on the wire here ?

josevalim · August 3, 2016, 7:03am

It depends on what the integers represent and what the io device is configured to write. That’s the whole point: you don’t know what will be sent without more information. In this case, while both bytes and UTF-8 representation would be the same, if they are meant to be sent as UTF-16 binaries, it would be 0, 104, 0, 101, 0, 108, 0, 108, 0, 111.

rvirding · August 3, 2016, 10:32pm

You do have a similar problem with binary strings. If they are meant to be sent as UTF-16 then the binary string will be wrong as it is in UTF-8. You always need to keep track of what you have and what you need. UTF-8 is a common way though.

josevalim · August 3, 2016, 10:47pm

The application needs to know what it is sending but the layer responsible for the sending itself, say gen_tcp, doesn’t need to know what it is. That’s impossible to do for lists without assuming or asking for more information.

rvirding · August 4, 2016, 11:39pm

Yes, the application does need to know what it is sending, but this is the same in both cases. Irrespective of whether it is a list string or a binary string it has to be in the right format, the bytes in it have to be correct.

My point is just that this is not really a reason for either representation.

Qqwy · August 5, 2016, 6:25am

Here are some more reasons why Binaries are a better candidate for string storage than Lists:

Binaries are guaranteed to only ever contain integers. Lists do not have this guarantee. This is the reason that many of the following optimizations are possible for Binaries.
Lists are ± 4x as large: Per list element, a pointer to the current element and a pointer to the tail of the element are kept, as well as the element itself, of course. Binaries, on the other hand, only contain their byte length, and then a sequence of that amount of (byte-size) integers.
Changing something in a list is slower: When an element in a list is changed, all elements in front of it have to be re-built. With binaries, a chunk of memory can just be copied, which probably means a reduction from linear time to constant time for both modification as well as concatenation.
Binaries larger than 42 codepoints are managed differently by the BEAM to prevent unnecessary copying. Lists do not have any special feature for this, and are always sent in their entirety to another process, even if they are very large.
Access to either a single byte or a whole subsequence of a Binary can be obtained in constant-time, whereas for lists this can only happen in linear time, as you don’t know how long the list is nor how to reach elements further at the back before you traverse it.

OvermindDL1 · August 5, 2016, 2:24pm

The whole ‘head’ has to be rebuilt; as they are cons pairs anything after the changed element can just be pointed to again verbatim, everything before has to be rebuilt.

Qqwy · August 5, 2016, 3:04pm

Thanks! Astutely observed. I have edited the post with this fix.

rvirding · August 6, 2016, 11:37pm

Actually not quite:

While the binaries may only contain integers they can contain any integers so there is no guarantee that they contain proper utf-8 encoded codepoints which Elixir strings are defined to be.

This is generally true although the difference is not that great as a utf-8 encoded character can be 1-4 bytes large.

You would think so, but there was a post a few years ago on the erlang mailing list where a guy had measured doing string operations on list strings and binary strings and found that in many cases they were faster on list strings. This surprised many. The reason for this was that most operations on binary strings resulted in much more copying than on list strings.

A pathological example of this is adding or removing one character to the front of a string which results in the whole binary string being copied.

While this is true for plain binaries it does not hold for utf-8 encoded binary strings. As each codepoint can be 1-4 bytes long you still have to traverse the binary string to access a single character. A list string has the benefit that you do not need to look at each character as you traverse it.

So actually it is much more habit which determines what you choose. In some ways the “best” is a nested list/characters/binary structure which can be very fast to work with. Concatenating string1 and string2 can be done with [string1|string2]. The code can be a bit tricky though.

Qqwy · August 7, 2016, 12:37am

Thank you very much for clearing those things up, @rvirding!

Yes, this is true. Not knowing if a binary is a proper string or not before looping through all elements makes inspecting the string a little slower.

What binaries do enforce, is that all values inside are bytes, so ‘any’ integers is not entirely true. If you want to put integers larger than 255 inside, you need to specify the size using i.e. <<1,2, 257::size(8)-unit(2), 3>>. If you want to put signed integers, floats, etc. inside, (or want to read bytes stored in that format out to that format), you need to specify them using a size formatter as well.

Lists on the other hand might contain any integers, including bignums. As well as any other value, of course.

if I correctly understand how UTF-8 list strings are built, a four-byte UTF-8 encoded character will be represented in list string format as four separate elements as well. Or am I wrong?

Your other two arguments (copying measured as taking a long time, codepoint-access being linear for UTF-8 strings regardless of format) are indeed very true. I would love to see the work of the person who measured the performance, just to see in what cases one is better over the other.

The nested tree-like structure that e.g. IOlists provide indeed is nice when appending. However, accessing elements in a created IOlist then again takes longer, because we need to do a dept-first traversal of the resulting tree (having the non-terminal nodes as extra overhead over a normal linear data structure). Each has its uses…