Why do macro sigils accept a bitstring literal instead of a string literal?

garrison · September 17, 2023, 7:24pm

The first time I tried to create a custom (macro) sigil, I assumed the first argument would be a string as I knew string literals represented themselves in the AST. Of course, a quick look into the Elixir source reveals this is not the case; sigils accept a strange tuple as their first argument. I simply copied this without knowing what was actually going on.

Today, I was trying to write a sigil to hijack Phoenix’s HEEx as part of a silly hack that will no doubt warrant its own post in the near future. As a result of that, I spent some time trying to figure out how sigils actually work.

My Elixir metaprogramming knowledge is unfortunately not the best, but as I understand it, this tuple:

{:<<>>, [], ["hello world"]}

is the AST representation of a bitstring literal.

So when we write a custom sigil:

defmacro sigil_H({:<<>>, _meta, [expr]}, []) do
  # whatever
end

That custom sigil accepts a bitstring literal.
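As a minimal sketch of such a macro in use: the module names ShoutSigil and ShoutDemo and the ~Q sigil below are made up for illustration and are not part of Elixir or Phoenix. It assumes an uppercase sigil, which does not interpolate, so the tuple holds exactly one binary.

```elixir
defmodule ShoutSigil do
  # Hypothetical sigil: ~Q"..." upcases its contents at compile time.
  # The first argument is the AST of a bitstring literal, not a bare string,
  # so we destructure the {:<<>>, meta, parts} tuple to reach the binary.
  defmacro sigil_Q({:<<>>, _meta, [string]}, []) when is_binary(string) do
    # A binary is valid AST on its own, so it can be returned directly.
    String.upcase(string)
  end
end

defmodule ShoutDemo do
  import ShoutSigil

  # Expanded at compile time into the literal "HELLO WORLD".
  def greeting, do: ~Q"hello world"
end

"HELLO WORLD" = ShoutDemo.greeting()
```

The match on the last line raises a MatchError if the macro did not receive the tuple shape described above.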

The first thing that threw me off here was that, as I understand it, these AST tuples are structured as {operator, metadata, arguments}. So the argument for this bitstring is, well, a string. However, I see from the docs that this is in fact valid syntax for a bitstring:

iex> quote do: <<"hello world">>
{:<<>>, [], ["hello world"]}

Go figure.

Further experimentation reveals that sigil calls are parsed into this:

iex> quote do: ~S"hello world"
{
  :sigil_S,
  [delimiter: "\"", context: Elixir, imports: [{2, Kernel}]],
  [{:<<>>, [], ["hello world"]}, []]
}

Again, my understanding is that calls are represented in the AST as {name, metadata, arguments}, so in this case sigil_S is indeed being called with two arguments: a bitstring literal, and an empty list. Which is exactly what the macros pattern match on, so everything checks out.

Which means that a sigil call is effectively parsed into (the AST representation of) this code:

sigil_S(<<"hello world">>, [])
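One way to check this equivalence is to parse both forms and compare them once the metadata is stripped. This is only a sketch: `strip` is a helper defined here for the comparison, and it relies on `Code.string_to_quoted!/1`, `Macro.prewalk/2` and `Macro.update_meta/2` from the standard library.

```elixir
# Helper (defined here, not a library function): replace every metadata
# list in the AST with [] so only the structure is compared.
strip = fn ast ->
  Macro.prewalk(ast, &Macro.update_meta(&1, fn _meta -> [] end))
end

# Parse the sigil syntax and the explicit call without expanding either.
sigil_ast = Code.string_to_quoted!(~S|~S"hello world"|)
call_ast = Code.string_to_quoted!(~S|sigil_S(<<"hello world">>, [])|)

# Apart from metadata (such as :delimiter), both are the same call node.
true = strip.(sigil_ast) == strip.(call_ast)
```

The only differences removed by `strip` are metadata entries, for example the delimiter recorded on the sigil node.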

So my question is this: why do sigils accept the AST representation of a bitstring, which contains a string, instead of just accepting a string? As far as I can tell, this is not mentioned in the documentation. In fact, I can’t seem to find any mention of it on the internet at all, hence my post.

My guess is the answer has something to do with the metadata list included in the bitstring’s AST tuple, which is indeed used by HEEx to get indentation information, and which is absent from the AST representation of a string literal.

Eiji · September 17, 2023, 7:46pm

In Elixir, a String is a UTF-8 encoded binary:

A UTF-8 encoded binary.

The types String.t() and binary() are equivalent to analysis tools. Although, for those reading the documentation, String.t() implies it is a UTF-8 encoded binary.
Source: t:String.t/0

Example code for a String and its binary representation:

# less sugar
iex> <<"test"::binary>> == <<"test">>
true

# more sugar
iex> <<"test"::binary>> == "test"
true

# UTF-8 encoding
iex> for <<char::utf8 <- "test">>, do: <<char::utf8>>
["t", "e", "s", "t"]

Instead of looking at the Kernel special forms, you should read the Syntax reference; see the "Lists, tuples and binaries" section.

josevalim · September 17, 2023, 7:50pm

Correct. Strings are bitstrings internally and the sigil macro also receives them in the bitstring syntax for consistency/simplicity, otherwise you would have to match on both.

It also has the upside of including a meta, which may carry additional information.
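One reading of "match on both" is the interpolation case: a lowercase sigil with interpolation cannot be represented as a single string in the AST, so it needs the bitstring form anyway. The sketch below parses two ~s calls and inspects the first argument of each. The variable names are illustrative.

```elixir
# Parse two sigil calls without expanding them and pull out the first argument.
{:sigil_s, _, [plain, []]} = Code.string_to_quoted!(~S|~s"hello world"|)
{:sigil_s, _, [interpolated, []]} = Code.string_to_quoted!(~S|~s"hello #{name}"|)

# Both arrive as a :<<>> node; only the list of parts differs.
{:<<>>, _, ["hello world"]} = plain
{:<<>>, _, ["hello ", interpolation]} = interpolated

# The interpolated part is itself AST (a `::` node), not a binary,
# so a bare string as the first argument could not carry it.
{:"::", _, _} = interpolation
```

Because the plain case uses the same node with a single-element list, a sigil macro only needs one clause shape for its first argument.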

garrison · September 17, 2023, 8:12pm

Yes, I am familiar with this - my question was solely to do with the AST representation of the code which is passed to the sigil macros. At runtime they would be equivalent, as I understand it.

I believe this answers my question. Thanks to you both!