Can I configure Erlang leex module to accept Unicode input?

Hi, all.

I was recently assigned a task to convert a custom DSL into a custom AST. The first step I’m trying is tokenizing the DSL using Erlang’s leex module (not the “Live EEx” in Phoenix). You can find the documentation here: Erlang -- leex

The problem is that the DSL contains Unicode characters, and whenever I try to tokenize a piece of DSL code, leex complains that the input is not iodata.

src/my_lexer.xrl

Definitions.

STRING = "[^"]*"

Rules.

{STRING} : {token, {string, TokenLine, token_to_string(TokenChars)}}.

Erlang code.

token_to_string([$"|Chars]) ->
  [$"|Reversed] = lists:reverse(Chars),
  iolist_to_binary(lists:reverse(Reversed)).

iex -S mix

iex> :my_lexer.string('"hello"')
{:ok, [{:string, 1, "hello"}], 1}

iex> :my_lexer.string('"你好"')
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not an iodata term

    :erlang.iolist_to_binary([20320, 22909])
    ./my_lexer.xrl:13: :my_lexer.token_to_string/1
    ./my_lexer.xrl:7: :my_lexer.yyaction/4
    /opt/homebrew/Cellar/erlang/24.0.5/lib/erlang/lib/parsetools-2.3/include/leexinc.hrl:34: :my_lexer.string/4

I noticed that if I compile the regex in unicode mode, it works.

iex> {:ok, regex} = :re.compile('"[^"]*"', [:unicode])
iex> :re.run('"你好"', regex)
{:match, [{0, 8}]}

But it raises when the regex is not compiled in Unicode mode:

iex> {:ok, regex} = :re.compile('"[^"]*"')
iex> :re.run('"你好"', regex)
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not an iodata term

    (stdlib 3.15.2) :re.run([34, 20320, 22909, 34], {:re_pattern, 0, 0, 0, <<69, 82, 67, 80, 77, 0, 0, 0, 0, 0, 0, 0, 81, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 34, 0, 34, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>})

I guess that’s the cause of the leex problem, but I don’t know how to configure leex so that it compiles regular expressions in unicode mode.

For now, I just escape all non-ASCII characters to things like '\\u1234', and convert them back when tokenizing strings, but this approach messes up the locations of the tokens, and may generate inaccurate error messages. Are there any better solutions?

Thanks! :smiley:


The issue here isn’t the regex (and I don’t think leex uses the :re engine; iirc it has its own). The issue is that iodata can only be:

iodata() :: iolist() | binary()
iolist() :: maybe_improper_list(byte() | binary() | iolist(), binary() | [])

So your call to iolist_to_binary/1 is what is failing: codepoints like 20320 and 22909 are larger than 255, so they aren’t byte()s and the list isn’t an iolist. You can use unicode:characters_to_binary/1 instead, like this:

token_to_string([$"|Chars]) ->
  [$"|Reversed] = lists:reverse(Chars),
  unicode:characters_to_binary(lists:reverse(Reversed)).
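
You can see the difference between the two conversions directly in iex, using the codepoints from the traceback above:

```
iex> :erlang.iolist_to_binary([20320, 22909])
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not an iodata term

iex> :unicode.characters_to_binary([20320, 22909])
"你好"
```

unicode:characters_to_binary/1 accepts lists of arbitrary codepoints and encodes them as UTF-8, which is what you want here.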

I tested on my machine and that appears to work with your input.


Thank you so much! It works :heart: