Hi all,

I was recently assigned a task to convert a custom DSL into a custom AST. As a first step, I'm tokenizing the DSL with Erlang's leex module (not the "Live EEx" in Phoenix). You can find the documentation here: Erlang -- leex

The problem is that the DSL contains Unicode characters, and whenever I try to tokenize a piece of the DSL code, leex complains that the input is not iodata.
src/my_lexer.xrl

```erlang
Definitions.

STRING = "[^"]*"

Rules.

{STRING} : {token, {string, TokenLine, token_to_string(TokenChars)}}.

Erlang code.

%% Strip the surrounding double quotes and build a binary.
token_to_string([$" | Chars]) ->
    [$" | Reversed] = lists:reverse(Chars),
    iolist_to_binary(lists:reverse(Reversed)).
```
```elixir
$ iex -S mix
iex> :my_lexer.string('"hello"')
{:ok, [{:string, 1, "hello"}], 1}

iex> :my_lexer.string('"你好"')
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not an iodata term

    :erlang.iolist_to_binary([20320, 22909])
    ./my_lexer.xrl:13: :my_lexer.token_to_string/1
    ./my_lexer.xrl:7: :my_lexer.yyaction/4
    /opt/homebrew/Cellar/erlang/24.0.5/lib/erlang/lib/parsetools-2.3/include/leexinc.hrl:34: :my_lexer.string/4
```
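For reference, the exact argument from the traceback reproduces the error on its own: iodata may only contain bytes (0–255), while `:unicode.characters_to_binary/1` accepts full codepoints:

```elixir
# [20320, 22909] are the codepoints of 你 and 好 — too large to be iodata bytes,
# so :erlang.iolist_to_binary/1 rejects them.
# :unicode.characters_to_binary/1 accepts codepoint lists and encodes them as UTF-8.
:unicode.characters_to_binary([20320, 22909])
# => "你好"
```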
I noticed that if I compile the regex in unicode mode, it works.
```elixir
iex> {:ok, regex} = :re.compile('"[^"]*"', [:unicode])
iex> :re.run('"你好"', regex)
{:match, [{0, 8}]}
```
But it raises an ArgumentError when the regex is not compiled in Unicode mode:
```elixir
iex> {:ok, regex} = :re.compile('"[^"]*"')
iex> :re.run('"你好"', regex)
** (ArgumentError) errors were found at the given arguments:

  * 1st argument: not an iodata term

    (stdlib 3.15.2) :re.run([34, 20320, 22909, 34], {:re_pattern, 0, 0, 0, <<69, 82, 67, 80, 77, 0, 0, 0, 0, 0, 0, 0, 81, 0, 0, 0, 255, 255, 255, 255, 255, 255, 255, 255, 34, 0, 34, 0, 0, 0, 0, 0, 0, 0, 64, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...>>})
```
I guess that’s the cause of the leex problem, but I don’t know how to configure leex so that it compiles regular expressions in unicode mode.
For now, I just escape all non-ASCII characters to sequences like '\\u1234' and convert them back when tokenizing strings, but this approach messes up the token locations and may produce inaccurate error messages. Are there any better solutions?
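Concretely, the escaping step I mean looks roughly like this (a simplified sketch — the `escape` helper is hypothetical and only handles BMP codepoints):

```elixir
# Sketch: replace each non-ASCII codepoint with a \uXXXX escape so the input
# becomes plain ASCII before it reaches the leex-generated scanner.
# (The lexer then has to unescape these when building string tokens.)
escape = fn charlist ->
  Enum.flat_map(charlist, fn
    c when c > 127 ->
      hex = c |> Integer.to_string(16) |> String.pad_leading(4, "0")
      '\\u' ++ String.to_charlist(hex)

    c ->
      [c]
  end)
end

escape.('"你好"')
# now all-ASCII, safe for iolist_to_binary/1
```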
Thanks!