Leex regex problem

kokolegorille · April 10, 2018, 9:01pm

Hello everyone,

I am having some trouble understanding leex behaviour. To give some context, I would like to create a parser for PGN files.

Here is some code to explain my problem.

I created the lexer file pgn_lexer.xrl with

% -- Definitions.

Definitions.

TAG        = (\[[^\]]*\])+
COMMENT    = ({[^}]*})+
WHITESPACE = [\s\t\n\r]

Rules.

{TAG}         : {token, {tag, TokenLine, TokenChars}}.
{COMMENT}     : {token, {comment, TokenLine, TokenChars}}.
{WHITESPACE}+ : skip_token.

Erlang code.

The important rule is for TAG

( \[ [^\]] * \] )+

I cannot understand why

iex> :pgn_lexer.string '[Tag1 "Value1"] [Tag2 "Value2"]{comment}'                                                        
{:ok,
 [
   {:tag, 1, '[Tag1 "Value1"]'},
   {:tag, 2, '[Tag2 "Value2"]'},
   {:comment, 2, '{comment}'}
 ], 2}
# This works as expected: it detects 2 tags

iex> :pgn_lexer.string '[Tag1 "Value1"][Tag2 "Value2"]{comment}'  
{:ok, [{:tag, 1, '[Tag1 "Value1"][Tag2 "Value2"]'}, {:comment, 1, '{comment}'}],
 1}
# This does not work, tag1 and tag2 are merged

Why do the tags need to be separated by a space?

Thanks for any enlightenment.

kip · April 10, 2018, 9:22pm

The reason is that regexes in Leex are greedy. Your regex in this case matches the longest string possible. Here's an example using your regex and data in IEx:

iex> r = ~r/(\[[^\]]*\])+/
~r/(\[[^\]]*\])+/

iex> String.split "[Tag1 \"Value1\"] [Tag2 \"Value2\"]{comment}", r, include_captures: true
["", "[Tag1 \"Value1\"]", " ", "[Tag2 \"Value2\"]", "{comment}"]

iex> String.split "[Tag1 \"Value1\"][Tag2 \"Value2\"]{comment}", r, include_captures: true 
["", "[Tag1 \"Value1\"][Tag2 \"Value2\"]", "{comment}"]

In general I find these issues become prevalent when you're using Leex to parse instead of just tokenize. (I know this because I've spent hours on similar cases myself.) What I have learned is to use Leex to tokenize and Yecc to parse.

Or look at nimble_parsec or ex_spirit

kokolegorille · April 10, 2018, 9:25pm

Thank you for the answer. Now I know why, and I know why I am spending hours too.

NobbZ · April 10, 2018, 9:48pm

Yes, they are greedy, but that's not the cause here.

TAG is defined as (\[[^\]]*\])+ and thus we demand at least one, but allow many, repetitions of square-bracket-pairs-with-stuff-in-between, and consolidate them into one token.

If the second example is expected to spit out 2 tag tokens, then TAG should be just (\[[^\]]*\]) (without the +). At that point we should even be able to remove the grouping parens without changing anything.
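
The effect of the trailing + can be checked outside of leex. The following is a minimal sketch in Python, assuming that Python's `re` module handles these two simple patterns the same way leex does for this input (an assumption, not a statement about leex internals). The names `MERGED` and `SINGLE` are made up for illustration.

```python
import re

# Hypothetical names: MERGED is the original TAG pattern,
# SINGLE is the same pattern without the trailing "+".
MERGED = re.compile(r'(\[[^\]]*\])+')
SINGLE = re.compile(r'\[[^\]]*\]')

adjacent = '[Tag1 "Value1"][Tag2 "Value2"]{comment}'
spaced = '[Tag1 "Value1"] [Tag2 "Value2"]{comment}'

def matches(pattern, text):
    """Return the full text of every non-overlapping match, left to right."""
    return [m.group(0) for m in pattern.finditer(text)]

merged_adjacent = matches(MERGED, adjacent)  # the "+" swallows both tags at once
merged_spaced = matches(MERGED, spaced)      # the space interrupts the repetition
single_adjacent = matches(SINGLE, adjacent)  # one match per bracket pair

print(merged_adjacent)
print(merged_spaced)
print(single_adjacent)
```

With the +, adjacent tags form one match because the repetition keeps consuming bracket pairs. A space between the tags ends the repetition, which is why the first example in the question appeared to work.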

kokolegorille · April 10, 2018, 9:52pm

That did the trick, thank you very much.

Here is the modified version…

TAG        = (\[[^\]]*\])

iex> :pgn_lexer.string '[Tag1 "Value1"][Tag2 "Value2"]{comment}' 
{:ok,
 [
   {:tag, 1, '[Tag1 "Value1"]'},
   {:tag, 1, '[Tag2 "Value2"]'},
   {:comment, 1, '{comment}'}
 ], 1}

And as mentioned, the short version works as well

TAG        = \[[^\]]*\]
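
For readers who want to experiment with the corrected rules without compiling a lexer, here is a rough Python sketch of the same three rules. It is not how leex is implemented: it assumes a longest-match strategy where the first rule wins on ties, it leaves out line tracking (TokenLine), and `RULES` and `tokenize` are hypothetical names.

```python
import re

# Each entry mirrors one rule from the .xrl file; token names come from the thread.
RULES = [
    ("tag", re.compile(r'\[[^\]]*\]')),         # TAG without the trailing "+"
    ("comment", re.compile(r'(\{[^}]*\})+')),   # COMMENT, unchanged
    ("whitespace", re.compile(r'[ \t\n\r]+')),  # skipped, like skip_token
]

def tokenize(text):
    """Split text into (name, chars) tuples, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        # Try every rule at the current position and keep the longest match;
        # the strict ">" means the earlier rule wins when lengths are equal.
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best is None:
            raise ValueError(f"illegal character {text[pos]!r} at offset {pos}")
        name, m = best
        if name != "whitespace":
            tokens.append((name, m.group(0)))
        pos = m.end()
    return tokens

result = tokenize('[Tag1 "Value1"][Tag2 "Value2"]{comment}')
print(result)
```

Because COMMENT still carries a +, two adjacent comments such as {a}{b} would be merged into one token here, just as the tags were before the fix.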

kip · April 10, 2018, 10:00pm

Arrrggghhhhh sorry for my misleading reply.