Parser Combinators how to know when many should return an error

mattmower · February 15, 2022, 9:23am

I’ve rolled my own parser combinator library, Ergo, following the work of particularly Saśa Jurić, and it’s got to the point where it’s good enough to do real work. There are still corners and at least a couple of performance issues but the thing is solid enough.

Now I’ve hit a snag that I hope someone more familiar with parser combinators has dealt with before and can shine a light on the path ahead.

What it boils down to is this:

The many(p) combinator matches p zero or more times.p returning an error is the signal to many that its work is done, so many always return success to its parent parser.

I’ve identified two different situations in which this typically occurs:

The legitimate end of a sequence: p should not match
Early termination due to an invalid input: p should have matched

In the second case when many rewinds the input its leaving it in a state that the following parsers will not expect and parsing will likely fail but not at the real source of the problem.

I’ll try and clarify. If the input was “abcbcbcd” to a parser:

sequence([
  char(?a),
  many(
    sequence([
      char(?b),
      char(?c),
    ])
  ),
  char(?d)
])

The many would consume the “bc” pairs until it reached the “d”, i.e. char(?b) returns an error. At which point sequence parsing would fail and the many would return ok with the “bc” pairs it had matched as an AST. The input would be rewound to “d” and parsing would continue with char(?d) which then matches.

Now let’s imagine the user erroneously writes “abcbbcd” as their input, they forgot one of those "c"s. Now the sequence parsing “bc” would fail when it hit a “b” while expecting a “c”. The many now fails and rewinds the input to “b”.

Now char(?d) attempts to match “b” and fails.

Of course, we expect a failure but the parser appears to be failing at char(?d) with “unexpected character: b expecting d” whereas the real problem is back inside the many and should have failed with “unexpected character: b expecting c”. This makes debugging or reporting errors to the user difficult.

The problem is that many is unable to disambiguate its child parser p failing due to “end of sequence” vs. p failing on an input that should be parsed there.

I’ve written a longer post about this that has a more realistic example.

My current thinking is to add a meta-parser called something like commit that says “when we reach this point we’ve got something we should be able to parse. If this fails it’s not an end of sequence but a real error” and somehow many then knows to return the right error at the right point.

At this point, it’s probably worth mentioning that Ergo passes around a %Context{} structure that holds the remaining input and maintains line, column, and parsing status. A hypothetical commit parser could be setting a flag in the context.

I’d welcome any insight into this situation, thank you.

Matt

LostKobrakai · February 15, 2022, 10:25am

So many() being “zero or more” essentially means this is an optional match. Your goal seems to be making it clear that either parsing ?d was the issue if many() was indeed meant to be empty, or parsing the many() step failed for some reason. So how about:

either(
  sequence([
    many(
      sequence([
        char(?b),
        char(?c),
      ])
    ),
    char(?d)
  ]),
  char(?d)
)

Which will communicate the two different states explicitly to the parser. If both branches fail the error can communicate both expectations instead of just one.

sasajuric · February 15, 2022, 10:56am

We have encountered the same issue when developing an SQL parser. Basically, my takeaway was that parsers such as many and choice are mostly useless if we want to produce a precise error report. Perhaps there is a way to make it work, I’m not super familiar with various parsing techniques, so I’ll explain what we did instead.

We had multiple variations of the problem you describe here, e.g. when parsing arithmetic expressions, or handling joins and subqueries. Our strategy was to abandon many and choice. Instead, we would “commit” to a parsing path based on the next token. In your example grammar, if the next token is b, we’ll commit to parsing the pair bc. If that fails, we’ll emit an error. If that succeeds, we’ll try to parse another bc if the next token is b. I believe that this technique is called “lookahead”.

To make it work, we introduced two additional parsers. The most important one is called switch. Let’s see an example:

switch([
  {char(?b), char(?c)},
  {char(?d), char(?e)},
])

This will parse bc or de, returning a proper error if the second char is invalid. Basically the switch parser commits to the given “branch” if the first parser in some tuple succeeds. If the rest of that branch errors, that error is returned by the switch parser. If no branch can be selected, the parser will return a generic error (similar to choice). If the parser succeeds, it will emit {result_of_the_first_parser, result_of_the_second_parser}

Now in your example we don’t want to return an error if no branch can be selected. Instead, we want to just stop parsing. To make this happen, we added the support for the :else clause, and introduced another parser called noop which always succeeds consuming nothing from the input. So parsing bc or nothing can be now expressed as:

switch([
  {char(?b), char(?c)},
  {:else, noop()}
])

And now, we can parse zero or more bc pairs as:

defp bc_pairs do
  switch([
    {char(?b), sequence([char(?c), lazy(&bc_pairs/0)])},
    {:else, noop()}
  ])
end

The emitted term will be a bit weird, something like:

{?b, [?c, {?b, [?c, nil]}]}

The final nil is emitted by the noop parser. This can be transformed into a list (e.g. ["bc", "bc", ...]) with the map parser.

I’m not sure if there’s a more elegant way, but this is what we did, and it worked fine. The code was definitely more complicated than the naive approach with many, but it was still understandable. Most importantly, our error reports were much more precise and informative.

For completness, we also wrote a parser called choice_deepest_error, which would work like a choice (aka one_of):

choice_deepest_error([
  parser1(),
  parser2(),
  ...
])

This parser emits either the result of parser1, parser2, etc. If no parser succeeds, it returns the error from the parser that consumed most of the input. I personally wasn’t a fan of this approach, because I found it hard to reason about. The “deepest error” isn’t necessarily the correct error. So personally I prefer the lookahead technique.

mattmower · February 15, 2022, 2:11pm

Thanks for thinking it through.

It’s possible that I am missing something in what you’re saying — I feel I am quite close to the problem and that has its own challenges but it seems to me that the focus on the zero-or-more semantics of many here is possibly a misdirect.

If we imagine a parser like:

program = many(elixir_statement)
elxir_statement = choice([case_expr, for_expr, def_expr, with_expr, …])
case_expr = sequence([literal("case"), expr, literal("do"), …, literal("end")])
# and so on

if the input is

case x# do … end

then the case_expr will fail but it’s not the case that for_expr, with_expr, def_expr or anything could match. From the moment it parsed "case " there was no other valid option and something is wrong with the input.

So the case_expr fails, causing the choice to fail and that rolls up into the many which expects its child parser to fail at some point because it treats that as “end of sequence” and then it returns success.

My problem is how to differentiate between the child parser of many failing in the expected “end of sequence” case or the unexpected “malformed choice” case because the error is different in each.

It seems Saša & co. didn’t attempt to solve this but rather worked around it through writing parsers differently.

Looking at your suggestion it seems to me we don’t want to attempt char(?d) because the input is invalid and char(?d) cannot match. It might be that some of these errors could be recovered to continue parsing (in my example of a # in an expression we could clearly attempt to continue parsing since the structure is sound but in other cases, it may not be). What I am trying to do is report the right error (i.e. we need to bubble an error out of many when that’s not its typical behaviour).

Thanks.

Matt

mattmower · February 15, 2022, 2:19pm

Thanks for explaining how you approached it. I’m loathe to abandon choice/many at this juncture as my parsers read quite cleanly with them so I will wrestle with this a little longer but along a similar path.

I’m thinking that by introducing some kind of commit meta parser I can bubble the nature of the error back up to the many and have it return {:error, _} instead of :ok in this case.

Something like:

def actor_block() do
  sequence([
    literal("actor"),
    commit(),
    …
  ])
end

Here the sequence, if it succeeds, clears the commit, otherwise it bubbles it up. When many receives an error that has a commit, instead of treating it as “end of sequence” and returning {:ok, ast} it treats it as ‘malformed input’ returns the error.

Perhaps this will be too complex or not cover all the cases but it seems to me very implicated in many/sequence situations so I will give it a try.

I’ll report back.

Thanks.

sasajuric · February 15, 2022, 3:05pm

Yeah, I think this is an interesting idea that would allow us to do what we want without abandoning the otherwise convenient choice and many parsers. One possible challenge I see is nesting (what if I nest a couple of many & choice parsers with different commits). Also, what should happen if I commit outside of a “committable” parser? Should an exception be raised? In any case, I’m curious to see how will it work out in practice.

mattmower · February 15, 2022, 3:35pm

I have just hit the nesting problem. I’m not sure what the answer is to that but, off-hand I think a commit stack might work since there can only be one current path for the parser. I think.

As to using commit outside of a committable parser, I hadn’t considered that possibility yet. I probably can detect this because my parsers are actually %Parser{} structs. I already check that children are valid parsers so I could probably also check “and not a commit unless I am sequence”. Bit of a pain, let’s see.

mattmower · February 15, 2022, 4:49pm

In my own use, the only time I can think a commit makes sense is within a sequence (here Ergo differs from some other approaches in that I don’t |> pipe parsers to create sequences but have an explicit sequence parser that takes a list of parsers to match in turn).

Raising an exception would seem to make sense in this case and it turns out was easy to implement within Ergo so that’s what happens now. Assuming the approach works at all!

sasajuric · February 15, 2022, 5:26pm

Yeah, but commit only makes sense if somewhere up the callstack there is a parser such as many or choice that can interpret the commit, right?

mattmower · February 16, 2022, 9:59am

In fact, the approach I had taken was getting very busy with housekeeping and I’m trying another. In this case, the commit parser is actually a sentinel value that the sequence parser looks for and, once it sees it, knows to interpret further errors differently.

Now commit is still a valid parser, but it’s implemented as a NOP that returns the context unchanged so if used in an inappropriate place it does no harm.

mattmower · February 16, 2022, 1:00pm

Okay I have an implementation that seems to be working for my basic tests.

It’s possible that more complex, multi-layered, tests will break something. However the approach I have taken here constrains the ‘commit in progress state’ to inside a given sequence processing. I think that should be good enough. Time will tell.

The core change is in the heart of my implementation of the sequence parser which takes a list of parsers to be matched in turn (Ergo doesn’t typically use |> to chain parsers together):

defp sequence_reduce(parsers, %Context{} = ctx) when is_list(parsers) do
    ctx =
      ctx
      |> Context.set_ast([])
      |> Context.set_committed(false)

    {_, seq_ctx} = Enum.reduce_while(parsers, {false, ctx}, fn
      %{type: :commit}, {_, ctx} ->
        # We don't need to invoke the commit parser, just recognise that
        # we have seen it
        IO.puts("COMMIT")
        {:cont, {true, ctx}}

      parser, {committed, ctx} ->
        case {Parser.invoke(ctx, parser), committed} do
          {%Context{status: :ok, ast: child_ast} = ok_ctx, _} ->
            # Return the new CTX but prepend the child AST to the old AST
            {:cont, {committed, %{ok_ctx | ast: [child_ast | ctx.ast]}}}
            # {:cont, {committed, seq_pass(ok_ctx, ctx)}}

          {%Context{status: {:error, _}} = err_ctx, false} ->
            # Regular error, halt and report back
            {:halt, {false, err_ctx}}

          {%Context{status: {:error, _}} = fatal_ctx, true} ->
            IO.puts("FATAL")
            {:halt, {true, Context.set_committed(fatal_ctx, true)}}
        end
    end)

    seq_ctx
  end

I’ve ammended the reduce_while that traverses the list of parsers. It’s state now contains a “committed” value which is set when the sequence see’s the commit parser in its children. That parser is a NOP parser which is treated as a sentinel value.

Once a given sequence has seen commit it treats errors differently by setting the committed flag to true in the %Context{} structure parsers pass around.

The implementations of choice and optional now check if the committed flag on error. If it’s true they halt with the error instead of doing their default behaviour (moving on to the next choice or returning success respectively).

The implementation of many checks for committed being true when an error occurs and returns the error instead of treating as “end of sequence” (and consequently returning success).

Assuming this passes muster with a few more tests it’ll go into Ergo 0.9.5.

mattmower · February 16, 2022, 8:57pm

I spoke too soon of course. There’s a wrinkle with, so far with nested many parsers, so that it’s still potentially reporting the error in the wrong location. I’ll have to think about it.