# Parser Combinators how to know when many should return an error

I’ve rolled my own parser combinator library, Ergo, following the work of particularly Saśa Jurić, and it’s got to the point where it’s good enough to do real work. There are still corners and at least a couple of performance issues but the thing is solid enough.

Now I’ve hit a snag that I hope someone more familiar with parser combinators has dealt with before and can shine a light on the path ahead.

What it boils down to is this:

The `many(p)` combinator matches `p` zero or more times.`p` returning an error is the signal to `many` that its work is done, so many always return success to its parent parser.

I’ve identified two different situations in which this typically occurs:

1. The legitimate end of a sequence: `p` should not match
2. Early termination due to an invalid input: `p` should have matched

In the second case when many rewinds the input its leaving it in a state that the following parsers will not expect and parsing will likely fail but not at the real source of the problem.

I’ll try and clarify. If the input was “abcbcbcd” to a parser:

``````sequence([
char(?a),
many(
sequence([
char(?b),
char(?c),
])
),
char(?d)
])
``````

The `many` would consume the “bc” pairs until it reached the “d”, i.e. `char(?b)` returns an error. At which point `sequence` parsing would fail and the `many` would return ok with the “bc” pairs it had matched as an AST. The input would be rewound to “d” and parsing would continue with `char(?d)` which then matches.

Now let’s imagine the user erroneously writes “abcbbcd” as their input, they forgot one of those "c"s. Now the `sequence` parsing “bc” would fail when it hit a “b” while expecting a “c”. The `many` now fails and rewinds the input to “b”.

Now `char(?d)` attempts to match “b” and fails.

Of course, we expect a failure but the parser appears to be failing at `char(?d)` with “unexpected character: b expecting d” whereas the real problem is back inside the `many` and should have failed with “unexpected character: b expecting c”. This makes debugging or reporting errors to the user difficult.

The problem is that `many` is unable to disambiguate its child parser `p` failing due to “end of sequence” vs. `p` failing on an input that should be parsed there.

My current thinking is to add a meta-parser called something like `commit` that says “when we reach this point we’ve got something we should be able to parse. If this fails it’s not an end of sequence but a real error” and somehow `many` then knows to return the right error at the right point.

At this point, it’s probably worth mentioning that Ergo passes around a `%Context{}` structure that holds the remaining input and maintains line, column, and parsing status. A hypothetical `commit` parser could be setting a flag in the context.

I’d welcome any insight into this situation, thank you.

Matt

3 Likes

So `many()` being “zero or more” essentially means this is an optional match. Your goal seems to be making it clear that either parsing `?d` was the issue if `many()` was indeed meant to be empty, or parsing the `many()` step failed for some reason. So how about:

``````either(
sequence([
many(
sequence([
char(?b),
char(?c),
])
),
char(?d)
]),
char(?d)
)
``````

Which will communicate the two different states explicitly to the parser. If both branches fail the error can communicate both expectations instead of just one.

2 Likes

We have encountered the same issue when developing an SQL parser. Basically, my takeaway was that parsers such as `many` and `choice` are mostly useless if we want to produce a precise error report. Perhaps there is a way to make it work, I’m not super familiar with various parsing techniques, so I’ll explain what we did instead.

We had multiple variations of the problem you describe here, e.g. when parsing arithmetic expressions, or handling joins and subqueries. Our strategy was to abandon `many` and `choice`. Instead, we would “commit” to a parsing path based on the next token. In your example grammar, if the next token is `b`, we’ll commit to parsing the pair `bc`. If that fails, we’ll emit an error. If that succeeds, we’ll try to parse another `bc` if the next token is `b`. I believe that this technique is called “lookahead”.

To make it work, we introduced two additional parsers. The most important one is called `switch`. Let’s see an example:

``````switch([
{char(?b), char(?c)},
{char(?d), char(?e)},
])
``````

This will parse `bc` or `de`, returning a proper error if the second char is invalid. Basically the switch parser commits to the given “branch” if the first parser in some tuple succeeds. If the rest of that branch errors, that error is returned by the switch parser. If no branch can be selected, the parser will return a generic error (similar to `choice`). If the parser succeeds, it will emit `{result_of_the_first_parser, result_of_the_second_parser}`

Now in your example we don’t want to return an error if no branch can be selected. Instead, we want to just stop parsing. To make this happen, we added the support for the `:else` clause, and introduced another parser called `noop` which always succeeds consuming nothing from the input. So parsing `bc` or nothing can be now expressed as:

``````switch([
{char(?b), char(?c)},
{:else, noop()}
])
``````

And now, we can parse zero or more bc pairs as:

``````defp bc_pairs do
switch([
{char(?b), sequence([char(?c), lazy(&bc_pairs/0)])},
{:else, noop()}
])
end
``````

The emitted term will be a bit weird, something like:

``````{?b, [?c, {?b, [?c, nil]}]}
``````

The final `nil` is emitted by the `noop` parser. This can be transformed into a list (e.g. `["bc", "bc", ...]`) with the `map` parser.

I’m not sure if there’s a more elegant way, but this is what we did, and it worked fine. The code was definitely more complicated than the naive approach with `many`, but it was still understandable. Most importantly, our error reports were much more precise and informative.

For completness, we also wrote a parser called `choice_deepest_error`, which would work like a `choice` (aka `one_of`):

``````choice_deepest_error([
parser1(),
parser2(),
...
])
``````

This parser emits either the result of parser1, parser2, etc. If no parser succeeds, it returns the error from the parser that consumed most of the input. I personally wasn’t a fan of this approach, because I found it hard to reason about. The “deepest error” isn’t necessarily the correct error. So personally I prefer the lookahead technique.

3 Likes

Thanks for thinking it through.

It’s possible that I am missing something in what you’re saying — I feel I am quite close to the problem and that has its own challenges but it seems to me that the focus on the zero-or-more semantics of `many` here is possibly a misdirect.

If we imagine a parser like:

``````program = many(elixir_statement)
elxir_statement = choice([case_expr, for_expr, def_expr, with_expr, …])
case_expr = sequence([literal("case"), expr, literal("do"), …, literal("end")])
# and so on
``````

if the input is

case x# do … end

then the `case_expr` will fail but it’s not the case that `for_expr`, `with_expr`, `def_expr` or anything could match. From the moment it parsed "case " there was no other valid option and something is wrong with the input.

So the `case_expr` fails, causing the `choice` to fail and that rolls up into the `many` which expects its child parser to fail at some point because it treats that as “end of sequence” and then it returns success.

My problem is how to differentiate between the child parser of `many` failing in the expected “end of sequence” case or the unexpected “malformed choice” case because the error is different in each.

It seems Saša & co. didn’t attempt to solve this but rather worked around it through writing parsers differently.

Looking at your suggestion it seems to me we don’t want to attempt `char(?d)` because the input is invalid and `char(?d)` cannot match. It might be that some of these errors could be recovered to continue parsing (in my example of a `#` in an expression we could clearly attempt to continue parsing since the structure is sound but in other cases, it may not be). What I am trying to do is report the right error (i.e. we need to bubble an error out of `many` when that’s not its typical behaviour).

Thanks.

Matt

1 Like

Thanks for explaining how you approached it. I’m loathe to abandon `choice`/`many` at this juncture as my parsers read quite cleanly with them so I will wrestle with this a little longer but along a similar path.

I’m thinking that by introducing some kind of `commit` meta parser I can bubble the nature of the error back up to the `many` and have it return `{:error, _}` instead of `:ok` in this case.

Something like:

``````def actor_block() do
sequence([
literal("actor"),
commit(),
…
])
end
``````

Here the `sequence`, if it succeeds, clears the commit, otherwise it bubbles it up. When `many` receives an error that has a commit, instead of treating it as “end of sequence” and returning `{:ok, ast}` it treats it as ‘malformed input’ returns the error.

Perhaps this will be too complex or not cover all the cases but it seems to me very implicated in many/sequence situations so I will give it a try.

I’ll report back.

Thanks.

1 Like

Yeah, I think this is an interesting idea that would allow us to do what we want without abandoning the otherwise convenient `choice` and `many` parsers. One possible challenge I see is nesting (what if I nest a couple of `many` & `choice` parsers with different commits). Also, what should happen if I commit outside of a “committable” parser? Should an exception be raised? In any case, I’m curious to see how will it work out in practice.

1 Like

I have just hit the nesting problem. I’m not sure what the answer is to that but, off-hand I think a commit stack might work since there can only be one current path for the parser. I think.

As to using commit outside of a committable parser, I hadn’t considered that possibility yet. I probably can detect this because my parsers are actually `%Parser{}` structs. I already check that children are valid parsers so I could probably also check “and not a commit unless I am sequence”. Bit of a pain, let’s see.

1 Like

In my own use, the only time I can think a commit makes sense is within a `sequence` (here Ergo differs from some other approaches in that I don’t `|>` pipe parsers to create sequences but have an explicit `sequence` parser that takes a list of parsers to match in turn).

Raising an exception would seem to make sense in this case and it turns out was easy to implement within Ergo so that’s what happens now. Assuming the approach works at all!

1 Like

Yeah, but `commit` only makes sense if somewhere up the callstack there is a parser such as `many` or `choice` that can interpret the commit, right?

1 Like

In fact, the approach I had taken was getting very busy with housekeeping and I’m trying another. In this case, the `commit` parser is actually a sentinel value that the `sequence` parser looks for and, once it sees it, knows to interpret further errors differently.

Now `commit` is still a valid parser, but it’s implemented as a NOP that returns the context unchanged so if used in an inappropriate place it does no harm.

1 Like

Okay I have an implementation that seems to be working for my basic tests.

It’s possible that more complex, multi-layered, tests will break something. However the approach I have taken here constrains the ‘commit in progress state’ to inside a given sequence processing. I think that should be good enough. Time will tell.

The core change is in the heart of my implementation of the `sequence` parser which takes a list of parsers to be matched in turn (Ergo doesn’t typically use |> to chain parsers together):

``````defp sequence_reduce(parsers, %Context{} = ctx) when is_list(parsers) do
ctx =
ctx
|> Context.set_ast([])
|> Context.set_committed(false)

{_, seq_ctx} = Enum.reduce_while(parsers, {false, ctx}, fn
%{type: :commit}, {_, ctx} ->
# We don't need to invoke the commit parser, just recognise that
# we have seen it
IO.puts("COMMIT")
{:cont, {true, ctx}}

parser, {committed, ctx} ->
case {Parser.invoke(ctx, parser), committed} do
{%Context{status: :ok, ast: child_ast} = ok_ctx, _} ->
# Return the new CTX but prepend the child AST to the old AST
{:cont, {committed, %{ok_ctx | ast: [child_ast | ctx.ast]}}}
# {:cont, {committed, seq_pass(ok_ctx, ctx)}}

{%Context{status: {:error, _}} = err_ctx, false} ->
# Regular error, halt and report back
{:halt, {false, err_ctx}}

{%Context{status: {:error, _}} = fatal_ctx, true} ->
IO.puts("FATAL")
{:halt, {true, Context.set_committed(fatal_ctx, true)}}
end
end)

seq_ctx
end
``````

I’ve ammended the `reduce_while` that traverses the list of parsers. It’s state now contains a “committed” value which is set when the sequence see’s the `commit` parser in its children. That parser is a NOP parser which is treated as a sentinel value.

Once a given `sequence` has seen `commit` it treats errors differently by setting the `committed` flag to true in the `%Context{}` structure parsers pass around.

The implementations of `choice` and `optional` now check if the `committed` flag on error. If it’s true they halt with the error instead of doing their default behaviour (moving on to the next choice or returning success respectively).

The implementation of `many` checks for `committed` being true when an error occurs and returns the error instead of treating as “end of sequence” (and consequently returning success).

Assuming this passes muster with a few more tests it’ll go into Ergo 0.9.5.

1 Like

I spoke too soon of course. There’s a wrinkle with, so far with nested `many` parsers, so that it’s still potentially reporting the error in the wrong location. I’ll have to think about it.