I’ve rolled my own parser combinator library, Ergo, following the work of particularly Saśa Jurić, and it’s got to the point where it’s good enough to do real work. There are still corners and at least a couple of performance issues but the thing is solid enough.
Now I’ve hit a snag that I hope someone more familiar with parser combinators has dealt with before and can shine a light on the path ahead.
What it boils down to is this:
The many(p)
combinator matches p
zero or more times.p
returning an error is the signal to many
that its work is done, so many always return success to its parent parser.
I’ve identified two different situations in which this typically occurs:
- The legitimate end of a sequence:
p
should not match - Early termination due to an invalid input:
p
should have matched
In the second case when many rewinds the input its leaving it in a state that the following parsers will not expect and parsing will likely fail but not at the real source of the problem.
I’ll try and clarify. If the input was “abcbcbcd” to a parser:
sequence([
char(?a),
many(
sequence([
char(?b),
char(?c),
])
),
char(?d)
])
The many
would consume the “bc” pairs until it reached the “d”, i.e. char(?b)
returns an error. At which point sequence
parsing would fail and the many
would return ok with the “bc” pairs it had matched as an AST. The input would be rewound to “d” and parsing would continue with char(?d)
which then matches.
Now let’s imagine the user erroneously writes “abcbbcd” as their input, they forgot one of those "c"s. Now the sequence
parsing “bc” would fail when it hit a “b” while expecting a “c”. The many
now fails and rewinds the input to “b”.
Now char(?d)
attempts to match “b” and fails.
Of course, we expect a failure but the parser appears to be failing at char(?d)
with “unexpected character: b expecting d” whereas the real problem is back inside the many
and should have failed with “unexpected character: b expecting c”. This makes debugging or reporting errors to the user difficult.
The problem is that many
is unable to disambiguate its child parser p
failing due to “end of sequence” vs. p
failing on an input that should be parsed there.
I’ve written a longer post about this that has a more realistic example.
My current thinking is to add a meta-parser called something like commit
that says “when we reach this point we’ve got something we should be able to parse. If this fails it’s not an end of sequence but a real error” and somehow many
then knows to return the right error at the right point.
At this point, it’s probably worth mentioning that Ergo passes around a %Context{}
structure that holds the remaining input and maintains line, column, and parsing status. A hypothetical commit
parser could be setting a flag in the context.
I’d welcome any insight into this situation, thank you.
Matt