A few years ago I was working on a SQL query parser, and did some tweaking to improve error handling. At the time nimble_parsec didn’t exist, we were using combine, but I think that the same approach would hold, with a major difference that we were able to make it work with combine without changing the library. In contrast I think you could only do this with nimble if you modify the library (I might be wrong though).
I found that combinators such as choice
and repeat
are not very good at error reporting. For example let’s say we want to parse an arithmetic expression such as 1 + (2 + 3) + 4
. Somewhere, we’ll have a parser named term
which is defined choice([integer, parenthesized_expression])
. Now, let’s say we parse the expression 1 + (2 + 3
. We want to report that closing parenthesis is missing. However, choice
is too generic, so we can only report something like expecting a valid term
.
We solved this by adding a custom parser we called switch
, which would branch based on a lookahead. If we see that the next character is (
, we commit to parenthesized_expression
. If the next character is a digit, we commit to integer
. Otherwise, we return a custom error. Since we commit to the choice, early, we can return the more precise error from the chosen branch. We used the same approach to improve error reporting with the optional
parser.
Another issue is repeat
. For example, let’s say you want to parse a list of arguments, (arg1, arg2, ...)
. You’ll have something like repeat(arg_parser)
. However, since repeat always succeeds, we can’t return a precise error that occurred in some argument. For example, if there is an error in arg3
, repeat will succeed, consuming the first two args, and then the best we can do is return something like unexpected , character
.
This can be handled in a similar way as choice
. We first do a switch lookahead to decide if we’ll parse the first argument (next char is a valid start of the argument) or not (next char is a closing parenthesis). For every subsequent argument, we do a lookahead switch to check if there’s a next argument (next char is ,
) or not (next char is a closing parenthesis). This allows us to precisely report an error in a repeated element.
The changes above will uglify the parser code. In general, it’s my experience that precise error reporting complicates the parser implementation. In our SQL parser we were able to control readability by doing two-phase parsing. We’d first convert the string intu a list of tokens such as {:constant, 123}
, {:identifier, "foobar"}
, … (aka tokenization), and then in the second pass we’d do parse the list of tokens according to the SQL grammar. This approach made the code of the parser significantly easier to follow and change.
It was possible to do this with combine, but I don’t think you can make it work with nimble, because nimble always expects a binary input, and if you’re doing two passes, your input in the 2nd pass is the list of terms, not a string of chars. Therefore, my impression is that nimble is better suited for the cases where precise error reporting is not required, or the cases where the grammar is not too complicated. Otherwise, I’d either roll my own combinator (I did a talk on this, you can find it here), or hand-write a recursive-descent parser. The latter will give you maximum flexibility and speed, but the code will be significantly noisier.