Can you help me understand how to implement NimbleParsec error handling?

mattmower · January 5, 2021, 12:23pm

I am writing a parser using NimbleParsec. Although I am relatively new to Elixir and parser combinators I’m finding it mostly easy to get started with, although the documentation seems… sparse… and have a mostly working parser.

However I am completely at sea in terms of error handling. Most of the time when the parser fails, no matter where the problem is, I get a single unhelpful message about expecting an end-brace at the top-level. The line/byte-offset information is not very intelligible (In my experience there is no information about column). My attempts to use labels seem to obscure rather than helping. In short, I have no clue how I am meant to implement user-intelligible error handling into my parser.

I’m going through every parser I can find on Github (about 20 so far) and, as yet, I have found no/minimal evidence of any error handling. I want to put the tool that uses this parser into the hands of users so I can’t leave it as it is.

Can anyone point to any documentation on this topic or any examples of a NimbleParsec parser that does a reasonable good job of printing out the line & column pointing to the offending token and comprehensible explanation of what is wrong & what was expected?

Many thanks in advance.

Matt

mattmower · January 5, 2021, 8:39pm

I am actually finding the error reporting to be misleading. I’m perfectly willing to believe that it’s me that is the problem but, for example, I have a rule:

attributes =
    ignore(string("@{"))
    |> concat(attribute)
    |> repeat(
      ignore(whitespace)
      |> ignore(string(","))
      |> ignore(whitespace)
      |> concat(attribute)
    )
    |> ignore(cbrace)
    |> reduce({:reduce_args, []})
    |> label("attributes")

You will notice that you must define at least one attribute. An example where it is used is here:


  scene =
    string("Scene")
    |> replace(:scene)
    |> ignore(whitespace)
    |> concat(id)
    |> ignore(whitespace)
    |> ignore(obrace)
    |> ignore(whitespace)
    |> optional(attributes)
    |> ignore(whitespace)
    |> ignore(cbrace)
    |> label("scene def")
    |> wrap

In one example I forgot about that requirement to define an attribute:

Scene #intro {
      @{        
      }
    }

but the error I got was:

iex(67)> HygeParser.parse_file("examples/example1.hygs")
{:error, "expected scene def", "@{        \n      }\n    }\n\n  }\n\n}", %{}, {8, 83}, 89}

The line number is right. The byte_offset well it’s pointing at the first \n character but what is it telling me exactly? And I’m not sure why I’m getting two offsets rather than a column number that would be considerably more useful.

If remove the label from scene I get an error that makes just as little sense:

iex(69)> HygeParser.parse_file("examples/example1.hygs")
{:error, "expected ASCII character equal to '}'", "@{\n      }\n    }\n\n  }\n\n}", %{}, {8, 83}, 89}

Surely it was expecting anything except a ‘}’ since it requires me to define an attribute.

This stuff is making it very difficult for me to figure out when I make a mistake and I’ve no idea how I turn this into something that a user would expect to see if they make a mistake in creating a file.

Help would be greatly appreciated.

Thanks.

Matt

kip · January 5, 2021, 10:04pm

Error reporting is not the best part of nimble_parsec for sure. In your case, I suspect you are seeing the result of this comment in the source where the error is being attributed to the start of the scene combinator.

Do you also label the attribute combinator? If not, does adding a label to it make the error return more relevant?

There are definitely occasions where I have thrown an exception to capture a specific error in parsing to ensure I return an error to the user that is more understandable. I’m not happy with it but it works - especially when the parsing is a compile time action as it seems yours may be.

ityonemo · January 5, 2021, 10:50pm

It’s hard to tell exactly what’s going on since we’re missing a lot of context, but I recommend a few “best practices” with nimbleparsec.

here is my example:

github.com

ityonemo/zigler/blob/master/lib/zig/parser.ex

defmodule Zig.Parser do
  @moduledoc """
  main parsing module:  Handles general parsing of nif, resource directives
  and code elements succeeding those.
  """

  defstruct [:local, file: "", zig_block_line: 0, global: []]

  require Logger

  import NimbleParsec

  alias Zig.Parser.{Nif, Resource, ResourceCleanup}

  @type t :: %__MODULE__{
    local: Nif.t | ResourceCleanup.t | {:doc, iodata},
    file: Path.t,
    zig_block_line: non_neg_integer,
    global: [Nif.t]
  }

This file has been truncated. show original

initialize your parser context as a struct (https://github.com/ityonemo/zigler/blob/master/lib/zig/parser.ex#L68)
avoid using anything besides structs in your context
each combinator should be a series of textual matching, only followed by single (optional post_traverse), that analyzes all text that you have kept around. The only other place I use post_traverse is for internal validations, but I make sure that these traversers don’t alter context state, they only raise on syntax error.
I make a conscious decision to either throw away all of the text always or, only do textual transformations. You can put arbitrary terms in your outputs, but you probably shouldn’t. It’s better to stash them in the context and pull them out at the end. Thus in most cases I avoid label, for example. Here is an example of clear helper function I wrote to make sure that I’m in a sane state after any block. https://github.com/ityonemo/zigler/blob/master/lib/zig/parser.ex#L410.
create tests for your more complex intermediate combinators: https://github.com/ityonemo/zigler/blob/master/lib/zig/parser.ex#L151 that you can verify in unit testing. You don’t have to keep them around in dev/prod.
don’t be afraid to raise. I typically use SyntaxError. If you’re new to elixir, when you’re testing, assert_raise SyntaxError, fn <erroring code> end is a thing.

mattmower · January 6, 2021, 9:29am

Ah, yes, thank you for that pointer and I think you are right. It says

in case of errors but such can be addressed if desired.

although, at this point, I have no idea what that means. Is it saying that it could be fixed in NimbleParsec? Or that a user of the library can fix it in their parser?

Yes. I’ve tried labelling & not-labelling everything to see if it improves the situation and arguably I have found labels make it even harder to figure out what is going on. But, perhaps I am using them wrong. Absent some good examples I find it difficult to understand what is good practice.

[quote=“kip, post:3, topic:36637, full:true”]There are definitely occasions where I have thrown an exception to capture a specific error in parsing to ensure I return an error to the user that is more understandable. I’m not happy with it but it works - especially when the parsing is a compile time action as it seems yours may be.
[/quote]

It’s a side project I am working on that is a bit like Twine (i.e. for creating hypermedia games) so an author writes a ‘game’ file and it generates an HTML/JS app to run the game in the browser.

Do you have any examples of parsers where you (or someone else) does this? I’m not really sure how or where I would throw errors to make this better. It would be great to see something that makes an effort to address these kind of problems.

I wonder if this issue is about how parser combinators work? Or a function of the maturity of NimbleParsec itself?

Thanks again.

Matt

mattmower · January 6, 2021, 9:34am

Thanks ityonemo, will take me a ltitle while to digest your advice but it’s great to have a fuller example to learn from, thank you for that.

m@t

josevalim · January 6, 2021, 8:19pm

Yes. When we fuse combinators, it points to where we started fusing. But this can be addressed by executing the non fused version when the fused version fails. It is trivial to add. If someone has a reproducible example of when the reported line and offset are subpar or wrong, please open up an issue!

ityonemo · January 6, 2021, 10:25pm

I would caution reading that parser is that I have a confusing “whitespace/blankspace” definition … but it is on the to-do list to refactor

mattmower · January 7, 2021, 10:56am

Hi Jose.

I understand the words but cannot put them into action. Do you have an example I could learn from?

This is a direct issue for me right now. As it stands I cannot imagine putting my parser in the hands of users because the error information is going to be misleading/confusing. So I need to solve this problem. I will keep digging but I am finding the lack of examples makes it hard going.

I am also not sure why the error information reports {…{line,offset},offset,…}. The two offsets are always close but not exact. I’m confused at what the difference is. But, more than this, I think a user is going to expect a column not a byte_offset and it’s not clear to me whether NimbleParsec doesn’t do this itself because it’s hard to do? If left as an exercise some pointers about how to add column tracking would be appreciated.

Thanks.

Matt

sasajuric · January 7, 2021, 11:31am

A few years ago I was working on a SQL query parser, and did some tweaking to improve error handling. At the time nimble_parsec didn’t exist, we were using combine, but I think that the same approach would hold, with a major difference that we were able to make it work with combine without changing the library. In contrast I think you could only do this with nimble if you modify the library (I might be wrong though).

I found that combinators such as choice and repeat are not very good at error reporting. For example let’s say we want to parse an arithmetic expression such as 1 + (2 + 3) + 4. Somewhere, we’ll have a parser named term which is defined choice([integer, parenthesized_expression]). Now, let’s say we parse the expression 1 + (2 + 3. We want to report that closing parenthesis is missing. However, choice is too generic, so we can only report something like expecting a valid term.

We solved this by adding a custom parser we called switch, which would branch based on a lookahead. If we see that the next character is (, we commit to parenthesized_expression. If the next character is a digit, we commit to integer. Otherwise, we return a custom error. Since we commit to the choice, early, we can return the more precise error from the chosen branch. We used the same approach to improve error reporting with the optional parser.

Another issue is repeat. For example, let’s say you want to parse a list of arguments, (arg1, arg2, ...). You’ll have something like repeat(arg_parser). However, since repeat always succeeds, we can’t return a precise error that occurred in some argument. For example, if there is an error in arg3, repeat will succeed, consuming the first two args, and then the best we can do is return something like unexpected , character.

This can be handled in a similar way as choice. We first do a switch lookahead to decide if we’ll parse the first argument (next char is a valid start of the argument) or not (next char is a closing parenthesis). For every subsequent argument, we do a lookahead switch to check if there’s a next argument (next char is ,) or not (next char is a closing parenthesis). This allows us to precisely report an error in a repeated element.

The changes above will uglify the parser code. In general, it’s my experience that precise error reporting complicates the parser implementation. In our SQL parser we were able to control readability by doing two-phase parsing. We’d first convert the string intu a list of tokens such as {:constant, 123}, {:identifier, "foobar"}, … (aka tokenization), and then in the second pass we’d do parse the list of tokens according to the SQL grammar. This approach made the code of the parser significantly easier to follow and change.

It was possible to do this with combine, but I don’t think you can make it work with nimble, because nimble always expects a binary input, and if you’re doing two passes, your input in the 2nd pass is the list of terms, not a string of chars. Therefore, my impression is that nimble is better suited for the cases where precise error reporting is not required, or the cases where the grammar is not too complicated. Otherwise, I’d either roll my own combinator (I did a talk on this, you can find it here), or hand-write a recursive-descent parser. The latter will give you maximum flexibility and speed, but the code will be significantly noisier.

mattmower · January 7, 2021, 12:12pm

Sorry in the hubbub of thinking about my problem I forgot to add something very important.

Thank you for creating and sharing NimbleParsec. It’s a great library and I look forward to creating my own little work on top of it.

Matt.

josevalim · January 7, 2021, 5:30pm

It is mostly an implementation concern that you should generally not worry about but I am glad to provide more information:

You can think that every combinator in NimbleParsec compiles to a function. Therefore, if you had something like this:

ascii_char([?a..?z]) |> ascii_char([?0..?9])

We would compile it down to this (simplified):

defp node_1(<<x, rest::binary>>) when x in ?a..?z, do: node_2(rest)
defp node_1(_), do: :error

defp node_2(<<x, rest::binary>>) when x in ?0..?9, do: node_2(rest)
defp node_2(_), do: :error

defp node_3(rest), do: {:ok, rest}

However this approach would be expensive both in terms of compilation and at runtime. So we compile it to this:

defp node_1(<<x, y, rest::binary>>) when x in ?a..?z and y in ?0..?9, do: node_3(rest)
defp node_1(_), do: :error

defp node_3(rest), do: {:ok, rest}

You can see “node_1” and “node_2” were fused into the same pattern. The downside of this approach is that if you fail on “node_2”, because they were fused, the pattern reports it failed at “node_1”.

But this should not affect the parsing behaviour, it only affects the reported lines in case of errors.

The first offset is the offset on the current line. The second is the offset since the beginning of the binary. The offset will be the column if you are parsing ascii. However, if you are parsing something with UTF-8 support, then in case of errors you will have to build the column using the line and offset. The algorithm would go like this:

Get the original input
Find the nth line given by line
Now get the offset bytes for that line and call String.length(binary_part(line_str, 0, offset))

We can probably add a helper function to do this in NimbleParsec.

mattmower · January 7, 2021, 5:50pm

This doesn’t seem to be what I’m seeing. According to file I am parsing a file that is ASCII text but the “line offset” is clearly not referring to a column of that line (e.g. getting a value of 60 something in a line with 10 characters in it). I’ll have to come up with a concrete example later but, I don’t think I am seeing columns.

josevalim · January 7, 2021, 6:21pm

Sorry, I was speaking from memory and it seems I misspoke.

Please try this interpretation: the offset in {line, offset} is the offset the current line starts. And the other offset is either the offset to the current line or the offset in comparison to the whole binary. If the latter, you can get the line offset by doing binary_offset - line_offset.

mattmower · January 8, 2021, 9:34am

I’m going to need a little more time to digest your comments Sasa but I wanted to say thanks for the link to your talk which I just watched and which really helped me to understand what was going on with the combinator approach. It’s not a technique I had used before and I kind of skipped over the underlying concepts in my hurry. Thank you.

mattmower · January 10, 2021, 10:33am

That’s the context returned in:

{:ok, [token], rest, context, position, byte_offset} or
{:error, reason, rest, context, line, byte_offset}

?

At this point, I am not sure how to interact with the context but I generally concur with the idea of replacing maps with structs when you know what the structure should look like.

Is this the key to implementing better error handling? That you adorn the context with more information?

I’m not quite sure what you mean. Are you restating point 1 or do you mean that your context should also only contain structs? That doesn’t seem to be the case in the ZigParser. Can you clarify?

I’m really not sure what guidance you are giving here. All of the combinators are textual matching… I think you are talking about something more subtle? I’ve not used post_traverse so far only map and reduce for converting to lists and maps. Do you have an example of what you are saying not to do?

I confess that I am really not following this at all.

That’s good advice. I am switching from defining helpers as inline variables to functions in a helper module which will, I think, help with that.

Looking at where you are doing this I think it relates to my confusion in the points above. I get the idea that you would raise SyntaxErrors (or possibly “SemanticErrors”) but there is something here about the combinator/context/post_traverse interaction that you are assuming as good practice and I have not grasped yet.

Ah well, beginners mind. If you care to expand on any of your points that would be great but you have, in any case, given me a lot to think about. Thank you.

Matt

mattmower · January 10, 2021, 10:45am

I’m writing a game definition parser (not a natural language parser) as part of an interactive fiction game engine. So I will be parsing primarily human authored text.

In the first case that text will be written by me and I have a low tolerance for being frustrated by error messages (one of my constant bug bears with Clojure). So error handling & reporting are important to me.

Certainly if I imagine anyone else might use it I think it’s essential that an error message be helpful about the position and likely nature of what is wrong.

My understanding is that Jose has done a lot of work on the resulting parsers to make sure they are very performant. While speed is always welcome I’m not writing a parser for HTTP requests where every microsecond counts.

Is the hand-rolled combinator long-term feasible? What would I regret by forgoing NimbleParsec and hand-rolling as you outlined in your video. It seems pretty easy to get started with, but I wonder what I am missing. I guess there is a way to find out…

Thanks.

Matt

mattmower · January 10, 2021, 10:50am

That makes sense, thanks, Jose. I’m not sure if I’ve done it right but I think I’ve submitted a PR to clarify the docs.

sasajuric · January 10, 2021, 2:12pm

If precise error handling is required, I personally wouldn’t use Nimble, except maybe for tokenization.

Hand-rolling a parser for complex languages is definitely feasible. For example see this PR in Gleam where switching to a hand-rolled parser improved error reporting and parsing speed. Also, I recall reading that at some point GCC switched to a hand-rolled recursive-descent parser to improve error reporting.

In cases where the grammar is more involved, I’d probably start with a hand-rolled parser combinator. Being in control of the combinator should provide maximum flexibility, while the parser code should remain readable.

If maximum performance is required, parts of the parser could be converted to manual recursive descent. For example of this technique, take a look at this parser which parses regex-like inputs (input example: "^ENWWW(NEEE|SSE(EE|N))$", full problem description is here).

To better understand this, I advise trying these techniques on a small toy grammar. An arithmetic expression parser which supports parentheses and operator precedence is IMO a great example. You could develop it incrementally, e.g.:

Support flat expression with just one operator (e.g. 1+2+34).
Add support for ignoring whitespaces
Introduce the * operator, and handle operator precedence (e.g. parsing 1 * 2 + 3 * 4 should return something like {+, [{*, [1, 2]}, {*, [3, 4]}}).
Add support for parentheses
Improve quality of reported errors

Start by implementing this with the recursive descent technique. Then see how could it be done with the combinators. This should help you understand the differences between these two approaches.

As a bonus exercise, try implementing two-pass compiling. This can be done by using e.g. Nimble to convert the original input into a list of tokens (e.g. convert 1 * (2 + 3) into [{:integer, 1}, {:operator, "*"}, {:operator, "("}, ...]), and then converting such list into the final ast. IMO, when grammar is more complex, explicit tokenization pass can do wonders for code readability.

Finally, you can also consider using yecc. IIRC, Elixir parser is based on yecc. Also, absinthe does two-phase parsing, using nimble for the first pass, and yecc for the second.

I understand this may all seem overwhelming, and I barely scratched the surface of the complex topic of parsing So as a final parting advice, I think that starting with a hand-rolled combinator is probably a sensible choice which will give you a lot of control and flexibility, while the code should still be readable. As soon as the grammar becomes more complex, consider doing two-phase parsing, with the first pass powered by e.g. nimble.

mattmower · January 11, 2021, 11:39am

Given I have the luxury of time I have decided to try and roll my own system of parser combinators along the lines you describe in your video but starting from the principle of providing good error reporting. I’ve made good progress on the simple stuff and will try and separate it from the project and upload to Github so others can comment. If nothing else I hope to come back to Nimble with a better perspective.

One thing I have done, for better or worse, is define a “parser context” struct type that contains among other things the input, position, and term data and that parsers are written in terms of a context rather than a binary input. This slightly complicates the beginning in that you have to wrap the initial input in a context but I think makes it easier to do the right thing later. Please someone tell me if I am making a ghastly mistake here.