NimbleParsec - a simple and fast parser combinator for Elixir

Offset is the byte offset at which the line starts. You can compute the column from the line offset and the overall offset.

It's the byte offset since the start of the line, so to calculate the column you'd need access to the original input (which could be lost at this point if it was passed straight in, or is utterly huge and streamed in, if streaming is supported later) and count the UTF-8 characters between the two offsets. It wouldn't be a huge calculation each time, but if you are storing line, column, and byte-offset locations on every single AST node of an entire parsed language, essentially the entire input is going to be read over and over again many times to calculate these positions, since you have to iterate over UTF-8 to get the character count, instead of just keeping a running total as positions are calculated, as mine does. Line/column information is extremely common and is beyond useful for accurate error messages; note that both my handcrafted version and ExSpirit output detailed error messages right down to line and column locations (and good error messages tend to be an absolute requirement for any half-decent or better parser). How would you implement this efficiently in NimbleParsec?

You can very easily retrieve the column information for error messages because you always have the original binary:

case call_nimble_parsec(input) do
  {:ok, ...} ->
    ...

  {:error, ..., {line, line_offset}, total_offset, ...} ->
    compute_col(input, line_offset, total_offset)
end
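A minimal sketch of what that compute_col/3 could look like (the helper is hypothetical, named only in the snippet above; the grapheme-counting approach mirrors the binary_part/String.length suggestion given later in this thread):

# Sketch only: count graphemes between the start of the line and the
# error position; columns are reported 1-based by convention.
defp compute_col(input, line_offset, total_offset) do
  input
  |> binary_part(line_offset, total_offset - line_offset)
  |> String.length()
  |> Kernel.+(1)
end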

The concern is that if you want to include this information on every token, then doing so would be expensive… but nobody has asked for this feature so far, so I am glad to punt on it. Once people need it, we will figure it out; it is that simple. For now we are happy to go with the most efficient approach! :+1:

At Aircloak we're parsing SQL queries using combine (nimble didn't exist at the time). It's not likely we'll change to something else: after adding some custom extensions (like lookaheads and support for recursive grammars), it works well enough for our needs, and switching at this point would consume too much time.

That said, our use case might be interesting. We definitely need column information, both at parsing time and in the later phases when we're doing semantic analysis. So even after parsing is done, we might report an error, and we need to include a precise line/column location. In fact, the need for column location was one of the reasons why we chose combine over leex/yecc.

If we were to use nimble today, we'd definitely need the same ability. If I understand properly, this is in fact possible with nimble if we include line and byte offset with every emitted token, and then compute the column offset from that using the original input string. Though somewhat clumsy, this would be acceptable for our case. That said, I think that having some out-of-the-box help for easily getting the column offset would be nice.


Heh, well it's a feature I've used in almost all my language parsers. Putting the line/col/offset information on each node allows me to trivially reproduce the original text from the AST, and it is fantastic for jump-to-things in IDEs and a lot more. :slight_smile:

I cannot overstate just how important error reporting like this is; it can literally be the difference between a project being nice to use and being debugging hell. That is why Spirit/ExSpirit bake that information into everything.

@sasajuric for note, ExSpirit has all the capabilities of Combine, has been around for 2 years at this point, is faster than Combine, etc… There's no reason to use Combine over ExSpirit except maybe build speed (which good use of rules can help fix, as can the Parserx branch, if I finish it, which I'm considering…)

Oh, a very important piece of information I forgot about, and it is actually one of the main reasons why I decided not to add columns for now: your XML example is not computing the column information correctly, because it is working on codepoints and column information should work on graphemes. So proper column information is really expensive to compute as we go, because you can't do it properly without consulting the Unicode database. If ExSpirit is also computing the column information based on codepoints, then it is actually inaccurate for a huge variety of languages.

If you need the column only for error reporting, then I believe what NimbleParsec provides is best, because correct column information requires computing graphemes and we don't want to slow down parsing for something you would need only in case of errors.

If your input is ASCII only, then the column information can be retrieved by subtracting the offsets. Otherwise, you can do input |> binary_part(line_offset, offset - line_offset) |> String.length. The situations where NimbleParsec's approach is not enough are:

  1. You need the column on every token (as you would need if you were to compile Elixir to bytecode because the column information can be used in stacktraces)

  2. You no longer have the original input after parsing

However, if you need the column for every token, I don't see how we can escape from computing it with String.length, because we simply cannot compute it as we traverse.
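For those who do need it on every token, here is a minimal sketch of how that could look with post_traverse, assuming the v1-style traversal signature that returns {rest, acc, context}. The TokenParser and annotate names and the context shape are my own, not part of the library; the original input is passed in via the :context option so the consumed part of the line can be re-sliced:

defmodule TokenParser do
  import NimbleParsec

  # Sketch only: attach {line, column} to each token. `line` arrives as
  # {line_number, line_start_byte_offset}; `offset` is the byte offset
  # after the token, so we subtract the token's byte size to get its start.
  defp annotate(rest, [token], context, {line, line_start}, offset) do
    token_start = offset - byte_size(token)

    column =
      context.input
      |> binary_part(line_start, token_start - line_start)
      |> String.length()

    {rest, [{token, line, column + 1}], context}
  end

  defparsec :word,
            ascii_string([?a..?z, ?A..?Z], min: 1) |> post_traverse({:annotate, []})
end

The caller must supply the original input, since the already-consumed part of the binary is otherwise gone by the time annotate/5 runs:

input = "hello"
TokenParser.word(input, context: %{input: input})
#=> {:ok, [{"hello", 1, 1}], "", %{input: "hello"}, {1, 0}, 5}

This pays the String.length cost on every token, which is exactly the expense being discussed above.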


This can lead to a combinatorial explosion of implementations, where you have, for example, a function with three parameters, each of which can be either static or dynamic. To optimize as much as possible, you'd have to code 2x2x2 = 8 implementations, one for each of the possible combinations of static and dynamic. That quickly increases the size of your code…

On the other hand, you can have only two implementations: one with everything static and another one with everything dynamic.
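As an illustration of that two-implementation strategy (all names here are hypothetical; this is not NimbleParsec's internal API): if any argument is dynamic, lift the static ones into the dynamic representation so a single dynamic code path handles the whole call, and keep the fully static fast path otherwise.

# Sketch: classify a call as fully static or (uniformly) dynamic.
# A dynamic argument is one that must be fetched from the parser
# context at runtime; a static argument is a plain literal.
defp classify_args(args) do
  if Enum.any?(args, &match?({:from_context, _key}, &1)) do
    # Lift every argument to a context -> value function so one
    # dynamic implementation covers all mixed combinations.
    {:dynamic, Enum.map(args, &lift/1)}
  else
    {:static, args}
  end
end

defp lift({:from_context, key}), do: fn context -> Map.fetch!(context, key) end
defp lift(literal), do: fn _context -> literal end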

Yeah, that's the route I would go. And for things like times(:max) I believe one implementation would be enough, or rather, one conditional, because it would be a matter of replacing this line:

by

args = quote(do: [rest, acc, [Map.fetch!(context, unquote(key)) | stack], context, line, offset])

The issue is times(:min), which we do optimize by emitting static code; it would have to become dynamic.


Exactly :slight_smile: I would base everything on a from_context/2 function of the form from_context(key, transformer \\ fn x -> x end) which would insert transformer.(Map.get(context, key)) into the right place.

That seems fully general and quite simple to use. The from_context/2 itself would merely return a piece of AST, of course. The parsec compiler would know what to do with it.
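A sketch of what that could look like (hypothetical; how the context variable gets bound is left to the parsec compiler, as noted above):

# Returns AST that reads a value from the parser context and applies an
# optional transformer. The default transformer is itself quoted so the
# whole thing stays pure AST until the compiler splices it in. `context`
# here stands for whatever variable the parsec compiler injects
# (hygiene is glossed over in this sketch).
def from_context(key, transformer \\ quote(do: fn x -> x end)) do
  quote do
    unquote(transformer).(Map.get(context, unquote(key)))
  end
end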

Unrelated to this, there is a quality-of-life improvement I'd like to have. Currently, the AST contains no line or column metadata. That means that when there is a compilation error in the parser, we get a nasty error whose source is not very clear.

Maybe we could add a macro which would wrap expressions and add source location to the parser AST. For example, something like this:

combinator do
  comb1 = x |> y |> z
end

Which would annotate the value of comb1 with the source location of the combinator macro or even the comb1 definition.
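A rough sketch of how such a combinator macro could capture the location (NimbleParsec has no slot for this metadata today, so this only shows the capture side; attaching it per definition is left open):

# Captures the caller's file/line at expansion time and pairs it with
# the combinator value, so a later compilation step could report errors
# against the definition site.
defmacro combinator(do: block) do
  location = [file: __CALLER__.file, line: __CALLER__.line]

  quote do
    {unquote(location), unquote(block)}
  end
end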

This is compile time, so no anonymous functions for us. You will have to guarantee it is there in the context and use a previous step to put it there.

You can implement this macro yourself with pre_traverse/post_traverse. :slight_smile:

I don't think you understand what I mean. I'm not talking about adding line and column info to the parsed tokens. That's trivial. I'm talking about adding line and column information to the structure you feed into defparsec, so that if something is invalid you know where to look for the mistake in the .ex file that defines the parser.

Ah true true, that is something that needs to be made simpler. What would your recommendation be for grabbing the first grapheme from a binary and returning it along with the rest of the untouched binary?

Primarily I've always baked it into the AST; it's a pattern in almost every compiler (though they often do it via stream ranges, which is not a pattern Elixir has built in, but you could emulate it easily enough with a sub-binary, though that prevents accessing outside of it, which is occasionally useful but not often needed).

This is often a thing in other parsers. It has not really been an issue for me in my Elixir parsing, but in other languages you definitely have only a one-shot iterator. A standard one-shot input iterator is all that's required for C++'s Spirit, as long as you wrap it in a buffered iterator that buffers just enough to allow rollbacks for alternatives and so forth; points like expect flush the buffer, since you know you can't backtrack any further (that's a big thing about expect that I didn't touch on here, as it's not relevant for Elixir). A Stream-supporting parser would be interesting; it was planned for Parserx, but I just haven't had a reason to use it yet (though I can definitely see a lot of reasons).

Yeah, this would be highly useful. I intended to implement it in ExSpirit but never got around to it, and it would be SO useful when building grammars.

I've never had problems with that in ExSpirit, because the defrule macro gives decent enough error reporting.

defrule's rulestack was my work-around for doing it properly, but being able to 'name' the rules is still crazy useful, I think. ^.^

Oooooh, I see what you mean. Unfortunately it is hard to do because they are not macros. We could maybe get the whole stacktrace… but that would be too much. :frowning: Thoughts?

I am not sure I follow. :slight_smile:

What is your idea on the most efficient way to replace this:

def blah(<<c::utf8, r::binary>>), do: {c, r}

With a function that does the same but returns {grapheme, rest_of_binary}?

That function would be String.next_grapheme, yeah.
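For illustration (my example, not from the thread): String.next_grapheme/1 returns exactly that {grapheme, rest} pair, and unlike the ::utf8 pattern above it keeps combining sequences together:

# The ::utf8 pattern splits on codepoints, so a combining accent gets
# separated from its base character:
<<c::utf8, rest::binary>> = "e\u0301!"
c     #=> 101 (?e)
rest  #=> "\u0301!" (the bare combining acute accent, then "!")

# String.next_grapheme/1 keeps the whole grapheme cluster together:
String.next_grapheme("e\u0301!")
#=> {"é", "!"} (the two-codepoint "e" + accent cluster as one grapheme)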

Ah useful, looks like it calls next_grapheme_size/1 in the String.Unicode module, so one could skip the middle module call and go right to the source. Is the String.Unicode module's interface public, or is it not really intended for external use?

It is private, unfortunately.

Hmm, has there been any thought about making at least part of String.Unicode public and stable, so as to at least get rid of one remote module call in extremely hot code paths? This function seems a pretty good candidate for stability. Or perhaps turn the defdelegate in String for it into a public macro that calls into the Unicode module? That would allow overriding it at a later date while getting rid of the runtime cost now. :slight_smile: