NimbleParsec - a simple and fast parser combinator for Elixir

Offset is the byte offset at which the line starts. You can compute the column from the line offset and the overall offset.

It's the byte offset since the start of the line, so to calculate the column you'd need access to the original input (which could be lost at this point if it was passed straight in, or is utterly huge and streamed in, if streaming is supported later) and count the UTF-8 characters between the two offsets. It wouldn't be a huge calculation each time, but if you are storing line, column, and byte-offset locations on every single AST node of an entire parsed language, essentially the entire input is going to be read over and over again many times to calculate these positions, since you have to iterate over UTF-8 to get the character count, instead of just keeping a running total as positions are calculated, as mine does. Line/column information is extremely common and is beyond useful for accurate error messages; note that both my handcrafted version and ExSpirit output detailed error messages right down to line and column locations (and good error messages tend to be an absolute requirement for any half-decent or better parser). How would you implement this efficiently in NimbleParsec?

You can very easily retrieve the column information for error messages because you always have the original binary:

case call_nimble_parsec(input) do
  {:ok, ...} ->
    ...

  {:error, ..., {line, line_offset}, total_offset, ...} ->
    compute_col(input, line_offset, total_offset)
end
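A minimal sketch of what that compute_col/3 could look like (the helper is hypothetical, named only in the snippet above; the grapheme-counting approach mirrors the binary_part/String.length suggestion given later in this thread):

# Sketch only: count graphemes between the start of the line and the
# error position; columns are reported 1-based by convention.
defp compute_col(input, line_offset, total_offset) do
  input
  |> binary_part(line_offset, total_offset - line_offset)
  |> String.length()
  |> Kernel.+(1)
end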

The concern is that if you want to include this information on every token, then doing so would be expensive… but nobody has asked for this feature so far, so I am glad to punt on it. Once people need it, we will figure it out; it is that simple. For now we are happy to go with the most efficient approach! :+1:

At Aircloak we're parsing SQL queries using combine (nimble didn't exist at the time). It's not likely we'll change to something else: after adding some custom extensions (like lookaheads and support for recursive grammars), it works well enough for our needs, and switching at this point would consume too much time.

That said, our use case might be interesting. We definitely need column information, both at parsing time and in the later phases when we're doing semantic analysis. So even after parsing is done, we might report an error, and we need to include a precise line/column location. In fact, the need for column location was one of the reasons why we chose combine over leex/yecc.

If we were to use nimble today, we'd definitely need the same ability. If I understand properly, this is in fact possible with nimble if we include line and byte offset with every emitted token, and then compute the column offset from that using the original input string. Though somewhat clumsy, this would be acceptable for our case. That said, I think that having some out-of-the-box help for easily getting the column offset would be nice.


Heh, well it's a feature I've used in almost all my language parsers. Putting the line/col/offset information on each node allows me to trivially reproduce the original text from the AST, and it is fantastic for jump-to-things in IDEs and a lot more. :slight_smile:

I cannot overstate just how important error reporting like this is; it can literally be the difference between a project being nice to use and being debugging hell. That is why Spirit/ExSpirit bake that information into everything.

@sasajuric for note, ExSpirit has all the capabilities of Combine, has been around for 2 years at this point, is faster than Combine, etc… There's no reason to use Combine over ExSpirit except maybe build speed (which good use of rules can help fix, as can the Parserx branch, if I finish it, which I'm considering…)

Oh, a very important piece of information I forgot about, and it is actually one of the main reasons why I decided not to add columns for now: your XML example is not computing the column information correctly, because it is working on codepoints and column information should work on graphemes. So proper column information is really expensive to compute as we go, because you can't do it properly without consulting the Unicode database. If ExSpirit is also computing the column information based on codepoints, then it is actually inaccurate for a huge variety of languages.

If you need the column only for error reporting, then I believe what NimbleParsec provides is best, because correct column information requires computing graphemes and we don't want to slow down parsing for something you would need only in case of errors.

If your input is ASCII only, then the column information can be retrieved by subtracting the offsets. Otherwise, you can do input |> binary_part(line_offset, offset - line_offset) |> String.length. The situations where NimbleParsec's approach is not enough are:

  1. You need the column on every token (as you would need if you were to compile Elixir to bytecode because the column information can be used in stacktraces)

  2. You no longer have the original input after parsing

However, if you need the column for every token, I don't see how we can escape from computing it with String.length, because we simply cannot compute it as we traverse.
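For those who do need it on every token, here is a minimal sketch of how that could look with post_traverse, assuming the v1-style traversal signature that returns {rest, acc, context}. The TokenParser and annotate names and the context shape are my own, not part of the library; the original input is passed in via the :context option so the consumed part of the line can be re-sliced:

defmodule TokenParser do
  import NimbleParsec

  # Sketch only: attach {line, column} to each token. `line` arrives as
  # {line_number, line_start_byte_offset}; `offset` is the byte offset
  # after the token, so we subtract the token's byte size to get its start.
  defp annotate(rest, [token], context, {line, line_start}, offset) do
    token_start = offset - byte_size(token)

    column =
      context.input
      |> binary_part(line_start, token_start - line_start)
      |> String.length()

    {rest, [{token, line, column + 1}], context}
  end

  defparsec :word,
            ascii_string([?a..?z, ?A..?Z], min: 1) |> post_traverse({:annotate, []})
end

The caller must supply the original input, since the already-consumed part of the binary is otherwise gone by the time annotate/5 runs:

input = "hello"
TokenParser.word(input, context: %{input: input})
#=> {:ok, [{"hello", 1, 1}], "", %{input: "hello"}, {1, 0}, 5}

This pays the String.length cost on every token, which is exactly the expense being discussed above.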


This can lead to a combinatorial explosion of implementations, where you have, for example, a function with three parameters, each of which can be either static or dynamic. To optimize as much as possible, you'd have to code 2x2x2 = 8 implementations, one for each of the possible combinations of static and dynamic. That quickly increases the size of your code…

On the other hand, you can have only two implementations: one with everything static and another one with everything dynamic.
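As an illustration of that two-implementation strategy (all names here are hypothetical; this is not NimbleParsec's internal API): if any argument is dynamic, lift the static ones into the dynamic representation so a single dynamic code path handles the whole call, and keep the fully static fast path otherwise.

# Sketch: classify a call as fully static or (uniformly) dynamic.
# A dynamic argument is one that must be fetched from the parser
# context at runtime; a static argument is a plain literal.
defp classify_args(args) do
  if Enum.any?(args, &match?({:from_context, _key}, &1)) do
    # Lift every argument to a context -> value function so one
    # dynamic implementation covers all mixed combinations.
    {:dynamic, Enum.map(args, &lift/1)}
  else
    {:static, args}
  end
end

defp lift({:from_context, key}), do: fn context -> Map.fetch!(context, key) end
defp lift(literal), do: fn _context -> literal end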

Yeah, that's the route I would go. And for things like times(:max) I believe one implementation would be enough, or rather, one conditional, because it would be a matter of replacing this line:

by

args = quote(do: [rest, acc, [Map.fetch!(context, unquote(key)) | stack], context, line, offset])

The issue is times(:min), which we do optimize by emitting static code; it would have to become dynamic.


Exactly :slight_smile: I would base everything on a from_context/2 function of the form from_context(key, transformer \\ fn x -> x end) which would insert transformer.(Map.get(context, key)) into the right place.

That seems fully general and quite simple to use. The from_context/2 itself would merely return a piece of AST, of course. The parsec compiler would know what to do with it.
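A sketch of what that could look like (hypothetical; how the context variable gets bound is left to the parsec compiler, as noted above):

# Returns AST that reads a value from the parser context and applies an
# optional transformer. The default transformer is itself quoted so the
# whole thing stays pure AST until the compiler splices it in. `context`
# here stands for whatever variable the parsec compiler injects
# (hygiene is glossed over in this sketch).
def from_context(key, transformer \\ quote(do: fn x -> x end)) do
  quote do
    unquote(transformer).(Map.get(context, unquote(key)))
  end
end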

Unrelated to this, there is a quality-of-life improvement I'd like to have. Currently, the AST contains no line or column metadata. That means that when there is a compilation error in the parser, we get a nasty error whose source is not very clear.

Maybe we could add a macro which would wrap expressions and add source location to the parser AST. For example, something like this:

combinator do
  comb1 = x |> y |> z
end

Which would annotate the value of comb1 with the source location of the combinator macro or even the comb1 definition.
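A rough sketch of how such a combinator macro could capture the location (NimbleParsec has no slot for this metadata today, so this only shows the capture side; attaching it per definition is left open):

# Captures the caller's file/line at expansion time and pairs it with
# the combinator value, so a later compilation step could report errors
# against the definition site.
defmacro combinator(do: block) do
  location = [file: __CALLER__.file, line: __CALLER__.line]

  quote do
    {unquote(location), unquote(block)}
  end
end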

This is compile time, so no anonymous functions for us. You will have to guarantee it is there in the context and use a previous step to put it there.

You can implement this macro yourself with pre_traverse/post_traverse. :slight_smile:

I don't think you understand what I mean. I'm not talking about adding line and column info to the parsed tokens. That's trivial. I'm talking about adding line and column information to the structure you feed into defparsec, so that if something is invalid you know where to look for the mistake in the .ex file that defines the parser.

Ah true true, that is something that needs to be made simpler. What would your recommendation be for grabbing the first grapheme from a binary and returning it along with the rest of the untouched binary?

Primarily I've always baked it into the AST; it's a pattern in almost every compiler (though they often do it via stream ranges, which is not a pattern Elixir has built in, but you could emulate it easily enough with a sub-binary, though that prevents accessing outside of it, which is occasionally useful but not often needed).

This is often a thing in other parsers. It has not really been an issue for me in my Elixir parsing, but in other languages you definitely have only a one-shot iterator. A standard one-shot input iterator is all that's required for C++'s Spirit, as long as you wrap it in a buffered iterator that buffers just enough to allow rollbacks for alternatives and so forth; points like expect flush the buffer, since you know you can't backtrack any further (that's a big thing about expect that I didn't touch on here, as it's not relevant for Elixir). A Stream-supporting parser would be interesting; it was planned for Parserx, but I just haven't had a reason to use it yet (though I can definitely see a lot of reasons).

Yeah, this would be highly useful. I intended to implement it in ExSpirit but never got around to it, and it would be SO useful when building grammars.

I've never had problems with that in ExSpirit, because the defrule macro gives decent enough error reporting.

defrule's rulestack was my work-around for doing it properly, but being able to 'name' the rules is still crazy useful, I think. ^.^

Oooooh, I see what you mean. Unfortunately it is hard to do because they are not macros. We could maybe get the whole stacktrace… but that would be too much. :frowning: Thoughts?

I am not sure I follow. :slight_smile:

What is your idea on the most efficient way to replace this:

def blah(<<c::utf8, r::binary>>), do: {c, r}

With a function that does the same but returns {grapheme, rest_of_binary}?

That function would be String.next_grapheme, yeah.
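For illustration (my example, not from the thread): String.next_grapheme/1 returns exactly that {grapheme, rest} pair, and unlike the ::utf8 pattern above it keeps combining sequences together:

# The ::utf8 pattern splits on codepoints, so a combining accent gets
# separated from its base character:
<<c::utf8, rest::binary>> = "e\u0301!"
c     #=> 101 (?e)
rest  #=> "\u0301!" (the bare combining acute accent, then "!")

# String.next_grapheme/1 keeps the whole grapheme cluster together:
String.next_grapheme("e\u0301!")
#=> {"é", "!"} (the two-codepoint "e" + accent cluster as one grapheme)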

Ah useful, looks like it calls next_grapheme_size/1 in the String.Unicode module, so one could skip the middle module call and go right to the source. Is the String.Unicode module's interface public, or is it not really intended for external use?

It is private, unfortunately.

Hmm, has there been any thought about making at least part of String.Unicode public and stable, so as to at least get rid of one remote module call in extremely hot code paths? This function seems a pretty good candidate for stability. Or perhaps turn the defdelegate in String for it into a public macro that calls into the Unicode module? That would allow overriding it at a later date while getting rid of the runtime cost now. :slight_smile: