NimbleParsec - a simple and fast parser combinator for Elixir

Yes, yet another parser combinator library!

Most of the parser combinators in the ecosystem are either compile-time, often using AST traversal and macros, which hurts composition, or runtime based, which means parsing is slow. More importantly, I haven’t found a library that compiles parser combinators down to binary matches that rely on the VM’s optimizations.

So over the last 48h I built yet another parser combinator library for Elixir called NimbleParsec. The combinator composition happens fully at runtime and is then compiled down to binary matching. It works similarly to quoted expressions in Elixir: the combinators build an AST which is then compiled down to binary match clauses. See the link above for an example and the code it compiles down to.
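To give an idea of the composition style, here is a minimal date parser (a sketch against the current NimbleParsec API; combinator names may differ slightly between versions):

```elixir
defmodule DateParser do
  import NimbleParsec

  # YYYY-MM-DD; defparsec/2 compiles this composition down to
  # function clauses that pattern match directly on the binary
  defparsec :date,
            integer(4)
            |> ignore(string("-"))
            |> integer(2)
            |> ignore(string("-"))
            |> integer(2)
end
```

Calling `DateParser.date("2018-04-02")` then yields the three integers with the dashes ignored, all via generated binary-matching clauses.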

I ran @OvermindDL1’s benchmark scripts and got these results:

$ mix bench
Erlang/OTP 20 [erts-9.0] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false]
Elixir 1.7.0-dev
Benchmark suite executing with the following configuration:
warmup: 2.0s
time: 3.0s
parallel: 1
inputs: parse_datetime, parse_int_10
Estimated total run time: 30.0s


Benchmarking with input parse_datetime:
Benchmarking combine...
Benchmarking ex_spirit...
Benchmarking nimble...
Warning: The function you are trying to benchmark is super fast, making measures more unreliable! See: https://github.com/PragTob/benchee/wiki/Benchee-Warnings#fast-execution-warning


Benchmarking with input parse_int_10:
Benchmarking combine...
Benchmarking ex_spirit...
Benchmarking nimble...
Warning: The function you are trying to benchmark is super fast, making measures more unreliable! See: https://github.com/PragTob/benchee/wiki/Benchee-Warnings#fast-execution-warning


##### With input parse_datetime #####
Name                ips        average  deviation         median
nimble        1425.75 K        0.70 μs   ±437.03%        0.70 μs
ex_spirit      177.70 K        5.63 μs   ±115.86%        5.00 μs
combine         95.83 K       10.44 μs    ±93.60%        9.00 μs

Comparison:
nimble        1425.75 K
ex_spirit      177.70 K - 8.02x slower
combine         95.83 K - 14.88x slower

##### With input parse_int_10 #####
Name                ips        average  deviation         median
nimble         949.95 K        1.05 μs   ±216.77%        1.00 μs
ex_spirit      463.71 K        2.16 μs   ±950.25%        2.00 μs
combine        338.62 K        2.95 μs   ±760.53%        2.00 μs

Comparison:
nimble         949.95 K
ex_spirit      463.71 K - 2.05x slower
combine        338.62 K - 2.81x slower

The above shows that for parsing datetimes, nimble is 8x faster than ex_spirit and almost 15x faster than combine. For the integer case, nimble is only about twice as fast, but it is worth noting that nimble’s integer parser is written on top of existing combinators, while the integer parsers for both ex_spirit and combine are written by hand. So nimble is beating hand-written code there.

I did not measure memory usage but that should also decrease with nimble thanks to binary matching.

I have also benchmarked compilation times by compiling the same datetime parser 30 times. combine takes 1s, which makes sense as it is runtime based. nimble takes 2s and ex_spirit takes 6s.

While nimble is extremely new, I think most of the primitives are there, so you should be able to build almost anything. Improvements, PRs and feedback are very welcome, thanks!

50 Likes

Whooo thank you! That looks like the expression based generator output I had started to write! Whooo! I do not want to have to manage more libraries than I have to so I’m glad to have a better one out, thank you! ^.^

A few questions before I delve too far in:

Upon output and errors does it give at the very least the byte location, UTF-8 column, UTF-8 line, and the rest of the stream itself?

How’s its error handling, detailed errors? Line/Column info and all? Is it 1-based or 0-based?

For one of my own projects I really do need to parse very weird integral bases (which is why ex_spirit supports bases from 2 all the way up to base 36). Does it support these? And are they choosable based on prior parsed data instead of having to hardcode all variants? (I was fixing that bit in my expression based parser but haven’t got around to it yet.)

Is there any way of passing parsing state through the parser (the ‘state’ functions in ex_spirit)? Those are pretty well required for most of the things I write. And what can they affect?

I.e., all functionality currently in ex_spirit is functionality that I’ve needed so far, so I’d need equivalent methods to be able to replace it. Is all such functionality possible in nimble?

P.S. I love how you name things NimbleSomething. ^.^

EDIT: I see it removes the context map. The map itself was definitely a bit of the slow down in mine and I really wanted to remove it, but it made it just so convenient to use. How can state be stored and matched against?

Like the examples at:

How would all of those be done in nimble?

3 Likes

We don’t give the byte location. We do give UTF-8 line and column.

Yes. We also build an error message from labels and the combinators themselves.

I am not quite sure what you mean, but if you annotate each token properly, you have all of the information you need, which can then be reduced into a final result with reduce/3.

Thank you!

The map is not the issue as long as you guarantee the first argument is the binary and you avoid referencing it outside of a tail recursion or a pattern match.
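For example, a hand-rolled sketch of the shape that keeps the optimization:

```elixir
defmodule Digits do
  # keeping the binary as the first argument and consuming it only via
  # pattern matching in tail calls lets the BEAM reuse its match context
  # instead of re-slicing the binary on every step
  def take(<<c, rest::binary>>, acc) when c in ?0..?9, do: take(rest, [c - ?0 | acc])
  def take(rest, acc), do: {Enum.reverse(acc), rest}
end

# Digits.take("123abc", []) #=> {[1, 2, 3], "abc"}
```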

I have no idea! :smiley:

Btw, I have one question. It annoys me that the tokens emitted by the parser do not contain the line+column information. But none of the parser combinator libraries seem to do it by default. How do you tackle this in ex_spirit?

Good enough for me. Actually, for my purposes there are no parse errors (Makeup lexers should consume every binary passed as an argument)

This is definitely not the same thing. You can’t parse context-sensitive languages without storing the context somewhere; you’d have to change the grammar into something that’s context-free, either in a preprocessing stage or at a later stage. For example, how would you, using nimble, write a parser that detects when a line is less indented than the previous line? Take this bit of Python code:

def f(x):
    if x > 0:
        return x
    else:
        return -x

Blocks are dictated by indentation, so you need to store the indentation stack somewhere and compare it with the current indentation level. With ExSpirit, it’s trivial to write indent() and dedent() parsers that recognize this (so that you can match a block as indent() |> block() |> dedent() or something like that). AFAIK, you can’t do this if you don’t store the context somewhere. This is not a showstopper for many applications, but when you need these kinds of context-dependent parsers, you really need them.
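For illustration, the bookkeeping such indent()/dedent() parsers rely on could be sketched like this (a toy sketch of the idea, not ExSpirit’s actual implementation):

```elixir
defmodule IndentSketch do
  # classify a line's indentation against the stack of open indent levels;
  # a dedent may close several levels at once
  def classify(indent, [top | _] = stack) when indent > top, do: {:indent, [indent | stack]}
  def classify(indent, [top | _] = stack) when indent == top, do: {:same, stack}
  def classify(indent, stack), do: {:dedent, Enum.drop_while(stack, &(&1 > indent))}
end

# IndentSketch.classify(8, [4, 0]) #=> {:indent, [8, 4, 0]}
# IndentSketch.classify(0, [8, 4, 0]) #=> {:dedent, [0]}
```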

(Actually, the Python parser is context-free: during a preprocessing stage, the lexer introduces special INDENT and DEDENT tokens which the parser can match as if they were matching parentheses.)

So yeah, if you apply arbitrary transformations to your string before or after parsing you don’t need the context, but in some cases it might be simpler to just keep a context.

What API do you propose for that? Doing this kind of thing by default is kinda hard… Which tokens would you want to tag? what is a token?

You don’t. ExSpirit doesn’t handle this by default. It’s trivial to define a parser combinator that does it, though. You just take the old context (before running the inner parser) and the new context (after running the inner parser), and this will give you old_position and new_position with which you can annotate the token if you wish. In ExSpirit you can do it like this: https://hexdocs.pm/ex_spirit/ExSpirit.Parser.html#pipe_context_around/3

I think this does require first-class access to the context map (in your case a tuple, although a record might be even better).

How hard is this to add? I can think of some uses for that (and I can elaborate on that, but first I’d like to study nimble's internals)

Could you elaborate on this? Is the code compiled at runtime? This doesn’t make much sense, especially because later you claim that nimble leaves no trace on your modules after compilation.

I give a byte location as well in mine because a couple of times I needed to grab a range of the binary from a before/after pair, and byte offsets make that O(1) instead of the O(N) that just a line/column would allow for, because of UTF-8.

Take the XML mini-parser in the example: it confirms the open tag matches the closing tag by holding the state (name) of the open tag. The example will fail to parse if that doesn’t match, but I could, for example, default to self-creating a close tag on a mismatch if I so wanted. Holding bindings is significantly useful for a LOT of stuff and I use it rather excessively… >.>

Compare that to just parsing out the tags directly, which means you have to make ‘yet another parse step’ over your output from the parser, and at that point you are basically using it as a tokenizer instead of a full modern-style PEG parser.

Yeah, that is how the BEAM is built; unfortunately Elixir is built to support piping into the first argument instead of the last, which contrasts with what I wanted in the system… ^.^;

Lol, those are very basic simple tests that, honestly, don’t use near the functionality that I needed, but at the very least they need to be supported… ^.^;

Pretty trivially. In ex_spirit I don’t do it by default because I can’t know what form the user wants it in, and I HATE FURGIN HATE when a parser library only outputs in a specific format instead of the AST that I want. That is the purpose of the Spirit style: you can parse out into the exact output format that you want, all at the same time, without those irritating multiple parsing steps from format to format. That especially matters when the ‘way’ you want to parse may change based on information you have parsed before, ‘after’ you have massaged it into the needed format; the XML example is the absolute barest form of an example of that.

Consequently, that is why I have map_around and map_context_around and other various mapping forms. In the original C++ version of Spirit this was handled via nice overloading: something like (uint_ >> ':' >> uint_)[tagger_] in ExSpirit’ese would be something like map_context(seq([uint(), lit(?:), uint()]), &tagger/1), and just like in the C++ version, where you can put the [...] on any rule part, you can put the map parts on any part in Elixir too. ExSpirit comes with a basic tag/3 call to wrap a result in a 2-tuple (for easy migration from more leex’y style things), but I could easily make one called, oh, ast or whatever that puts them in an Elixir-style 3-tuple with whatever position information is wanted. I do not assume what form is wanted though, so the user should make their own function for it, which is pretty trivial to do. For the ast form the user could just define something like this, which puts in the line/column and auto-sets the tag name based on the defrule name:

      def ast(%{error: nil, result: result, rulestack: [name | _]} = context) do
        %{ context |
          result: {name, [line: context.line, column: context.column], result}
        }
      end
      def ast(context) do
        context
      end

Then just use it like:

defrule blah(
  ), fun: ast()

And of course you could pass the name into the ast() call if you wanted, or make it optional or whatever. But at this point you just add that , fun: ast() to the defrules.

Yeah for parsing elixir that would entirely be fine. :slight_smile:

Yep, this is the state. Knowing whether you are parsing an expression that resolves to an integer rather than a string may change how you parse next; this is the requirement of state.

Which of course you can’t do if you don’t control the language that you are parsing… ^.^;

This is also the same thing as the XML parser example I linked in my prior post. :slight_smile:

And this was a big reason C++ Spirit was created, and why I use it so much.

Yeah the official python parser has multiple stages. To do it all in one stage you have to hold the context somewhere.

Ah yep, here I cannot assume a form of how you want the output. C++’s and my Spirit libraries can output everything from a literal token stream, just like any other tokenizer, all the way up to an AST, to even performing, say, interpretation of a language as it runs (say a lisp’y language). One of the examples for C++ Spirit is writing a calculator: first they write it to produce an AST, then they write it again so it actually calculates the output ‘as’ it parses, so the final result is not an AST but the actual answer. No AST is needed, nothing needs to be held but a simple number at each recursion, and it is blazing fast in C++.

Yeah, pipe_context_around/3 is what you’d use instead if you want both the before and after contexts so you can record a ‘range’ (which is where storing the byte locations is VERY useful, because then you can simply binary-split or match out the specific range to store, or just store the byte locations, which is O(1) instead of the O(N) of handling lines/columns). Or, using the state system, you can store the byte location (or binary location and the byte to get a range) at the start of the rule and use that in the fun/1 on the defrule to grab the location information. There are many ways to handle it; it all depends on what fits best into your style and what you are parsing. :slight_smile:

Yeah if you want to grab a range from the binary to hold as, say, location data, the byte is SIGNIFICANTLY faster (O(1) again) compared to just using the line/column (O(N)).

@tmbb His parser works a LOT like my expression parser was going to: it can generate all it needs directly. That is what his one defparsec call does; it transforms the expressions into the reified code. He just saved me the work of doing it, since I’ve been too busy to implement it thus far… ^.^;

I think it’s because it doesn’t leave a trace in that it doesn’t leave any calls ‘into’ his library once it’s made (well, mine doesn’t either, except for the Context __struct__ name); rather it generates efficient functions directly (where the current/old version of ExSpirit relies on optimizations at the BEAM level instead, while the new expression-style version generates everything directly, as NimbleParsec is doing).

Also, blah! I forgot to git push the updated ex_spirit to GitHub, so hex is more up to date than it, and my work ssh pipe is not responding (and I’m not going to be back at work until next Wednesday/Thursday due to real-life reasons). :frowning:

I could drop by this weekend and use my key to get in if it’s really wanted for me to upload it to github (hmm, or I could rip out the parts from the hex package as that contains all the changes too…).

1 Like

Also an example of holding the byte location is that it makes parsers like lexeme trivial to implement.
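For example, with the byte offset before and after running the inner parser, a lexeme is just an O(1) slice of the original input (a rough sketch of the idea, with hypothetical names):

```elixir
defmodule LexemeSketch do
  # given the byte offset before and after running the inner parser,
  # lexeme is a single O(1) binary slice of the original input
  def lexeme(input, byte_before, byte_after) do
    binary_part(input, byte_before, byte_after - byte_before)
  end
end

# LexemeSketch.lexeme("foo = 42", 6, 8) #=> "42"
```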

I’m not actually sure how to implement that parser in NimbleParsec from what I’m looking at… How would you implement lexeme? (It’s a parser I use a LOT.)

Everything currently in ExSpirit is something I use, so at the very least all of it would need to be reproducible. :slight_smile:

Also notice that a user can replicate and write all of ExSpirit themselves if they so wish; it is highly extensible (which is one of the issues I was running into when I started on the expression parser: I could still make it extensible, just not nearly as easily for the user…).

Yes, that’s exactly why having byte positions is useful. I might need this for Makeup lexers that feature a language embedded inside another, like EEx: I must slice the input string, feed it into different subparsers, split the token streams and stitch them again in the right order. Byte offsets make splitting these strings easier.

By writing a combinator that takes the previous and the next context and extracts the relevant number of bytes (admittedly very hard if nimble doesn’t support byte offsets). I’m not sure you can access the previous and next context tuples (analogous to ExSpirit’s map).

Hmm, I’m entirely mistaken: there are a couple of remote calls to the module after compilation, like when it converts a string to an integer it calls NimbleParsec.__runtime_integer__/1, as one example of actually quite a few…

So… I don’t actually know? ^.^;

You have to tell the whole story here: it’s only trivial to implement as long as all the primitive parsers (even user-defined ones) maintain the byte position correctly. Remember the errors caused by a wrong implementation of the symbols parser.

Also, the integer parser is… extremely limited… It seems it does base 10 only, no other options? o.O

Also, it seems the integer parser is not capable of negative or positive definitions (which of course may be language dependent based on what you are parsing, and thus not included by default in ExSpirit either, since they are easy for the user to specify), and yet it is called ‘integer’ instead of ‘unsigned_integer’ or ‘uint’ or so. The name is not quite accurate?

Hehe, fence errors! ^.^;

Honestly, the old interface of ExSpirit is very low level. Once I wrote defrule I should have rewritten, oh, most of it as defrule's perhaps, but most of the parsers existed before defrule did, sooo… >.>

Also another note: ex_spirit is built, just like the C++ version, to handle more than just binaries; that just happens to be what the Text submodule focuses on, but the base parsers could all handle lists (say, of tokens) or a custom format or whatever as well. It doesn’t look like nimble can support anything but binaries. Which is fine for me currently, as I’ve not needed the other functionality yet (though it is a possibility in the future, hence why I left it open).

Also, is there a reason for the ‘literal’ thing? It’s very noisy when just a literal string in code would be far easier to read (which is what my expression parser builder was going to support instead of needing lit(...) everywhere). I know it’s for easy piping, but a better wrapper around that would work as well.

Also, parsec/1,2 creates a stacktrace entry, but so do things like choice. And couldn’t it be optimized to call straight into the parser anyway, bypassing the caller function? :slight_smile:

Overall it looks cool, but it needs a couple of changes: integer being more properly named, handling more bases, a way to extend the parser easily, a way to handle state, byte handling, and a few dozen helper parser bits. :slight_smile:

I see. We do keep a stack where we can store contexts internally, so if this is needed, we can support it. Just open up an issue with the API you have in mind and we can discuss it. I actually think the traverse combinator can be generalized into a context combinator.

Think about it like quoted expressions: you can compose and change them as if they were runtime components, but then you pass them to macros like def which compile them down. So while you are composing, you do need the library, but after you call defparsec you do not.

No, those calls happen at compilation time. The generated code doesn’t call NimbleParsec at all.

Not hard at all. I did not think about the use of bytes to extract offsets. I am wondering if there is a simpler scheme that doesn’t require me to keep all of line, column and offset. There is also the issue that we can’t properly count UTF-8 columns when parsing codepoint by codepoint. So maybe I will keep the line as {line, last_line_byte} and the current byte. The column can then be properly derived from those two.
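For example, a sketch of how the column could be derived under that scheme (a hypothetical helper, not nimble’s actual code):

```elixir
defmodule ColumnSketch do
  # derive a UTF-8 aware, 1-based column from {line, last_line_byte}
  # plus the current byte offset into the original input
  def column(input, {_line, last_line_byte}, current_byte) do
    input
    |> binary_part(last_line_byte, current_byte - last_line_byte)
    |> String.length()
    |> Kernel.+(1)
  end
end

# ColumnSketch.column("héllo", {1, 0}, 3) #=> 3 ("hé" is 3 bytes but 2 codepoints)
```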

Yes, the focus is on binaries because the goal is to explicitly leverage binary matching.

I agree. I will remove literal in favor of just passing the binary.

The problem with parsec/2 is not the call but the fact that we need to handle the returned value and resume the caller. Also note choice/2 in nimble does not necessarily create a stacktrace entry if we know that all choices are bound patterns.

Thanks! integer will likely be kept as is but I will clarify the docs. And we don’t plan to handle more bases. As I said earlier, integer is built on top of existing primitives, so you can easily support other bases and make it as performant as the built-in integer.
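For example, a base-16 integer can be assembled from the same primitives (a sketch against the current NimbleParsec API, not a built-in; swap the digit ranges and the base for anything from 2 to 36):

```elixir
defmodule HexSketch do
  import NimbleParsec

  # base-16 digits, reduced back into a string and then converted
  # with String.to_integer/2 in the desired base
  defparsec :hex,
            ascii_char([?0..?9, ?a..?f, ?A..?F])
            |> times(min: 1)
            |> reduce({List, :to_string, []})
            |> map({String, :to_integer, [16]})
end
```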

A lot of my focus on the library was exactly in defining the low-level primitives, so I want to focus on capabilities rather than adding many different helpers. If we solve the former, the latter is simple and fast.

So I am very interested in constructs we don’t support yet so we can extract the underlying primitives out of them.

2 Likes

I think the “first argument” requirement is going away in OTP 21

https://github.com/erlang/otp/pull/1687

1 Like

I don’t think you do, but that’s my fault: I haven’t explained myself well. From what I gather (I haven’t had much time to read the source), you store a parser stack. That’s not remotely enough for what I’m talking about. To parse context-sensitive languages I need a place where I can store arbitrary data and retrieve or modify it in a later parser. Your combinators seem to operate on tuples of the form:

{:ok, acc, stack, rest, line, column}

I’d need a tuple like this:

{:ok, acc, stack, state, rest, line, column}

And I’d need to get and set the state from inside my parsers. The simplest example is XML, like @OvermindDL1 said, in which you need to match <x>...</x>: you need to keep a stack of XML tags, and that stack needs to be get and set by the user. The implementation in ExSpirit is a little more complex, because there is both a global state and a local state, but I think a global state should be enough.

Also, for the reasons @OvermindDL1 and I outlined above, I’d like a byte_position value:

{:ok, acc, stack, rest, byte_position, line, column}

This starts to get a little hard to pattern match on, so you should give some serious thought to the idea of making it a record, which would act much closer to a struct and would be easier to pattern match on.
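For reference, Elixir’s Record module makes such a shape easy to match on while still compiling down to a plain tuple (a hypothetical layout, not nimble’s actual internals):

```elixir
defmodule ParserState do
  require Record

  # hypothetical parser state as a record: it compiles to a tagged tuple,
  # but fields can be matched and updated by name
  Record.defrecord(:ctx, acc: [], stack: [], state: %{}, rest: "", byte: 0, line: 1, column: 1)

  # e.g. advancing to a new line only touches the named fields
  def bump_line(ctx(line: line) = c), do: ctx(c, line: line + 1, column: 1)
end
```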

Because your combinators don’t have access to the full context, I don’t think you can support it without a major redesign. It’s totally ok if you want to only support parsing context-free languages, but if nimble is supposed to be comparable to ExSpirit it needs to give users some access to the full context. Which, as I said above, makes it trivial to tag tokens with their source location (whatever your tokens are).

1 Like

Ah I see, it returns AST without being a macro, cool. :slight_smile:

Whooo!!

Ah, true, though that’s almost never the case in what I parse; the choices can be quite… complex… ^.^;

Also, is there anything that cancels backtracking, so that once something is parsed you know whatever follows must match or it is a failure (like my expect parser)? Without it, it is easy to get into some really bad worst-case parsing times with a PEG parser.

Hmm, how might that look? I use a lot of bases at times and what base to use is calculated based on what was previously parsed (passed in via state) rather than being hardcoded variants.

State, adjusting state, passing state ‘as’ arguments to parsers, etc… That is the biggest one by far, as without it some languages become extremely difficult to parse, if not outright impossible to do correctly.

Very cool!

I really really like local state, makes things very simple. I don’t actually use global state at all and instead pass state ‘up’ via arguments and ‘down’ packed into results. :slight_smile:

How would you maintain an indentation stack with only local state? You need all indent levels lower than the previous one. Would you push the stack into local state every time?