NimbleParsec - a simple and fast parser combinator for Elixir

It might be worthwhile to attempt to switch the order of the parsers, such that extlang will only be used when there really are only three characters.

I appreciate the suggestion - and I’ve tried all manner of ordering without success. I can’t seen a way to avoid needing some kind of forcing backtracking (failing a parser in my post above).

A very quick suggestion is to unfold the ABNF grammar into complete combinators and then refactor it. IIRC ABNF is able of tracking back multiple levels but nimble isn’t, so you need to restructure the problem. The math example shows how the solutions end-up a bit different.

The other thing I noticed is that you are excessively relying on parsec. You should compose on the helpers as much as possible and fallback to combinators only if compilation times are high.

1 Like

I second @josevalim here. If you use regular combinators, the parser will be measurably faster, although it will be slower to compile. By using parsec excessively you WILL decrease runtime performance (although it might be worh it if compile times are too high).

Appreciate the feedback @josevalim and @tmbb. Moved everything to helpers except the entry point. Shows about 30% performance improvement (in a non scientific timer test). I’ll try some grammar unrolling this weekend to see if I can work around my original issue.

1 Like

I have written a small library to make such tests as scientific as possible and as easy as possible. It’s called schism, and it’s available on Github here (no hex package yet): GitHub - tmbb/schism: A library that makes it easy to have alternative implementation of functions for benchmarks

There is a post discussing it here in the forum: Schism - a library to make benchmarking easy (and heretical!)

Basically, it uses some macros to define alternative versions of code fragments (function definitions, parts of function definitions, etc). It then provides some utilities to make it easier to benchmark the changes with Benchee.

You use it like this:

defmodule YourModule do
  # Some code, don't need to touch here
  # The code here is the same for both versions
  schism "inline vs parsec" do
    dogma "parsec" do
      # single combinator, coded in such a way that it uses parsec
    end

    heresy "inline" do
      # same combinator, but avoiding parsec
    end
  end
  # some more code..
  # The code here is the same for both versions
end

With benchee you can then compare the dogma and the heresy and determine objectively what is faster. This is obviously something you can do yourself, but the beauty of it is that it provides a very simple API with almost no friction to test the impact of small changes in single functions on the performance of the whole application, without having to maintain two copies of the application.

EDIT: are you afraid the two implementations of the function are not compatible? Don’t worry, the defsnippet macro in the Schism package handles that situation pretty well. For what it’s worth, Schism was originally written to benchmark alternative implementations of nimble_parsec parsers.

4 Likes

Thats very cool and I can’t believe I missed it when you posted in April. Thanks for the assist! And now back to grammar unrolling 
 :slight_smile:

I ran a benchee example of my old parser and the new one using nimble_parsec with only one parsec/1 call. The old one used an ABNF parser (and the even older one used leex and yecc). Really positive results:

Old version: Average of 163.87 ÎŒs
Nimble_parsec version: Average of 28.40 ÎŒs

I think ~6 times faster was worth the evenings work :slight_smile: . Thanks again @josevalim and @tmbb

1 Like

By the way, did you end up using Schism?

Not this time - the dependencies are different (abnf2 which I maintain and nimble_parsec). But for sure its going to be a standard part of my performance improvement activities from now on.

Note: I initially wrote this as an issue on the repo, but I thought it might be better served by being a forum post, as it turned out quite long, and may invite discussion.

@josevalim, let me first start by saying thanks for making this library. I really like the combination of just creating a data structure describing the parser using functions (thus avoiding excessive macro magic and making things as composable as possible), but then—instead of interpreting that structure at runtime—compiling it into efficient code.

While this isn’t explicitly mentioned anywhere in the project docs, the overall style of the library plus the reference to Parsec suggests that the underlying idea (whether originally intentional or not) is to do PEG-style parsing. Even if it’s not an exact implementation of PEG, it would be reasonable to expect that NimbleParsec should be equal in “parsing power”, if you will, to a PEG parser—that is, admit any languages which a PEG parser does.

If you look, for instance, at Wikipedia’s definition of a PEG, you will see the following combinators mentioned as fundamental operators (you can technically get away with fewer, but it starts to get impractical to implement), and most of these do resemble constructs NimbleParsec has:

  • Sequence—either concat/2 or the first argument of a combinator
  • Ordered choice—choice/2 implements this
  • Zero-or-more—repeat/2
  • One-or-more—times/3 with min: 1

However, then you come to these:

  • And-predicate (&)
  • Not-predicate (!)

These are the two predicates which denote positive and negative lookahead. Put simply, & asserts that the parser it wraps matches, and ! asserts that the parser it wraps does not match. Crucially, they don’t consume any input. Say that we call them require and deny for the purposes of NimbleParsec. That means that require(string("foo")) |> utf8_string([], min: 3) would parse in any string starting in “foo”, and deny(utf8_char([?0..?9])) |> utf8_string([], min: 1) would parse any string not starting with a digit.

N.B. technically you only need !, since for any parser p, !!p == &p in behaviour, but explicitly implementing & is likely to be more efficient.

Why is proper lookahead support needed?

I know most requests of the form “we should implement a new combinator in NimbleParsec” are met with the response of “this library is supposed to be barebones, and you should make your own combinators”, as well as “what is the particular usecase”? Therefore, I aim to show a couple of grammars which I believe NimbleParsec cannot currently parse properly, but which would be useful to support, and how they could be supported if require and unless are implemented.

N.B. I am not tied to the names require/2 and deny/2—they were the first I though of which seemed to fit in with the existing naming theme.

Keywords and identifiers

Let’s imagine we’re parsing a super simple programming language. For now, we only need to care about keywords (let’s say if and while) and identifiers (let’s say any string at all in the range of [a-zA-Z0-9]).

Great, let’s get started! We go ahead and write some code like this:

defmodule NPL.Helpers do
  import NimbleParsec

  def keyword(c \\ empty()),
    do:
      c
      |> choice([
        string("if") |> replace(:if),
        string("while") |> replace(:while)
      ])
      |> tag(:kw)

  def identifier(c \\ empty()), do: c |> utf8_string([?a..?z, ?A..?Z, ?0..?9], min: 1) |> tag(:id)

  def expr_(),
    do:
      choice([
        keyword(),
        identifier()
      ])
      |> optional(string(" ") |> ignore() |> expr())

  def expr(c \\ empty()), do: c |> parsec(:expr)
end

defmodule NPL do
  import NimbleParsec
  import NPL.Helpers

  defparsec(
    :expr,
    expr_()
  )
end

Let’s test it:

iex(1)> import NPL
NPL
iex(2)> expr "if"
{:ok, [kw: [:if]], "", %{}, {1, 0}, 2}
iex(3)> expr "foo"
{:ok, [id: ["foo"]], "", %{}, {1, 0}, 3}
iex(4)> expr "foo if"
{:ok, [id: ["foo"], kw: [:if]], "", %{}, {1, 0}, 6}

Seems like it works fine!

iex(5)> expr "iffy"  
{:ok, [kw: [:if]], "fy", %{}, {1, 0}, 2}

Uh
 no, we wanted that to just be identifier. Now we have a keyboard and a failed parse for the rest of the input. Maybe if we switch the order of the choice around?

def expr_(),
    do:
      choice([
        identifier(),
        keyword()
      ])
      |> optional(string(" ") |> ignore() |> expr())
iex(2)> expr "iffy"
{:ok, [id: ["iffy"]], "", %{}, {1, 0}, 4}
iex(3)> expr "if"  
{:ok, [id: ["if"]], "", %{}, {1, 0}, 2}

Now identifiers which have a keyword prefix are parsed fine, but we no longer recognise keywords as keywords!

Units

You might say “NimbleParsec was really designed for much simpler parsing—language codes, CSV maybe, simple formats, not full on languages!” Fine, let’s tackle a much simpler problem: parsing measurement units. Let’s define a parser like this:

defmodule NPT.Helpers do
  import NimbleParsec

  def metric(c \\ empty()),
    do:
      c
      |> choice([
        string("mol"),
        string("l"),
        string("m"),
        string("g")
      ])
      |> label("metric base unit")

  def datetime(c \\ empty()),
    do:
      c
      |> choice([
        string("yr"),
        string("mo"),
        string("wk"),
        string("d"),
        string("h")
      ])
      |> label("temporal unit")

  def prefix(c \\ empty()),
    do:
      c
      |> choice([
        string("m"),
        string("k"),
        string("c")
      ])
      |> label("metric prefix")

  def unit(c \\ empty()),
    do:
      choice([
        datetime() |> tag(:datetime),
        metric() |> tag(:metric),
        prefix() |> tag(:prefix) |> metric() |> tag(:metric)
      ])
end

defmodule NPT do
  import NimbleParsec
  import NPT.Helpers

  defparsec(
    :parse_unit,
    unit()
  )
end
iex(1)> parse_unit "m"
{:ok, [metric: ["m"]], "", %{}, {1, 0}, 1}
iex(2)> parse_unit "mol"
{:ok, [datetime: ["mo"]], "l", %{}, {1, 0}, 2}
iex(3)> parse_unit "ml" 
{:ok, [metric: ["m"]], "l", %{}, {1, 0}, 1}

It’s the same issue!

It turns out that PEG parsers (which is how NimbleParsec is acting), in contrast to, for instance, regular expressions, are greedy and don’t backtrack. If we have a choice between several underlying parsers, and one of them parses a prefix of the other, we will always have issues. PEG gives us a great deal of control over how exactly parsing takes place, but this also requires us to be very specific where the parser needs to be careful—that is, where it needs to check ahead without consuming any input, so it can make a decision on how to consume further input (PEGs are linear, so once the output is consumed, there’s no going back! In NimbleParsec, once the output is consumed, we have tail-recursed and there is no way to go back!)

Why is lookahead/2 not enough?

This issue has been mentioned on the repo, and as a response, the lookahead/2 function has been implemented. Let’s try fixing our keyword parser using the function, and let’s add a couple more parsecs just for the individual types of objects we have in our grammar, to help with debugging.

Our goal here is to try to say “when we’re trying to parse an identifier, first check if the next token up to a space is a keyword. If so, fail the parse (with the expectation that when in a choice block, this would cause the next choice to be taken).”

...
  def identifier(c \\ empty()),
    do:
      c
      |> lookahead(:fail_on_keyword)
      |> utf8_string([?a..?z, ?A..?Z, ?0..?9], min: 1)
      |> tag(:id)
...
  def fail_on_keyword("if " <> _, _, _, _), do: {:error, "if found"}
  def fail_on_keyword("while " <> _, _, _, _), do: {:error, "if found"}
  def fail_on_keyword(_, ctx, _, _), do: {[], ctx}
...
  defparsec(
    :keyword,
    keyword()
  )

  defparsec(
    :identifier,
    identifier()
  )

Right off the bat, we see one immediate issue with lookahead/2: it’s not a composable combinator, but rather some sort of halfway measure, where we need to define a separate function for each little thing we want to check. We also need to ensure this extra function is accessible in the module which defines the parsec! We can’t just easily re-use our existing combinators, we’d need to have a separate function call and explicitly call another parsec within it, and return :error or :ok based on whether it succeeds. That’s quite unergonomic, but perhaps could be solved to some degree with a quoted_lookahead. Nevertheless, let’s try our new solution out:

iex(2)> keyword "if"
{:ok, [kw: [:if]], "", %{}, {1, 0}, 2}
iex(3)> identifier "iffy"
{:ok, [id: ["iffy"]], "", %{}, {1, 0}, 4}
iex(4)> identifier "if " 
{:error, "if found", "if ", %{}, {1, 0}, 0}

Seems like our individual parsers are working ok. Now, let’s try their combination:

iex(6)> expr "if foo"
{:error, "if found", "if foo", %{}, {1, 0}, 0}

And no, this doesn’t work. Even if you ignore the ergonomic issues around lookahead, its fundamental shortcoming is the fact that it doesn’t compose—when it fails inside a choice or repeat block, the failure leaks to the top level of the parser, instead of causing the parent parser to stop looking at that branch and continue.

Proposed solution

Revisiting the keyword example one more time, I think that you should be able to write something like this:

  def id_char(c \\ empty()), do: c |> utf8_char([?a..?z, ?A..?Z, ?0..?9])
  
  def keyword(c \\ empty()),
    do:
      c
      |> choice([
        string("if") |> replace(:if),
        string("while") |> replace(:while)
      ])
      |> deny(id_char())
      |> tag(:kw)

  # ignore things like actually joining this into a string
  def identifier(c \\ empty()),
    do:
      c
      |> times(id_char(), min: 1)
      |> tag(:id)

In fact, adding deny (and having repeat handle it properly) would allow repeat_until to be removed, since it could be written like this:

  def repeat_until(c \\ empty(), to_repeat, until),
    do:
      c
      |> repeat(
        deny(until())
        |> to_repeat()
      )

As for implementation, it seems to me reading through the compiler code that failed parsers are already specially handled in some cases, but not all (I’m referring to the catch_all piece of config which is passed around. I believe that for all parsers (bound and unbound), both repeat and choice should respect the failure or success of them. Implementing require and deny will require making a stack frame at the point they are invoked (since we need to be able to get back to the state of the binary, accumulator etc right before we called them), which is unavoidable, but we still retain great efficiency and tail recursion for the majority of the code (we only make extra stack frames in specific places, unlike a full backtracking parser, which needs to keep the full stack around).

Please let me know what you think about this—I’m pretty certain that what I described cannot currently be achieved generally with NimbleParsec, but if I misunderstood how it works somewhere, please let me know. I think implementing this feature would greatly aid the library on its way to be a powerful set of primitives with which fully-featured parsers can be built.

Side note: failures vs errors

While not necessary for more parsing power, it would be useful to distinguish between failures (this parser failed, but it it’s in a choice or repeat then try the next choice / stop the repeat) and errors (something unrecoverable was seen, fail right away and show the user an error) in order to allow better error messages to be reported back to the users. Of course, this would need to come after all errors are actually handled consistently as proposed above, and is probably a separate discussion. This paper gives a good overview of how some other parsers handle this and what advantages it has.

3 Likes

I agree, and although NimbleParsec has replaced a little bit of my ExSpirit library usages, but I still use ExSpirit on occasion just because NimbleParsec can’t handle some things as well (especially around state information).

The issues in your post are for NimbleParsec supposed to be done by a secondary pass, it’s not a single-pass parsing library (which loses a lot of it’s capabilities).

I’m sure Jose would be up for accepting PR’s to integrate the features that my ExSpirit library has, the state handling capabilities would be a HUGE boon for me and I could drop ExSpirit if it’s features were ported over. :slight_smile:

For note, that same example in ExSpirit (give or take, should actually be parsing out whatever comes after if/while):

iex(1)> defmodule Testing do
...(1)>   use ExSpirit.Parser, text: true
...(1)> 
...(1)>   defrule identifier(tag(:id, chars([?a..?z, ?A..?Z, ?0..?9], 1)))
...(1)>   
...(1)>   defrule keywords(context) do
...(1)>     symbols_ =
...(1)>       ExSpirit.TreeMap.new()
...(1)>       |> ExSpirit.TreeMap.add_text("if", :if) # Value can be a function to return a parser
...(1)>       |> ExSpirit.TreeMap.add_text("while", :while) # Value can be a function to return a parser
...(1)>     context |> symbols(symbols_) |> lookahead_not(identifier()) # Should actually parse the if/while instead...
...(1)>   end
...(1)>   
...(1)>   defrule expr(alt([
...(1)>     keywords(),
...(1)>     identifier(),
...(1)>   ]))
...(1)> end
{:module, Testing,
 <<70, 79, 82, 49, 0, 0, 127, 188, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 3, 7,
   0, 0, 0, 84, 14, 69, 108, 105, 120, 105, 114, 46, 84, 101, 115, 116, 105,                                                          
   110, 103, 8, 95, 95, 105, 110, 102, 111, 95, ...>>, {:expr, 1}}
iex(2)> import ExSpirit.Parser
ExSpirit.Parser
iex(3)> parse("if", Testing.identifier)
%ExSpirit.Parser.Context{
  column: 3,
  error: nil,
  filename: "<unknown>",
  line: 1,
  position: 2,
  rest: "",
  result: {:id, "if"},
  rulestack: [],
  skipper: nil,
  state: %{},
  userdata: nil
}
iex(4)> parse("if", Testing.identifier).result
{:id, "if"}
iex(5)> parse("iffy", Testing.identifier).result
{:id, "iffy"}
iex(6)> parse("if", Testing.expr).result
:if
iex(7)> parse("iffy", Testing.expr).result
{:id, "iffy"}

I should probably just finish my ExSpirit.Parserx rewrite (which started just before NimbleParsec came out) as it is faster.

Thank you @mjadczak. This was a very interesting and entertaining read!

NimbleParsec was never meant to be a complete parsec but it doesn’t mean it can’t be one as long as it fits within its goals.

You are correct that today lookahead can only “error” and I would probably keep the error semantics. It is also worth saying that lookahead is implemented on top of traverse. Maybe we can introduce {:deny, ...} to traverse, at least as a starting point, and that should make certain designs possible.

Here is a diff that allows just that:

diff --git a/lib/nimble_parsec/compiler.ex b/lib/nimble_parsec/compiler.ex
index 842ab6f..c0c0233 100644
--- a/lib/nimble_parsec/compiler.ex
+++ b/lib/nimble_parsec/compiler.ex
@@ -277,12 +277,24 @@ defmodule NimbleParsec.Compiler do
     quote(do: unquote(next)(rest, acc, stack, context, line, offset))
   end

-  defp traverse(traversal, next, rest, user_acc, context, line, offset, _) do
+  defp traverse(traversal, next, rest, user_acc, context, line, offset, config) do
     case apply_traverse(traversal, rest, user_acc, context, line, offset) do
-      {user_acc, ^context} when user_acc != :error ->
+      {user_acc, ^context} when not is_atom(user_acc) ->
         quote(do: unquote(next)(rest, unquote(user_acc) ++ acc, stack, context, line, offset))

       quoted ->
+        deny =
+          case config do
+            %{catch_all: nil} ->
+              quote do: {:error, reason, rest, context, line, offset}
+
+            %{catch_all: next, acc_depth: n} ->
+              {_, _, _, body} = build_proxy_to(:unused, next, n)
+              body
+          end
+
         quote do
           case unquote(quoted) do
             {user_acc, context} when is_list(user_acc) ->
@@ -290,6 +302,9 @@ defmodule NimbleParsec.Compiler do

             {:error, reason} ->
               {:error, reason, rest, context, line, offset}
+
+            {:deny, reason} ->
+              unquote(deny)
           end
         end
     end
@@ -299,7 +314,8 @@ defmodule NimbleParsec.Compiler do
     apply_traverse(Enum.reverse(mfargs), rest, {acc, context}, line, offset)
   end

-  defp apply_traverse([mfargs | tail], rest, {acc, context}, line, offset) when acc != :error do
+  defp apply_traverse([mfargs | tail], rest, {acc, context}, line, offset)
+       when not is_atom(acc) do
     acc_context = apply_mfa(mfargs, [rest, acc, context, line, offset])
     apply_traverse(tail, rest, acc_context, line, offset)
   end
@@ -674,7 +690,7 @@ defmodule NimbleParsec.Compiler do
         {traverse_line, traverse_offset} = pre_post_traverse(kind, pre_metadata, post_metadata)

         case apply_traverse(mfargs, rest, outputs, context, traverse_line, traverse_offset) do
-          {outputs, ^context} when outputs != :error ->
+          {outputs, ^context} when not is_atom(outputs) ->
             {:ok, inputs, guards, outputs, post_metadata}

           _ ->

The diff above will allow you to return {:deny, "not an identifier"} from lookahead/traverse. The reason is only used fi you call deny outside of something you cannot escape. Once I do it, this works as expected:

defmodule NPL.Helpers do
  import NimbleParsec

  def keyword(c \\ empty()),
    do:
      c
      |> choice([
        string("if") |> replace(:if),
        string("while") |> replace(:while)
      ])
      |> lookahead(:fail_on_identifier)
      |> tag(:kw)

  def identifier(c \\ empty()),
    do:
      c
      |> utf8_string([?a..?z, ?A..?Z, ?0..?9], min: 1)
      |> tag(:id)

  def fail_on_identifier(<<a, _::binary>>, _, _, _) when a in ?a..?z, do: {:deny, "identifier"}
  def fail_on_identifier(_, ctx, _, _), do: {[], ctx}

  def expr_(),
    do:
      choice([
        keyword(),
        identifier()
      ])
      |> optional(string(" ") |> ignore() |> expr())

  def expr(c \\ empty()), do: c |> parsec(:expr)
end

defmodule NPL do
  import NimbleParsec
  import NPL.Helpers

  defparsec(
    :expr,
    expr_()
  )
end

IO.inspect NPL.expr "iffy"  

I agree there are still “logistics” issues. Having to write lookahead for everything is not nice but it should provide a starting point so we can generalize it into a proper combinator.

3 Likes

I’m not certain that you could do this at all with the (as of writing) NimbleParsec, even with multiple passes. Do you have a quick example of how you could do it in two passes? It seems to me that things like supporting state (which are really just “what kind of side computation can we attach to the parsing logic”) are different from lookahead (which is a fundamental block of the parser). ExSpirit may have both, but the fundamental difference here is that you have lookahead_not IMO.

Thanks! I’m going to play around based off that diff, but good to see it can fit somewhat nicely in the existing infrastucture.

I do feel quite strongly that lookahead should be a fundamental combinator rather than a “call to user code” function. It seems like even with your patch, in order to use a pre-existing parser (and I mean just a value or a function, not a parsec/combinator you would have to wrap that parser into a defparsec so that you could call it from the auxiliary traverse function.

I also think that having this :deny form be a special case just for this lookahead usecase also hurts composability, which I understood was a core goal of the library—it would be nice if I could take any parser at all, which returned an :error when run by itself, and stick it into :deny, turning it into a success. Once again I’ll reiterate that from a CS theory point of view, this is a fundamental combinator in a strict sense of “there are grammars you cannot parse without it”, which is not the case even for things like repeat!

Thanks again though, I’m looking forward to playing around some more with this idea and seeing if I can come up with a nice way to integrate it into the existing structure.

You could be treating it basically as a tokanizer, which does relegate it’s functionality a bit but that’s mostly how it’s been built to be


Yep, that is just the biggest feature that I routinely use from ExSpirit that NimbleParsec doesn’t have, I can mostly work around the rest, mostly (mine contains a lot more result information that I use quite a bit that NimbleParsec just doesn’t have).

Actually everything mine has is built it from a base set of things, and lookahead/lookahead_not you can implement entirely in user-code in ExSpirit. NimbleParsec generates code in comparison so what it can do is limited to what it’s code generators know about. Perhaps with some good injection points then people could write their own generators
?

That all comes back to how NimbleParsec generates it’s fast code. PR’s can extend it though! :slight_smile:

Despite being the instigator of the lookadhead/2 combinator, I have to say I was unaware of how it had been implemented. Like you, I don’t like the fact that this combinator doesn’t take a parser as argument. This is very important information for my future uses of NimbleParsec.

I think that NimbleParsec should be very clear on what it supports and what it doesn’t support. That is important for users to decide whether the library fits their needs. Maybe someone with experience with grammars could spell out how powerful NimbleParsec is exactly? As we’ve seen, it’s not a PEG parser, but it might be something slightly weaker but still very powerful.

I very disagree very strongly with this idea. For a parser, it’s very important to decide ahead of time what it’s capabilities are. Adding features by accretion doesn’t seem like a good way to do this. (IMO) You shouldn’t add features one by one. You should pick a parsing style (Regex, PEG, Context-sensitive [spirit-like]) and then implement it. In stages, if necessary, but with the model picked ahead of time.

It’s totally OK if NimbleParsec doesn’t become any more powerful than it is because that would compromise the main design goal (which is to be crazy fast), but the capabilities of a possible new version should be decided as soon as possible.

1 Like

Look at the elixir lexer in my makeup project. I parse everything as a keyword and then in a second pass highlight some keywords in a different way. But the initial step treats “if” and “while” just like “a” and “b”. On my phone so no links, sorry.

1 Like

Wait what? I’ve not seen it yet in NimbleParsec, how does it work without a combinator to give it a success/fail result?! o.O

You have to supply a user-defined function (which needs to be a real def in the module) which takes the binary and has to return a success or error.