NimbleParsec - a simple and fast parser combinator for Elixir

It is actually very easy to do. Just wrap the combinator definitions (i.e. var = expr) inside a macro which can annotate the values on the left side based on the line of the match operator.

I’ve written a macro to do just that, but I can’t connect my laptop to the sketchy wifi I’m using now…

The only change you need is a metadata field on the tuples returned by the combinators, so that I can add the line number and the rule name (which is simply the variable name on the left-hand side of the match operator).

Then you can say: "invalid argument to combinator #{name} defined on line #{line}"


The code:

defmodule MyLib do
  def get_line(meta) do
    Keyword.get(meta, :line)
  end

  def annotate_parser_ast(ast, line, rule) do
    # This is stupid because we assume the input is a list of pairs,
    # while the output is a list of 3-tuples.
    # In the real version, the input will also be a list of 3-tuples.
    for {key, val} <- ast do
      {key, %{line: line, rule: rule}, val}
    end
  end

  # If this is a simple assignment, annotate the parser AST with location information
  defp annotate_quoted_combinator({:=, _meta, [{rule_name, meta, nil} = lhs, rhs]}) do
    line = get_line(meta)

    quote do
      unquote(lhs) = MyLib.annotate_parser_ast(unquote(rhs), unquote(line), unquote(rule_name))
    end
  end

  # Else do nothing
  defp annotate_quoted_combinator(something_else) do
    something_else
  end

  defmacro combinators(do: {:__block__, _meta, exprs}) do
    annotated_exprs = Enum.map(exprs, &annotate_quoted_combinator/1)

    quote do
      (unquote_splicing(annotated_exprs))
    end
  end

  # A do-block with a single expression is not wrapped in :__block__,
  # so handle that case too
  defmacro combinators(do: expr) do
    annotate_quoted_combinator(expr)
  end
end

defmodule MyParser do
  import MyLib
  import NimbleParsec

  combinators do
    comb1 = string("abc")

    comb2 = string("xyz")

    comb3 =
      choice([
        comb1,
        comb2
      ])
  end

  IO.inspect(comb1, label: "comb1")
  IO.inspect(comb2, label: "comb2")
  IO.inspect(comb3, label: "comb3")
end

When you compile this code, you’ll get the following output:

Compiling 1 file (.ex)
comb1: [{:string, %{line: 43, rule: :comb1}, "abc"}]
comb2: [{:string, %{line: 45, rule: :comb2}, "xyz"}]
comb3: [
  {:choice, %{line: 47, rule: :comb3},
   [
     [{:string, %{line: 43, rule: :comb1}, "abc"}],
     [{:string, %{line: 45, rule: :comb2}, "xyz"}]
   ]}
]

Basically this adds a metadata field to the combinators. That way, if there is a problem, the compiler used by defparsec can give a nice error message containing the line number and the name of the “rule” (I’m stealing the name from ExSpirit). For this to work, the metadata field must be added to the combinators (I haven’t read enough of the nimble_parsec implementation to see if this would be feasible, but it looks like it should be).

The idea is that the combinators macro just fills in the metadata in the parser AST. It does so by cleverly mixing functions and macros to control which parts of the (Elixir) AST are evaluated and which ones are returned as-is.

Well, you could accept MFA tuples or single atoms, and treat the rest as a quoted expression. That way one could write from_context(key, quote(do: fn x -> ... end)).

It would be very convenient and I don’t mind wrapping it in a quote.
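
For illustration, the normalization could look something like the sketch below. Note that from_context and normalize_action are hypothetical names, not NimbleParsec’s actual API, and value stands for the parsed value in the code a defparsec-style macro would generate:

defmodule FromContextSketch do
  # Turn whatever the user passed into a quoted call that can be spliced
  # into generated code. `value` is a placeholder for the parsed value.
  def normalize_action({mod, fun, args}) when is_atom(mod) and is_atom(fun) do
    # An MFA tuple compiles to a plain remote call
    quote do
      unquote(mod).unquote(fun)(value, unquote_splicing(args))
    end
  end

  def normalize_action(fun) when is_atom(fun) do
    # A bare atom is treated as a local function name
    quote do
      unquote(fun)(value)
    end
  end

  def normalize_action(quoted) do
    # Anything else is assumed to already be a quoted expression,
    # e.g. quote(do: fn x -> x * 2 end), and is called with the value
    quote do
      unquote(quoted).(value)
    end
  end
end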


Our parser is currently one of the most stable parts of our code. It works well for us, and I don’t see the need to switch to another parser lib (even though I’m not completely happy with combine). That said, if we were to switch to something else, I’d explore other avenues, most notably rolling our own parser library, instead of using ex_spirit or nimble_parsec. There are two main reasons for this.

First, parsing speed is absolutely irrelevant for us. We spend much more time executing the query than parsing it. We probably wouldn’t care if our parser were 100x or even 1000x slower.

Another thing is that we’re building an AST from tokens, not from bytes or characters. Originally we started with single-pass parsing (i.e. tokenizing and building the AST in the same pass), but we quickly found out that this significantly complicated the parser code. So we split it into two passes, which makes the parser code much easier to follow. Since parsing speed is irrelevant for us, we can afford the luxury of two passes.

My impression is that most parser libraries available today are designed for fast parsing of binary input. That’s definitely a valid target, but it’s not what we need. Therefore, it seems to me that the trade-offs made in such libraries are irrelevant, if not outright counterproductive, for our particular case. For example, we don’t really need to make any decisions at compile time, so we don’t care about defparser or similar macros at all. In fact, for our case I’d prefer a vanilla combinator library which only uses plain functions, not macros.

Most notably, if a parser is forcing us to deal with string input, we probably wouldn’t use it.

ExSpirit is great when used as a token parser. You just have to implement a custom combinator, say satisfies?(predicate), which tests whether the token satisfies a certain predicate. I’ve used it as a token parser once and it was pretty cool.
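
To make the satisfies?(predicate) idea concrete, here is a minimal hand-rolled sketch of token combinators as plain functions. This only illustrates the pattern; it is not ExSpirit’s (or NimbleParsec’s) actual API:

defmodule TokenCombinators do
  # A parser is a function from a token list to {:ok, value, rest} | :error

  # satisfies/1: succeed on exactly one token iff it passes the predicate
  def satisfies(predicate) do
    fn
      [token | rest] -> if predicate.(token), do: {:ok, token, rest}, else: :error
      [] -> :error
    end
  end

  # choice/1: try each parser in order, keeping the first success
  def choice(parsers) do
    fn tokens ->
      Enum.find_value(parsers, :error, fn parser ->
        case parser.(tokens) do
          {:ok, _, _} = ok -> ok
          :error -> nil
        end
      end)
    end
  end
end

# Usage, with tokens produced by a separate tokenizer pass:
# op = TokenCombinators.satisfies(&(&1 in [:gt, :lt, :eq]))
# op.([:gt, :identifier]) #=> {:ok, :gt, [:identifier]}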

That’s actually a pattern that the C++ Spirit supports (it can accept any kind of iterable, not just strings), and my ExSpirit follows the same conventions. The only things that assume strings are the things in the Text module. It would be very trivial to create a, oh, Matcher module for token matching (you can create it in user code with ease; a single function would probably work very well, and I think C++’s token parsing uses just a single terminal to handle it all). The main ExSpirit library is agnostic to what is actually being parsed (ignoring a few things like eoi and lexeme that I really should protocolize, or at least handle a few Erlang-style forms of). But parsing non-strings was entirely in mind when it was designed. :)

It absolutely does in most parsers, but you really should try the C++ Spirit design; many modern PEG libraries have copied its style for a reason.

It’s not always what I need either; I’ve used the C++ Spirit library to parse bytecode streams, walk and decompose trees, etc. It is an extremely powerful data transformation library, from something that is iterable to something that is structured and vice-versa (so many people miss the nice vice-versa part). :)

Yep, this! It’s really easy to define your own terminals in user code. :)

Maybe it’s the lack of a skipper parser?

That is such a huge thing that it astounds me that not all language parsers already support it! o.O

It’s not the only thing, but yeah, I’d say it’s easily a good 70% of it.
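
For readers who haven’t met the term: a skipper is a parser that runs implicitly between terminals to discard ignorable input (whitespace, comments). NimbleParsec has no built-in notion of one, but the idea can be approximated by hand; a sketch:

defmodule SkipperExample do
  import NimbleParsec

  # The "skipper": whitespace and //-style line comments
  skipper =
    repeat(
      choice([
        ascii_char([?\s, ?\t, ?\n]),
        string("//") |> repeat(ascii_char([not: ?\n]))
      ])
    )

  # By hand, every token has to skip ignorable input itself; a library-level
  # skipper would thread this through all terminals automatically
  abc = ignore(skipper) |> string("abc")
  xyz = ignore(skipper) |> string("xyz")

  defparsec :pair, abc |> concat(xyz)
end

With this, SkipperExample.pair("  abc // comment\nxyz") matches and returns only ["abc", "xyz"], with the whitespace and the comment silently skipped.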

For what it’s worth, I’ve been reading through the Spirit docs, and I actually think that there are relatively few core, primitive features missing from NimbleParsec to start approaching that level of functionality (modulo making some particular substructures more efficient, but I think that should come later). Especially once you start looking at X3 and how it uses C++ lambdas for its semantic actions, and how these “context sensitive” parsers are just a special case of the lazy parser (which generates a new parser at runtime), it slowly starts clicking into place. I’m currently knocking the rust off my C++ skills (there’s a Rust pun in there somewhere) and delving into the Spirit code to see how it’s structured internally, and maybe I’ll uncover some ideas for NimbleParsec.

One thing to note is that while Spirit does a lot with C++ syntax, it has the benefit of a massively optimising compiler behind it, meaning the “resultant code” it generates through templating does not have to be so efficient, because the C++ compiler will inline functions, unroll loops, optimise accesses, and so on. I think a big point to note with NimbleParsec is that it is itself a compiler of sorts: as much as it’s nice to have functions which return “parsers”, they don’t really return parsers; they return a custom AST (not Elixir AST, NimbleParsec AST) which NimbleParsec then compiles into actual Elixir functions. I’m not sure what the implications of that are yet, but I think it’s useful to keep in mind when thinking about NimbleParsec.
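
As the annotated output earlier in the thread already hints, you can see this directly in iex. The exact tuple shapes are an internal implementation detail and may differ between versions:

import NimbleParsec

# string/1 returns data describing the parser, not a function; nothing is
# compiled until defparsec/defparsecp turns the description into clauses
IO.inspect(string("abc"))
#=> something like [{:string, "abc"}]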


Yeah, context-sensitive parsing is just about generating parsers at runtime depending on the data in the context map. The tricky part is optimizing that as much as possible. But I believe NimbleParsec could support it with a few changes to the overall architecture.
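
In plain functions the “lazy” idea is tiny. A sketch follows, with hypothetical shapes rather than NimbleParsec’s API; here a parser is fn input, context -> {:ok, value, rest, context} | :error end:

# Defer building the next parser until runtime, so it can depend on
# whatever earlier parsers stored in the context map
lazy = fn builder ->
  fn input, context ->
    parser = builder.(context)
    parser.(input, context)
  end
end

# Classic use: match a closing delimiter equal to the opening one
literal = fn expected ->
  fn input, context ->
    size = byte_size(expected)

    case input do
      <<^expected::binary-size(size), rest::binary>> -> {:ok, expected, rest, context}
      _ -> :error
    end
  end
end

closing = lazy.(fn context -> literal.(context.open_delimiter) end)
closing.("***rest", %{open_delimiter: "***"})
#=> {:ok, "***", "rest", %{open_delimiter: "***"}}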


Although it does rely on the compiler for that, it generates the code in the type system for in-place generation, thus no intermediary functions, etc. That means the compiler can optimize it really easily. In addition, there are optimization passes that are run over the type-tree to optimize certain constructs (via the Proto library it uses behind it). There is a significant amount of work it does itself to ensure code is generated in a very specific way to get the assembly output that they want.

+1

@josevalim would you be open to changing the “parser AST” in order to include metadata that could be used with my combinators/1 macro?


I think what I was getting at here was that, in the case of C++, returning a parser “dynamically” is not very different from having a “static” one: it is just a closure with the same code as the “static” case, with the compiler transparently providing bindings for some variables and populating them at runtime. That’s because all of the code goes through the same compiler: the C++ compiler.

In NimbleParsec’s case, the defparsec macro is itself a compiler. Unless it’s somehow possible to move its logic higher up (e.g. as a Mix compiler) so that it can “transparently” act on code inside “dynamic” fns (i.e. inside closures), I think the “right” way to implement this is not to have worse-performing dynamic-style parsecs, but rather to empower NimbleParsec to also transform fn x, y, z -> parsec into a real piece of the optimised NimbleParsec code, tracking all of the bindings to x, y and z and actually injecting them into the generated code; essentially what the compiler does anyway when constructing closures.


Sure. Can you do a quick and small PR that changes one or two combinators, as an example? And once we agree on the shape we can change the rest.


I thought it would be easy to make some minimal changes in NimbleParsec without understanding the whole codebase, but it hasn’t proved easy at all. I’ll have to take some time to understand it better before I can change anything.


Our frontend implements a custom language for users to query data (simpler than SQL). I am wondering if NimbleParsec has a way to help me provide “autocomplete”-like functionality.

For example, a user types if points and then an autocomplete pop-up appears with a list of possible next tokens, such as [">", "<", "="].
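
NimbleParsec doesn’t provide anything like this out of the box, but one low-tech approach is to keep the list of possible next tokens yourself and probe each candidate parser against what the user has typed after the already-parsed prefix. A sketch under those assumptions:

defmodule Autocomplete do
  import NimbleParsec

  defparsec :gt, string(">")
  defparsec :lt, string("<")
  defparsec :eq, string("=")

  defp candidates do
    [{">", &__MODULE__.gt/1}, {"<", &__MODULE__.lt/1}, {"=", &__MODULE__.eq/1}]
  end

  # Nothing typed yet after "if points ": offer every candidate
  def suggestions(""), do: Enum.map(candidates(), fn {label, _} -> label end)

  # Otherwise keep only the candidates whose parser accepts the typed rest
  def suggestions(typed_rest) do
    for {label, parser} <- candidates(),
        match?({:ok, _, _, _, _, _}, parser.(typed_rest)),
        do: label
  end
end

# Autocomplete.suggestions("")  #=> [">", "<", "="]
# Autocomplete.suggestions(">") #=> [">"]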

I’m trying to write a parser combinator that matches (and consumes) a regex at the beginning of the input string. For example:

defmodule RegexParserExample do
  import NimbleParsec
  import RegexParsec
  # Parses JavaScript-style line comments
  comment = regex(~r[//(.*?)\n])
  defparsec(:comment, comment)
end

The closest I have come is the following, using the post_traverse combinator:

defmodule RegexParsec do
  import NimbleParsec

  def regex(combinator \\ empty(), %Regex{} = re) do
    anchored_re = Regex.compile!("^" <> re.source, re.opts)
    post_traverse(combinator, {__MODULE__, :__regex__, [anchored_re]})
  end

  def __regex__(input, args, context, _line, _offset, re) do
    case Regex.run(re, input) do
      [bin | _] ->
        length = byte_size(bin)
        <<_part::binary-size(length), _rest::binary>> = input
        # Doesn't consume input!
        {[bin | args], context}

      nil ->
        {:error, "doesn't match regex #{inspect(re)}"}
    end
  end
end

The problem is that none of the *_traverse combinators allow me to actually consume the input. They can only transform a result that comes from another parser, not consume new bytes from the input string. The only way I can think of doing this is by having my parser return a parsec, but I’m not sure how to keep the context in a custom parsec (I don’t think any parser out there is actually using the context, but still).

I don’t know if this approach is even recommended, because I don’t think there’s an API for adding parsecs aside from the obvious one.

Can the new lookahead stuff be used for this? First you look ahead and then you parse if it matches.

I can look ahead just fine; the problem is actually consuming the input. I’d need something like a combinator that would see how many bytes the regex matched and then advance that many bytes in the string. I’ve actually found a solution that depends on creating a custom parsec from scratch, but it’s not the most elegant:

defmodule RegexParsec do
  import NimbleParsec

  defmacro defregexparsec(name, re) do
    {evaluated_re, _} = Code.eval_quoted(re)
    anchored_re = Regex.compile!("^" <> evaluated_re.source, evaluated_re.opts)
    escaped_re = Macro.escape(anchored_re)
    name__0 = String.to_atom(Atom.to_string(name) <> "__0")

    quote do
      def unquote(name)(input) do
        unquote(name__0)(input, [], [], %{}, {1, 0}, 0)
      end

      def unquote(name__0)(input, acc, stack, context, {line, offset}, column) do
        case Regex.run(unquote(escaped_re), input) do
          [bin | _] ->
            length = byte_size(bin)
            # Split the string in two parts; keep the second part
            <<_part::binary-size(length), rest::binary>> = input
            # How many lines does the matched text span?
            lines = String.split(bin, "\n")
            delta_lines = length(lines) - 1
            # The column is the byte size of the last matched line (I think?)
            new_column =
              lines
              |> List.last()
              |> byte_size()

            # I hope I'm returning the right format?
            {:ok, [bin | acc], rest, context, {line + delta_lines, new_column}, column + length}

          nil ->
            {:error, unquote("expected to match regex #{inspect(evaluated_re)}"), input, context,
             {line, offset}, column}
        end
      end
    end
  end
end

defmodule RegexParserExample do
  import NimbleParsec
  import RegexParsec

  defregexparsec(:comment, ~r[//(.*?)\n])
  defparsec(:comments, repeat(parsec(:comment)))
end

It depends on some implementation details of nimble_parsec, but it seems to work.

In case someone might actually think of using it, be aware that this approach is veeeery slow… Matching regexes in the middle of a parsec is a big “No” for performance. An interesting alternative would be to compile a regex to NimbleParsec combinators, but that seems more work than it’s worth.