NimbleParsec - a simple and fast parser combinator for Elixir

Your problem here is not a lack of bases, it’s the lack of a state system xD

I have added support for state. Here is how to implement a simple XML parser with it:

We can support two new primitives here. One is context_choice which allows you to choose which branch to process based on what is under a given key in the context.

The other one is to allow context keys to be given to times. So you can say “this integer min/max/length will be given by certain key in context”.

Both should be straightforward to add and are generalizations of primitives we currently have in place. I don’t plan to work on them now, but if someone is working on a parser that needs them, please open an issue with the use case, code examples and rationale, and we can find out the best way to handle it.
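To make the proposal concrete, here is a purely hypothetical sketch of how those two primitives might look. Neither `context_choice/2` nor the `{:context, key}` option exists in NimbleParsec; all of these names are invented for illustration:

```elixir
# HYPOTHETICAL -- neither primitive exists in NimbleParsec today.

# context_choice/2 would pick a branch based on the value stored
# under the given key in the parsing context:
body =
  context_choice(:mode,
    xml: parsec(:xml_node),
    text: utf8_string([], min: 1)
  )

# times/2 would accept a context key for its bounds, so a
# previously-parsed length field could drive the repetition:
fixed_field = times(ascii_char([?0..?9]), {:context, :field_length})
```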

3 Likes

@josevalim: How about adding optional opts?

I prefer to write: rule(term, ignore: true, optional: true) instead of optional(ignore(rule(term))).

For a real-world example, simply compare my version with its original form from the README.md:

defmodule MyParser do
  import NimbleParsec

  date =
    integer(4)
    |> literal("-", ignore: true)
    |> integer(2)
    |> literal("-", ignore: true)
    |> integer(2)

  time =
    integer(2)
    |> literal(":", ignore: true)
    |> integer(2)
    |> literal(":", ignore: true)
    |> integer(2)
    |> literal("Z", optional: true)

  defparsec :datetime, date |> literal("T", ignore: true) |> concat(time), debug: true
end

I know that’s extra work for you, but at least for me the code will look much cleaner.

defmodule NimbleParsec do
  @moduledoc "README.md"
             |> File.read!()
             |> String.split("<!-- MDOC !-->")
             |> Enum.fetch!(1)

Where has this been my whole life?

7 Likes

Yep, that’s the traditional way in the original Spirit, though its syntax makes it shorter than the equivalent in Elixir.

And the ability to pass state as arguments to the parsers… ^.^;

A few issues:

First, the length. All of this in Nimble:

  text =
    ascii_char(not: ?<)
    |> repeat(ascii_char(not: ?<))
    |> reduce({List, :to_string, []})

Is this in ExSpirit:

  defrule text( chars(-?<) )

And this in Nimble:

  letter = ascii_char([?a..?z, ?A..?Z])
  tag = letter |> repeat(letter) |> reduce({List, :to_string, []})

Is this in ExSpirit:

  defrule tag_name( chars([?a..?z, ?A..?Z, ?0..?9, ?_, ?-]) )

And this in Nimble:

  opening_tag =
    ignore(string("<"))
    |> concat(tag)
    |> ignore(string(">"))

  closing_tag =
    ignore(string("</"))
    |> concat(tag)
    |> ignore(string(">"))

  defparsec :__xml__,
            opening_tag
            |> traverse(:store_tag_in_context)
            |> repeat_until(
              choice([
                parsec(:__xml__),
                text
              ]),
              [string("</")]
            )
            |> wrap()
            |> concat(closing_tag)
            |> traverse(:check_close_tag_and_emit_tag)

  defp store_tag_in_context([tag], %{tags: tags} = context, _line, _offset) do
    {[tag], %{context | tags: [tag | tags]}}
  end

  defp check_close_tag_and_emit_tag([closing, [opening | contents]], context, _line, _offset) do
    if closing == opening do
      context = update_in(context.tags, &tl/1)

      element =
        case contents do
          [text] -> {String.to_atom(opening), [], text}
          nodes -> {String.to_atom(opening), [], nodes}
        end

      {[element], context}
    else
      {:error, "closing tag #{inspect(closing)} did not match opening tag #{inspect(opening)}"}
    end
  end
end

Is just this in ExSpirit:

  defrule tag(
    lit(?<) |> tag_name() |> put_state(:tagname, :result) |> lit(?>) |> expect(seq([
      get_state_into(:tagname, tag(&1, repeat(node_()))),
      lit("</"), get_state_into(:tagname, lit(&1)), lit(?>)
    ]))
  )

  defrule node_(
    alt([
      tag(),
      text(),
    ])
  )

In addition, though this is just an example, it was demonstrating features that, while easy to bypass in this example (you can do it in ExSpirit the same way you did in Nimble), are not bypassable in many other cases. Specifically, this line in ExSpirit:

get_state_into(:tagname, lit(&1))

How would this be done in Nimble? Your example does not demonstrate it, so it is not a faithful recreation. This specific expression, though in this case it could be handled by comparing ‘later’, shows that you can parameterize parsers based on prior parsed data, and that is what I set out to see how you could do in Nimble. It is an absolutely mandatory capability for parsing a lot of languages; without it, the parser can become infinitely large in order to handle all possible variants.

To make it slightly more complex (though still not as infinitely configurable as some languages get): how would you parse the XML example <foo><bar>one<baz>two</foo> as if it were <foo><bar>one<baz>two</baz></bar></foo>? In ExSpirit that would involve changing this:

      lit("</"), get_state_into(:tagname, lit(&1)), lit(?>)

To something like (although format it better, of course):

      alt([lit("</"), get_state_into(:tagname, lit(&1)), lit(?>), success()])

And then you get nice auto-closing tags when left open, and yet it still fails if there is a mismatched closing tag.

Also, these two cases in your parse call:

      {:ok, _, rest, %{tags: []}, line, offset} ->
        {:error, "document continued after last closing tag", rest, line, offset}

      {:ok, _, rest, %{tags: [tag | _]}, line, offset} ->
        {:error, "tag #{inspect(tag)} was not closed", rest, line, offset}

How would you enforce those as part of the grammar? In ExSpirit, for the first you just add an eoi() at the end (and an alternate fail("reason") if you want a customized message). The second is already an error in ExSpirit, so why does that parse at all in Nimble? It really absolutely should not (not as big an issue for something trivial like SimpleXML, but a larger one when you have complex grammars that need to be self-terminating based on state information).

For comparison, here’s how you make SimpleAST in the original Spirit:

        text = lexeme[+(char_ - '<')        [_val += _1]];
        node = (xml | text)                 [_val = _1];

        start_tag =
                '<'
            >>  !lit('/')
            >>  lexeme[+(char_ - '>')       [_val += _1]]
            >>  '>'
        ;

        end_tag =
                "</"
            >>  string(_r1)
            >>  '>'
        ;

        xml =
                start_tag                   [at_c<0>(_val) = _1]
            >>  *node                       [push_back(at_c<1>(_val), _1)]
            >>  end_tag(at_c<0>(_val))
        ;

And yes, that is entirely pure and valid C++, no preprocessor or anything of the sort needed. Consequently, @tmbb, you see how the local state it uses has to be passed up and down through the parsers? And yes, it will generate code that is at least on par with, and usually better than, hand-rolling your own parser in any language, even assembly, without substantial work (and even with that work you will only get on par with Spirit in speed; literally, if you ever find a case where Spirit is slower than anything else, it can be optimized to fix that, which is part of its design. I did not follow that design in ExSpirit since it was quickly whipped up, though the expression version would be better at it).

That is not as useful though; you need to pass state into the parser arguments to be able to parse many language grammars, or else you end up requiring infinite parser variations (or falling back to multiple passes, more code, mismatched errors between the parsers, and all that).

It would not just be times; you’d have to allow passing state as an argument to any parser to really allow it in full (which the original Spirit allows for, and ExSpirit allows for in ‘most’ cases, though not all…).

Eh, it might be nicer, but generalizing that to everything everyone might possibly make sounds a bit irritating, since in his design they are all just functions and not a built expression tree like the C++ Spirit is.

Lol! Just module attributes so you can do whatever. ^.^

Also another thing, how do you define a skip parser in Nimble? Or do you have to add the skip parser to every-single-possible-rule-and-skippable-location-which-is-utterly-ginormous-in-count?

Not planned. You can easily wrap it though!

As the name of the library says, my plan is to stay nimble and not focus on providing high-level constructs. :slight_smile: The fact that it allows runtime composition, plus the focus on primitives, means you should be able to build chars and text yourself, like I did. Put those in a module as regular functions and you will be able to use them as if they were part of nimble.
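For instance (my sketch, not code from the thread), an ExSpirit-style `chars/1` helper can be defined as a regular function that returns a combinator, using the current NimbleParsec API:

```elixir
defmodule MyCombinators do
  import NimbleParsec

  # ExSpirit-style chars/1: one or more characters in the given
  # ranges, collected into a single binary
  def chars(ranges) do
    ascii_char(ranges)
    |> repeat(ascii_char(ranges))
    |> reduce({List, :to_string, []})
  end
end

defmodule MyParser do
  import NimbleParsec

  # the helper now reads as if it were a NimbleParsec primitive
  defparsec :text, MyCombinators.chars(not: ?<)
end
```

With this, `MyParser.text("hello<b>")` parses `"hello"` and leaves `"<b>"` as the rest.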

To clarify, I was not trying to recreate the ex_spirit parser here. I haven’t checked its implementation. In any case, what you proposed can be implemented with traverse, which receives the context and can return the accumulator and a possibly updated context.

It can be done by using traverse as well but it should be easy to add a lookahead construct too.

I don’t think this will ever be possible. All combinators are compiled down to function clauses. A combinator cannot be used with state information unless the underlying combinator has been taught to read from state. Which goes back to the point I want to rather focus on the small set of primitives, making them powerful enough to build everything else on top of it.

I mentioned this in the issues tracker, but taking a traditional parser combinator and translating it to NimbleParsec is not going to work. I am pretty sure the C++ implementation is faster and more flexible than NimbleParsec, but for nimble we need to stay within the rules of the Erlang VM. That’s why I am more interested in particular examples we cannot handle than in a particular feature, because when all is said and done, the feature may need to be implemented in a completely different way in nimble.

v0.2.0 is out with lookahead, user contexts, custom errors, byte offsets and ascii_string/3 and utf8_string/3. See the CHANGELOG:

4 Likes

Cool. I’ll start rewriting my Elixir lexer directly in nimble_parsec. Then, when I see anything worth refactoring into its own module, I’ll start porting Makeup to see if it can be based on nimble_parsec (Makeup is basically a library of useful combinators that can be reused between languages, plus some plumbing so that the entire chain from lexer to formatter works).

This might take some time, especially because there isn’t a 1-to-1 mapping between ExSpirit combinators and NimbleParsec combinators.

1 Like

I will be very glad to help if you have any questions or you need help translating some constructs!

2 Likes

Also, maybe a good suggestion is to start with the smaller language you currently handle. So we can directly compare compile and execution times.

start with the smaller language you currently handle

Smallest language? Currently I only handle HTML5 (without CSS and Javascript, those are in the making) and Elixir. The smallest language would be HTML, which has the “benefit” of using a context-sensitive approach. That’s a good idea.

(EDIT: have other lexers “almost ready” but still not good enough for publication)

1 Like

I still have to study NimbleParsec a little better, because it’s quite a mental shift from ExSpirit to NimbleParsec (which operates inside the constraints of the BEAM).

Oops, yes, smallest. :slight_smile:

HTML5 sounds like a great start. Although Nimble compiles and runs faster than ExSpirit for things like datetime, there is no guarantee it is also faster when handling languages like HTML5, so being able to validate that would be useful.

For all interested in creating a valid HTML5 parser: it’s not as easy as it looks. Let me share some useful links here:

  1. Understanding HTML, XML and XHTML article
  2. Parsing HTML documents - part of the specification (chapter 12.2) on parsing HTML documents, from the WHATWG community

@josevalim: I think the 2nd link could be helpful for you (if you have time to read it), because you can look at parsing HTML documents step by step (it’s much more complicated than simple datetime parsers), so you can check whether making such an advanced parser is doable with your library. I hope it’s a helpful resource.

1 Like

Thanks @Eiji, I will take a look at it.

Although I think @tmbb will find out the limitations of nimble faster than me. :slight_smile:

Btw, if someone is looking for fun things to do with NimbleParsec, we have a simple markdown parser in Elixir that converts markdown to ANSI code: https://github.com/elixir-lang/elixir/blob/master/lib/elixir/lib/io/ansi/docs.ex

I think we can have a nicer approach with NimbleParsec if we first parse the markdown into tokens and only then format it to ANSI. Unfortunately, Elixir cannot depend on NimbleParsec, but since NimbleParsec emits code without a runtime dependency on itself, we can emit the parser code and just embed it in Elixir. :smiley:

5 Likes

Yep, parsing HTML5 ‘properly’ is hell itself. ^.^;

However, doing lax SGML-style parsing along with the optional header element will grab 99%+ of all constructs. :slight_smile:

I’m the author of Makeup (a library for syntax highlighting of source code) and ExDocMakeup (a markdown processor that can be used with ExDoc that uses Makeup for syntax highlighting of code examples in the docs). It was a discussion around Makeup that initially prompted the development of NimbleParsec. Up until the most recent version, Makeup used to depend on the ExSpirit library by @OvermindDL1, which is much more flexible than NimbleParsec (as the discussion above shows), but much, much slower.

Using Benchee, I’ve prepared some benchmarks pitting the current version of Makeup (0.4.0) against the old version. I’ve gotten approximately 10x speedups in both compilation time and runtime.

Here’s how to interpret the benchmarks below:

  • Formatter performance: the part that takes up a list of tokens and converts it into an HTML fragment
  • Lexer: the part that reads the raw source code and produces a list of tokens
  • Lexer compilation time: this was done by invoking the Elixir compiler at runtime on the relevant file. This is a good proxy measure of the “real” compilation process (it’s pretty cool that it’s so easy to compile code at runtime in Elixir - never use this in production, though!)

The most important number is “Lexer + Formatter”, because that’s what the user will run most of the time. The benchmarks use a fake Elixir file with about 250 lines, written by the authors of the Python library Pygments (which is similar to Makeup) to demonstrate most of the syntax rules.

ExSpirit (old version):

Name                                                              ips        average  deviation         median         99th %
Formatter performance                                          124.40        8.04 ms    ±97.25%          15 ms          16 ms
Reading file from disk + Lexer + Formatter (end to end)          7.29      137.24 ms     ±7.99%         140 ms         172 ms
Lexer performance                                                7.18      139.33 ms     ±9.72%         140 ms         157 ms
Lexer + Formatter                                                6.83      146.43 ms    ±10.56%         141 ms         187 ms
Lexer compilation time                                         0.0355       28156 ms     ±0.00%       28156 ms       28156 ms

NimbleParsec (new version):

Name                                                              ips        average  deviation         median         99th %
Formatter performance                                          620.02        1.61 ms    ±32.96%        1.60 ms        3.20 ms
Lexer performance                                              108.31        9.23 ms     ±4.86%        9.40 ms        9.40 ms
Lexer + Formatter                                               94.72       10.56 ms    ±69.37%          15 ms          16 ms
Reading file from disk + Lexer + Formatter (end to end)         92.32       10.83 ms    ±66.61%          15 ms          16 ms
Lexer compilation time                                           0.26     3812.50 ms     ±0.01%     3812.50 ms        3813 ms

The secret to the Formatter’s performance improvements was to use iolists whenever possible instead of binaries, after a suggestion by @josevalim. I’d never have thought that the difference would be so great. It turns out that concatenating strings on the BEAM is very slow and requires lots of copying. @josevalim has been extremely helpful in suggesting improvements to the lexer, answering questions about NimbleParsec and giving useful tips on how to increase the performance of my library.

Performance improvements in the lexer were mainly due to the use of NimbleParsec (and to an obsessive amount of microbenchmarking). The performance of my lexer is now pretty much the same as the performance of the Elixir formatter, which means there are probably very few gains left to be had. Makeup is quite similar to the Elixir formatter in that it parses files and prints a beautified version of the output. The formatter was also written by people who are probably smarter than me and who have some actual knowledge of the internals of the BEAM.

This newest version of Makeup itself is not yet ready for prime time. You can help by running it on examples of real Elixir code, inspecting the output and submitting an issue with examples that look wrong. Such examples can be included in the test suite, which sadly doesn’t cover all syntax rules yet (and may never cover all rule combinations; that’s why it’s important to run it on real code to spot issues).

All of this will be easier after I’ve updated the docs with more useful information.

From my experience in the last few days, NimbleParsec seems to be ready to be used for real projects. It’s the ideal option for those who need to parse context-free languages, and with some postprocessing steps it can be used to help parse some context-sensitive ones. I don’t believe I’ve found any bug while using it for Makeup.

Thanks to @josevalim for writing this great library and for all the help and to @OvermindDL1 who’s written ExSpirit, without which Makeup would have never been written. Although quite slow, ExSpirit is way faster than almost anything out there and is still the only Elixir parser library that supports parsing context-sensitive languages.

If you’ll only take away one thing from this post, take this: string concatenation is slow. Iolists are the secret sauce that will make your programs go fast. Don’t replace your code without benchmarking it, though. Performance on the BEAM is weird, don’t trust your instincts from other languages.
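As a small illustration of the point (my sketch, not code from the post): appending to a binary copies the whole accumulator on every step, while consing onto an iolist is constant-time and the VM flattens it only once at the end.

```elixir
chunks = ["a", "b", "c"]

# slow: each <> copies the entire accumulated binary
slow = Enum.reduce(chunks, "", fn chunk, acc -> acc <> chunk end)

# fast: build a nested iolist, flatten once at the end
fast =
  chunks
  |> Enum.reduce([], fn chunk, acc -> [acc, chunk] end)
  |> IO.iodata_to_binary()

# both yield "abc"; the iolist version avoids the quadratic copying
```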

12 Likes

+++

Yeah definitely should be outputting IOlists, even with ExSpirit. ^.^;

I’m really curious about when it will be able to parse context-sensitive languages, then it can fully replace ExSpirit and I can deprecate it (less stuff for me to manage/update, the better!). :slight_smile:

Heh thanks, it was birthed because I needed it and nothing else out supported the features I needed. ^.^

I really want NimbleParsec to replace it, needs a few more features to be able to do that though. :slight_smile:

6 Likes

I’m rewriting (for the 3rd time!) the language tag parser in ex_cldr using nimble_parsec. One challenge I have is that the ABNF grammar has a couple of irregularities. For example:

zh-yue-HK means Chinese, Cantonese variant, as spoken in Hong Kong. Notice that the “language extensions” yue is three alpha characters.

zh-Hant-HK means Chinese, using the traditional Chinese script. Note that the script is four alpha characters.

The grammar for this part of the language tag is:

   langtag       = language
                   ["-" script]

   language      = 2*3ALPHA            ; shortest ISO 639 code
                   ["-" extlang]       ; sometimes followed by
                                       ; extended language subtags

   extlang       = 3ALPHA              ; selected ISO 639 codes
                   *2("-" 3ALPHA)      ; permanently reserved

   script        = 4ALPHA              ; ISO 15924 code

The issue I’m facing is that if there is a script but no extlang, the extlang parser will consume the first three characters of the script and then fail with a parse error. For example, zh-Hant-HK will be parsed as language zh, language extension Han, and then fail when parsing t-HK.

I thought the lookahead macro would help here - I can look ahead in the stream and if the next character isn’t a - then I can fail the extlang parser. But … I can only {:error, _} the whole parse tree - not fail this one parser.

Is there another alternative approach I can use to {:fail, _} a parser without terminating the full parse?
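One possible approach (a sketch using the current NimbleParsec API, not code from ex_cldr): guard the extlang with a lookahead, so it only commits to a 3ALPHA subtag when the next thing is another "-" or end of input. Since optional/choice backtrack when their inner combinator fails, a 4ALPHA script like Hant then falls through to the script parser.

```elixir
defmodule LangTag do
  import NimbleParsec

  # language: 2*3ALPHA
  language = ascii_string([?a..?z, ?A..?Z], min: 2, max: 3)

  # extlang: "-" 3ALPHA, but only when followed by "-" or end of
  # input, so it cannot eat the first three letters of a 4ALPHA script
  extlang =
    ignore(string("-"))
    |> ascii_string([?a..?z, ?A..?Z], 3)
    |> lookahead(choice([string("-"), eos()]))

  # script: "-" 4ALPHA
  script =
    ignore(string("-"))
    |> ascii_string([?a..?z, ?A..?Z], 4)

  defparsec :langtag, language |> optional(extlang) |> optional(script)
end
```

With this sketch, zh-yue-HK yields ["zh", "yue"] and zh-Hant-HK yields ["zh", "Hant"], leaving the region subtag (not covered here) as the rest.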

The relevant code is here. Any and all comments welcome.

1 Like