NimbleParsec - a simple and fast parser combinator for Elixir

mjadczak · December 11, 2018, 9:47pm

For what it’s worth, it’s simple enough to recreate by hand, but I could not get this diff to parse properly (kept getting “malformed patch”). Maybe forum formatting broke it somehow?

OvermindDL1 · December 11, 2018, 9:48pm

Huh, so you’d have to start the parser back up again, in another function, instead of just doing it in-position? o.O?

josevalim · December 11, 2018, 9:51pm

What would lookahead do in this case? Would it look if the combinator matches but without changing the accumulate state or the input and continue if so, otherwise it “denies”? And lookahead_not is the opposite? I think we can retire the current lookahead in favor of it. Especially as the current lookahead is really just a traverse.

Indeed. The current patch would be the starting point, not the destination.

I mean this is in the nicest way possible but don’t start this gate keeping non sense. I know there is a computer science background to all of this but catching up on it is not my priority as my CS reading time is usually focused elsewhere. I am doing it mostly for fun. If I had to do all of this research before, the idea would still be on paper, which would be a shame. It is open source, there are no warranties, warts and all.

mjadczak · December 11, 2018, 10:00pm

Yes, exactly that. In fact, lookahead_not is the more fundamental one, since you could write def lookahead(c), do: c |> lookahead_not() |> lookahead_not() (although when compiled naively, it would require an extra case statement to execute). And I agree that the current lookahead could be retired and easily implemented using traverse when needed.

I agree that it’s probably not a good idea to require an academic paper before you start writing some code for fun But, let me offer an alternative perspective—for fields such as parsing, where there are decades of research out there, sometimes it’s really helpful to be able to pick up some fundamental properties of the thing you’re working with. It can definitely save a lot of time! For me, I spent a few hours trying to get lookahead-style parsing to work with NimbleParsec, before I ended up realising that it was actually a PEG parser with one combinator missing, and then I could immediately prove that it wasn’t possible. Also, for a side-project like this, with no comprehensive documentation or tutorials, being able to say “this is a PEG parser” lets people access all the different resources written about this style of parser. FWIW, I think having a fast PEG parser is a great asset to Elixir, and I’m also pretty certain that once negative lookahead is properly supported, we can confidently assert that NimbleParsec is at least as powerful as a PEG parser

josevalim · December 11, 2018, 10:13pm

I totally agree. My point is that because I did not start by wanting to write a complete parser, but rather something that can convert to pattern matching, it never crossed my mind to write a complete parser. And even though now I am aware it could be a complete parser, I don’t have the time to pour into studying what a complete parser is, so I either continue evolving it piece by piece with fantastic input like yours or I don’t evolve it at all.

I completely understand this is not ideal and I completely understand that by saying this some people may not use nimble_parsec as they would prefer a parsing library built on proper foundations. That’s fine by me because that was never my concern.

In any case, I have lookahead ready for choice and lookahead_not should be an invert of that, I will push soon, but I need to understand how it works with things like traverse and repeat before I consider it done.

OvermindDL1 · December 11, 2018, 10:18pm

What do you need to know? I’ve worked on (both building and using) Full PEG parsers for nearing 20 years at this point. ^.^

josevalim · December 11, 2018, 10:56pm

@OvermindDL1 @mjadczak is the behaviour of my repeat correct?

Take the example below:

defparsecp :lookahead_not_with_repeat,
               repeat(ascii_char([]) |> lookahead_not(ascii_char([?0..?9])))

I would expect it to return this:

assert lookahead_not_with_repeat("aa0") ==
               {:ok, 'a', "a0", %{}, {1, 0}, 1}

But it returns:

{:ok, '', "aa0", %{}, {1, 0}, 1}

I.e. it is happy to parse nothing if it halts anywhere. However, if I use times(min: 1), then it works as expected but then it always aborts at the minimum. So it seems they are not greedy.

josevalim · December 11, 2018, 11:10pm

@mjadczak here is a commit with lookahead and lookahead_not: https://github.com/plataformatec/nimble_parsec/commit/71663d1096bb9e556965fe7013cb2e462bf405c8

I will wait for your feedback then I will tidy it up with docs and what not.

OvermindDL1 · December 11, 2018, 11:20pm

PEG is absolutely greedy, it has no non-greedy constructs, you have to be explicit.

So let’s see:

repeat(ascii_char([]) |> lookahead_not(ascii_char([?0..?9])))

Which repeats just

ascii_char([]) |> lookahead_not(ascii_char([?0..?9]))

Which will match a, aa, aaaaa, but not a0, thus the first loop matches out the aa0 and extracts the first a and returns 97/?a, leaving a0, it repeats again and fails to match a0 thus ending, so repeat returns [97]/[?a]/'a', thus the successful return value should return [97] with a leftover of "a0", and this is indeed what ExSpirit does:

iex(1)> defmodule Testing do
...(1)>   use ExSpirit.Parser, text: true
...(1)>   defrule lookahead_not_with_repeat(repeat(char()|>lookahead_not(char(?0..?9))))
...(1)> end
{:module, Testing,
 <<70, 79, 82, 49, 0, 0, 111, 28, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 3, 5,
   0, 0, 0, 77, 14, 69, 108, 105, 120, 105, 114, 46, 84, 101, 115, 116, 105,
   110, 103, 8, 95, 95, 105, 110, 102, 111, 95, ...>>,
 {:lookahead_not_with_repeat, 1}}
iex(2)> import ExSpirit.Parser
ExSpirit.Parser
iex(3)> parse("aa0", Testing.lookahead_not_with_repeat)
%ExSpirit.Parser.Context{
  column: 2,
  error: nil,
  filename: "<unknown>",
  line: 1,
  position: 1,
  rest: "a0",
  result: 'a',
  rulestack: [],
  skipper: nil,
  state: %{},
  userdata: nil
}

In addition to other useful data like the matching position/column/line and so forth. Making it able to parse out any length of chars that don’t contain numbers after the starting position could be enforced via an expectation as well:

iex(15)> defmodule Testing do
...(15)>   use ExSpirit.Parser, text: true
...(15)>   defrule lookahead_not_with_repeat(repeat(char()|>expect(lookahead_not(char(?0..?9)))))
...(15)> end
warning: redefining module Testing (current version defined in memory)
  iex:15

{:module, Testing,
 <<70, 79, 82, 49, 0, 0, 112, 12, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 3, 23,                                                       
   0, 0, 0, 78, 14, 69, 108, 105, 120, 105, 114, 46, 84, 101, 115, 116, 105,                                                          
   110, 103, 8, 95, 95, 105, 110, 102, 111, 95, ...>>,
 {:lookahead_not_with_repeat, 1}}
iex(16)> import ExSpirit.Parser
ExSpirit.Parser
iex(17)> parse("aaaaa", Testing.lookahead_not_with_repeat)
%ExSpirit.Parser.Context{
  column: 6,
  error: nil,
  filename: "<unknown>",
  line: 1,
  position: 5,
  rest: "",
  result: 'aaaaa',
  rulestack: [],
  skipper: nil,
  state: %{},
  userdata: nil
}
iex(18)> parse("aa0", Testing.lookahead_not_with_repeat)
%ExSpirit.Parser.Context{
  column: 3,
  error: %ExSpirit.Parser.ExpectationFailureException{
    context: %ExSpirit.Parser.Context{
      column: 3,
      error: nil,
      filename: "<unknown>",
      line: 1,
      position: 2,
      rest: "0",
      result: 97,
      rulestack: [:lookahead_not_with_repeat],
      skipper: nil,
      state: %{},
      userdata: nil
    },
    extradata: %ExSpirit.Parser.Context{
      column: 4,
      error: nil,
      filename: "<unknown>",
      line: 1,
      position: 3,
      rest: "",
      result: 48,
      rulestack: [:lookahead_not_with_repeat],                                                                                        
      skipper: nil,
      state: %{},                                                                                                                     
      userdata: nil
    },
    message: "Lookahead_not failed"
  },
  filename: "<unknown>",
  line: 1,
  position: 2,
  rest: "0",
  result: nil,
  rulestack: [],
  skipper: nil,
  state: %{},
  userdata: nil
}
iex(19)> parse("aa0aaa", Testing.lookahead_not_with_repeat)
%ExSpirit.Parser.Context{
  column: 3,
  error: %ExSpirit.Parser.ExpectationFailureException{
    context: %ExSpirit.Parser.Context{
      column: 3,
      error: nil,
      filename: "<unknown>",
      line: 1,
      position: 2,
      rest: "0aaa",
      result: 97,
      rulestack: [:lookahead_not_with_repeat],
      skipper: nil,
      state: %{},
      userdata: nil
    },
    extradata: %ExSpirit.Parser.Context{
      column: 4,
      error: nil,
      filename: "<unknown>",
      line: 1,
      position: 3,
      rest: "aaa",
      result: 48,
      rulestack: [:lookahead_not_with_repeat],
      skipper: nil,
      state: %{},
      userdata: nil
    },
    message: "Lookahead_not failed"
  },
  filename: "<unknown>",
  line: 1,
  position: 2,
  rest: "0aaa",
  result: nil,
  rulestack: [],
  skipper: nil,
  state: %{},
  userdata: nil
}

The reason I point out except is it is also a very standard PEG operator in PEG libraries that allows you to ‘early kill’ something like an alt or so, like say you are parsing if, well if you put the rest of the parsing after if in an except(...) then if it fails to parse inside then you know the code is bad, you don’t need to test other alternatives like identifiers and so forth or you will just end up getting an error in another unrelated place. That should be added to NimbleParsec as well.

josevalim · December 12, 2018, 12:00am

Perfect. NimbleParsec is behaving the same way now. Thanks a lot!

mjadczak · December 12, 2018, 12:29am

Nice! And yes, I agree with how @OvermindDL1 describes it.

I think something to keep in mind, which may be useful in working out how something “should” behave, is a standard model of whether a parser “fails” or “succeeds”. Once every single parser can be said to either “fail” or “succeed” given some input, potentially consuming some input if it succeeds, then it becomes clear—repeat(p) tries to let p consume as much input as possible, while it succeeds, and once it fails (necessarily not consuming any input), it goes to the next thing in the sequence. choice([a, b, c]) tries to apply a to the input, and if it succeeds, proceeds to the next parser in the sequence. If a fails (not consuming input), then b is tried instead (and then c, and if all choices fail, the whole choice combinator fails).

N.B. Typically these things are expressed in terms of “consuming input”, but in NimbleParsec/Elixir terms “input consumed” == “the next combinator in the chain is called with a binary which no longer includes the consumed input”.

Once you have a unified way in which all parsers either succeed or fail, you can then e.g. express repeat_until like I said: repeat(lookahead_not(until) |> to_repeat()). The inner parser p in this case is lookahead_not(until) |> to_repeat() (where we’ve assumed that to_repeat() will consume some input, like a string or whatever). The outer repeat passes its current input to this p. The lookahead_not checks whether the immediate input (i.e. whatever’s at the start of the current binary) matches until. Let’s say at first it doesn’t. So lookahead_not succeeds, but does not consume any input—it lets to_repeat() consume it instead. Now the binary has moved along, and repeat gives it to p again. lookahead_not passes this input to until, and it matches! lookahead_not discards the results (accumulator) and also pretends the input was never consumed, but fails. Now repeat knows to succeed (it always succeeds, because it’s a 0 or more repetition) instead of to keep repeating.

(I hope I got the above all right, @OvermindDL1 please correct me otherwise)

Now, obviously if you actually implemented it like this, you would no longer have a super-efficient pattern-matching parser, but just yet-another-PEG parser which needs to interpret all of this information at runtime. Hence, I’m not necessarily saying that each parser/combinator should actually explicitly return an :ok or :error which is matched on. Rather, I’m saying that this is a good mental model of what is happening, and should help to answer questions like “how should this combination of parsers behave”?

Regardless, the commit looks good, especially now you’ve got the behaviour with repeat worked out. I’m looking forward to trying it out! I’m currently tinkering with trying to implement a full-featured (spec-compliant) XML parser using NimbleParsec

tmbb · December 12, 2018, 1:42am

This isn’t “gate-keeping nonsense”. I’m just saying that you or anyone else is not qualified to write parsers. In fact, on several occasions, I have praised the design of NimbleParsec and its composability, which is in many ways superior to ExSpirit, which is a good example of a fully general parser. NimbleParsec is also better than anything I would ever write, and also very fast, because of the way it compiles into binary pattern matches. This owes a lot to your insights on performance on the BEAM.

Tha said, I’d like to highlight a quote from post in thist opic:

This seems to have been lost in translation somwhere. I was under the impression that NimbleParsec was supposed to support (at least) the same as a PEG parser. I was under the impression that it already supported such grammars, or at least it had been designed with that goal in mind.

I mever got the impression that it was supposed to be a limited parser. Otherwise I’d never have suggested the context-sensitive features. Adding context-sensitivity on top of a parser that isn’t wven equivalent to a PEG parser was not one of my brightest ideas.

The main problem here is that parsers with different capabilities need different architectures.

For example, you can’t reliably parse HTML with regular expressions. Which means that if you qant to parse HTML your parser can’t be based on a regular expressions engine.

That’s why I think it’s important to decide which capabilities you want in your parser before commiting to an architecture.

From what I understand, the main goal is to compile into fast binary pattern matches. That is a perfectly valid design goal, and in no way i want to criticise that goal as invalid.

But, i is possible that this design will make it imposible to recognize certain patterns. And again, this is perfectly valid.

I just wanted to illustrate the tradeoff we’re dealing with… Was it clear?

josevalim · December 12, 2018, 8:31am

Love this! I have just verified this is actually the case and therefore removed repeat_until from master.

josevalim · December 12, 2018, 9:40am

One last update: I have optimized bound lookaheads, so looking ahead patterns where the size is known upfront now generates only one clause, which is super efficient. This makes things like repeat_until more efficient than it was before using a more composable construct! v0.5 should be out in the next minutes.

Indeed, thanks!

All I can say is that I never said any of this.

In any case, according to @mjadczak, we should be equivalent to a PEG parser now, so hurray. If somebody is confident about this topic and want to send a PR that amends the README accordingly, please do it!

tmbb · December 12, 2018, 9:54am

No, you didn’t, but because you presented it as a (not as featureful) alternative to ExSpirit, I thought it was at least as powerful as a PEG parser. But that was not your fault.

josevalim · December 12, 2018, 11:19am

@tmbb Oops, apparently NimbleParsec 0.5.0 broke makeup (and ex_doc). I have written a PR here: https://github.com/tmbb/makeup/pull/18 - can you please ship a new version?

It should also be relatively straight-forward to migrate to v0.5.0, you likely only need to rewrite replace_until(foo, bar) as replace(lookahead_not(bar) |> foo) but I can’t look into it right now, sorry.

tmbb · December 12, 2018, 11:51am

This sentence should have been “I’m NOT saying that you or anyone else is not qualified to write parsers”. Proofreading on mobile is hard

Yes, as soon as I get home.

OvermindDL1 · December 12, 2018, 5:06pm

Woot!

Precisely.

Traditional PEG parsers are little composable functions that operate on a ‘state’ structure, that state structure holds things like the unparsed input, positioning information, returned result, metadata, etc… ExSpirit is almost a direct translation of the ‘traditional’ PEG library, which though crazy efficient on efficient languages like C++/Rust/OCaml is less efficient on Elixir (hence why I really need to finish my Parserx version). But each ‘function’ just takes the state and returns state, the first thing every function should do is test if the state is in a good state (no error condition) and if not then immediately return it directly (passing the existing error back) else it does whatever it needs to do and transforms the state and returns it.

Functions that take other functions, like repeat or lookahead_not can do things like call the input and test it’s returns state until it fails, then take all the successful results and put them in a list and return them, or for lookahead_not it runs the function and it just returns a flipped error state from the originally passed in state (ignoring everything else of the state returned from the function).

mjadczak:

Once you have a unified way in which all parsers either succeed or fail, you can then e.g. express repeat_until like I said: repeat(lookahead_not(until) |> to_repeat()) . The inner parser p in this case is lookahead_not(until) |> to_repeat() (where we’ve assumed that to_repeat() will consume some input, like a string or whatever). The outer repeat passes its current input to this p . The lookahead_not checks whether the immediate input (i.e. whatever’s at the start of the current binary) matches until . Let’s say at first it doesn’t. So lookahead_not succeeds, but does not consume any input—it lets to_repeat() consume it instead. Now the binary has moved along, and repeat gives it to p again. lookahead_not passes this input to until , and it matches! lookahead_not discards the results (accumulator) and also pretends the input was never consumed, but fails. Now repeat knows to succeed (it always succeeds, because it’s a 0 or more repetition) instead of to keep repeating.

Close enough yeah, it’s more like the transforming ‘piped data’ flow I described above internally in most implementations.

It’s possible to make a pattern matching version of it though with a lot more code generation, I started working on such a thing in the Parserx version but never continued it since NimbleParsec came out that mostly completed the same thing (though with significantly fewer features), so I figured I could use NimbleParsec where it is good and mine where it was necessary (I’ve even used them together, NimbleParsec is very usable from within ExSpirit ^.^).

Yeesh, you’ll need multiple passes for that. You cannot represent a proper XML parser in PEG without state handling, thus meaning you have to use multiple passes. Like I made a mini-xml parser (no attributes) in ExSpirit as an example at ex_spirit/examples/simple_xml.exs at master · OvermindDL1/ex_spirit · GitHub and you can see the required state work for accurate parsing in it:


defmodule SimpleXML do
  use ExSpirit.Parser, text: true

  defrule text( chars(-?<) )

  defrule tag_name( chars([?a..?z, ?A..?Z, ?0..?9, ?_, ?-]) )

  defrule tag(
    lit(?<) |> tag_name() |> put_state(:tagname, :result) |> lit(?>) |> expect(seq([
      get_state_into(:tagname, tag(&1, repeat(node_()))),
      lit("</"), get_state_into(:tagname, lit(&1)), lit(?>)
    ]))
  )

  defrule node_(
    alt([
      tag(),
      text(),
    ])
  )

  def from_string(input) do
    parse(input, node_())
  end

end

You can see that when it parses the tag_name then it puts it into the state on that current rulestack of the name :tagname, then it pulls the :tagname state into two different places, first is into a tag, which just wraps the children so an XML node of <blah>children</blah> becomes {"blah", children}, and also into the closing tag, so an opening <blah> has to match a closing </blah> or compilation fails. As per the example run of it:

$ mix run examples/simple_xml.exs
Input a single line of xml-like syntax:
<test1>Some text<test2>Hi</test2> and more</test1>
Result: {"test1", ["Some text", {"test2", ["Hi"]}, " and more"]}

$ mix run examples/simple_xml.exs
Input a single line of xml-like syntax:
<a-tag>How about an improperly terminated tag</b-tag>
<unknown>:1:48: Expectation Failure: literal `a-tag` did not match the input
        RuleStack: [tag, node_]
        Input: b-tag>

$ mix run examples/simple_xml.exs
Input a single line of xml-like syntax:
<
<unknown>:1:1: Parse error: Repeating over a parser failed due to not reaching the minimum amount of 1 with only a repeat count of 0
        RuleStack: [text, node_]
        Input: <

And it can easily be decorated to transform error messages into something more domain-specific like closing tag does not match opening tag or something as well by decorating the PEG commands with another. All single pass. This is why I say that state handling is so extremely important in PEG parsers and why all of them have some form of state handling. I can trivially make a calculator in ExSpirit that let’s you define variables, calculate, etc… all in a single pass. I’ve already made a single pass Lisp (with macro’s and read-macro!) and Forth parsers in ExSpirit and all.

ExSpirit’s goal was supposed to be a Traditional PEG Parser, I tried to make it fast (and it was faster than all others in Elixir until NimbleParsec came along) but accuracy was my first goal (and there are still a couple of niggles I’m not happy with that Parserx was supposed to fix as well).

Make sure to bump the major version! ^.^

Ooo, very cool! I really should flesh out the Parserx branch sometime… ^.^;

Eh, almost, it’s missing a couple of features still, like expect is absolutely necessary to have proper errors at the correct locations in many cases and state handling is extremely important in many types of parsers or you can’t get context-sensitive parsers to work without multiple passes (which often makes errors significantly more inaccurate as well as overall slower). Maybe a couple other things but those are the biggies.

josevalim · December 12, 2018, 5:20pm

I am going by the definition here: Parsing expression grammar - Wikipedia

If you have some time, can you please expand on what I am missing? If you could provide a reference, even better!

We state-handling via traverse and repeat_while. I.e. it is possible to pass a context around, with your own state, as nimble parsec operates. What am I missing here too?

tmbb · December 12, 2018, 5:41pm

@OvermindDL1 is mixing a couple things that make his post confusing. A PEG parser is something that abides by the definition above. PEG parsers don’t need to keep state and they don’t need to recognize context-sensitive languages.

I believe they also don’t require the expect combinator to be considered PEG parsers. It’s just something that’s nice to have. Currently, NimbleParsec handles all errors as something that says that you should try the next parser. But sometimes, you might have an error in which you want to stop parsing altogether instead of keep trying. That saves you time (because you interrupt the parsing process) and gives better error messages. Basically, expect is a convenience combinator which raises an exception which halts parsing completely (and could even be implemented as such).

So, there are two things here:

State handling, à la Spirit, to recognize things like indentation, matching XML tags, etc
The expect handler

None of which are strictly required in a PEG parser. The first makes your parser more powerful than a PEG parser and the second one is just a convenience.