NimbleParsec - a simple and fast parser combinator for Elixir

Thank you @tmbb for explaining! I think the expect handler should be straightforward to implement. For the state stuff, I will wait for use cases as usual. :slight_smile:

The thing to watch out for is that, formally, a parser is just something which can recognise a grammar with particular properties. A PEG (Parsing Expression Grammar) is simply a language grammar which can be written as a combination of the operators on the Wikipedia page. Any practical context-free grammar will likely also be expressible as a PEG (though this is technically an open academic question).

The key thing to remember is that for the parser to be “complete”, it just needs to be able to take in a grammar (i.e. the definition the user writes for the parser) and the input, and either say “yes, this input follows the grammar” or “no, this input does not follow the grammar”. Hence my statement was “now NimbleParsec can recognise any PEG grammar”, which of course is a prerequisite for “NimbleParsec can use a PEG grammar to turn binary input into some useful data structure”, which @OvermindDL1 is referring to. Unfortunately, this latter problem is far less interesting to researchers, and so “what features are missing” is probably better answered by reference to other popular PEG-style parsers than to a paper or reference page :slight_smile:

This is indeed how I was planning to handle XML tag closing/opening. I can see that right now @OvermindDL1 is writing an answer, but my guess is that the answer will again involve the fact that these are not really composable with the rest of the parsing definitions, but rather require a specific external function to be called (if they were composable, you could make a version of every single base combinator which also depends on or modifies some state). There’s also the local/global state thing which ExSpirit has (I actually have no idea how that works or why it is useful though).


The academic PEG description is essentially the root math of it, but adhering to it strictly is a recipe for inefficiency and missing abilities. What the industry has learned over the past 20 years has produced a few patterns that have become necessities. :slight_smile:

State context handling: think of parsing out an integer at one node, passing it ‘into’ potentially many nodes deep, where it gets added to another parsed integer, which then gets returned ‘up’ a multitude of nodes. Where this is important is parsing a programming language where the syntax changes depending on a ‘type’ of something that got parsed or defined earlier. Think of this:

s = something()
s.blah

To determine what s.blah should be doing (say a map lookup, a module function invocation, etc.) depends on the type of s, which depends on the type of what something() is, which requires something() to have been typed prior (this is similar to how some ML languages work, though not OCaml, as it uses unique operators for everything). Now imagine it is not just a simple . call but some entirely different branching structure, so you have to dispatch to different parsers based on some runtime type information of the system. In a modern PEG parser that can be figured out during parsing without ever needing to re-dispatch into the parser. Take the TreeMap in ExSpirit: it is designed as it is because it can be stored in state, so it could start empty, but then someone defines something in the language called my int, and whatever comes after it should be parsed as an integer and passed into it; all you have to do is add it to the TreeMap and it ‘just works’. The TreeMap (my implementation is very much a direct translation of the ones I’ve written in C++; I could write it more efficiently in a more Elixir’y way) is also a common pattern in modern PEG libraries.

A great example of a modern PEG library is C++'s Spirit2, a rewrite of Spirit1 based on everything learned prior (which made it the fastest PEG library anywhere, because it can quite literally optimize its constructs in very interesting ways). If you want to see a PEG library that is in heavy use and crazy efficient, then that is the one to look at (Antlr is not bad, though it focuses more on streaming so it is not as efficient and thus has different design considerations). An extra library of Spirit components is available as well that contains some less generic and more specialized pre-built elements, like keyword handling (as I spoke of above), confix (basically for handling things like beginning/ending tags more succinctly), and distinct (like parsing if but not matching iffy based on a non-matching tail), among others, but all are built on the base Spirit library.

Spirit handles state far more cleanly than I do in ExSpirit (which I could emulate with more macro work). Essentially, given something like little_dword[_a = _1] >> advance(_a) >> little_dword, it parses a little-endian dword (Spirit works on generic binaries, not just text; think Elixir/Erlang binary parsing with a lot more), then takes the first element of the result (most things only have one output, and it fails to compile if you access an element that doesn’t exist for the given parser; it’s type safe), designated by the _1, and assigns it to a rule-specific ‘state value’ here named _a, which is then passed ‘into’ the advance component, which skips that number of bytes. I.e. this parses a dword-sized ‘size’ header like 00 00 00 04 and then skips that many bytes to skip over the record, then parses another little-endian dword. (This you can do in Elixir binary matching, hence why I’m using this example: Elixir/Erlang binary matching/parsing has similar features to modern PEG state handling.)
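To make that concrete, here is a minimal Elixir sketch of the same idea using plain binary matching (the module and function names are just for illustration): read a little-endian dword size header, skip that many bytes, then read another little-endian dword.

defmodule SizeHeader do
  # `size` plays the role of Spirit's `_a`: a value parsed at one node
  # that controls how much input the next node consumes.
  def parse(<<size::little-unsigned-integer-size(32), rest::binary>>) do
    <<_skipped::binary-size(size), next::little-unsigned-integer-size(32), rest::binary>> = rest
    {:ok, next, rest}
  end
end

# SizeHeader.parse(<<4::little-32, "abcd", 7::little-32>>)
# #=> {:ok, 7, ""}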

The advance(_a) is passing the data ‘into’ the rulestack; it could be your own rule that takes the argument and uses it inside as it wishes as well, and it returns data by packing it into result elements. This is why ExSpirit allows you to store rule-specific named state that doesn’t exist in inner rules and isn’t passed up, while you can return data via the result and pass data ‘into’ via arguments, modeling the same as C++'s Spirit.

This page on Spirit’s docs talks about the basic PEG setup, which is basically just:

  • Sequences: Parse the first then the second
  • Alternatives: Parse the first, if it fails then parse the second
  • Loops: Parse a number of times until the parser fails then concat all successful parses up to then
  • Difference: Parse something that does not match something else (this is the basis of lookahead; see the naive sketch below).
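
In naive Elixir terms (a sketch of the math, not any library’s API; a parser here is just a function returning a hypothetical {:ok, results, rest} or :error):

defmodule NaivePeg do
  # Sequence: parse the first, then the second on what remains.
  def seq(p1, p2) do
    fn input ->
      with {:ok, r1, rest} <- p1.(input),
           {:ok, r2, rest} <- p2.(rest) do
        {:ok, r1 ++ r2, rest}
      end
    end
  end

  # Alternative: if the first fails, try the second on the same input.
  def alt(p1, p2) do
    fn input ->
      case p1.(input) do
        {:ok, _, _} = ok -> ok
        :error -> p2.(input)
      end
    end
  end

  # Loop: repeat until failure, concatenating the successful results.
  def many(p) do
    fn input ->
      case p.(input) do
        {:ok, r, rest} ->
          {:ok, more, rest} = many(p).(rest)
          {:ok, r ++ more, rest}

        :error ->
          {:ok, [], input}
      end
    end
  end

  # Difference: succeed with p only if q does not match first. Note the
  # work q does is thrown away on every call, which is the inefficiency
  # mentioned below.
  def diff(p, q) do
    fn input ->
      case q.(input) do
        {:ok, _, _} -> :error
        :error -> p.(input)
      end
    end
  end
end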

Now these are not always the most efficient; Difference’s performance in particular is pretty abysmal (in the naive form above, the failed parse is pure wasted work), hence why specialized things like lookahead are made.

Spirit splits up the ‘types’ of its components into a few kinds:

  • Primitives: A simple thing like int_ that parses and returns data; it can potentially take arguments (like int_ accepts things like size, range, radix, etc.) but it does not accept any components as arguments.
  • Compound: Like primitives but they also accept other components, so this would be like a repeat operator (repeat in ExSpirit, * or + in C++ Spirit) or even the basic sequence (which takes multiple ordered components by default, seq or |> in ExSpirit and >> in C++).
  • Rule: A collection of components, scoped; this is like ‘naming’ a set of components. Essentially it corresponds to a function in C++, except it is defined in the type system instead of as a runnable function (defrule in ExSpirit, rule<Iterator, return_type(rule, args, here)> in C++). This also provides a point to separate the code so the compiler can choose whether to inline it or not where used (whereas all compound and primitive components are force-inlined for efficiency). Say you made a rule like rule<Iterator, vector<int>()> ints = int_ % ',' in C++ Spirit that parses a comma-separated list of integers and returns them in an array, named ints; then you could use it elsewhere like any other component, like "list: " >> ints >> eol or so (see the rough NimbleParsec analogue after this list). If the compiler doesn’t optimize it then there is a slight cost in runtime speed, but using more rules does reduce compilation times.
  • Grammar: This is a collection of rules, it adds a full synchronization point, grammars are not eligible for being inlined across other grammars. Generally you only have one grammar but if you have a huge language to parse then you can split up multiple grammars to add points where the compiler is not allowed to optimize across, which can SIGNIFICANTLY improve compilation time in complex parsers at the cost of a bit of runtime speed.
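
For a rough NimbleParsec analogue of this taxonomy (my own simplified parallel, not anything from Spirit’s docs):

defmodule Analogy do
  import NimbleParsec

  # "Rule": defcombinatorp names a combinator and compiles it into its own
  # function clauses, a boundary much like a Spirit rule. Inside it,
  # integer/1 is a "primitive" (takes options, no combinators) while
  # repeat/2 and |> are "compound" (they take other combinators).
  defcombinatorp :ints, integer(min: 1) |> repeat(ignore(string(",")) |> integer(min: 1))

  # Reused elsewhere like any other component, akin to "list: " >> ints >> eol:
  defparsec :list, ignore(string("list: ")) |> parsec(:ints) |> eos()
end

# Analogy.list("list: 1,2,3") #=> {:ok, [1, 2, 3], "", %{}, {1, 0}, 11}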

So the basic syntax of a component is component_name or component_name() (either works, though if it has arguments to pass in then you must use the component_name(args...) form).

Every component can have an associated semantic action delineated by [], so something like int_[&blah] will call the function blah with the result of the parsed integer if blah is typed as int blah(int); if blah is instead typed as int blah(context&, int) (maybe the other order, I forget…) then it gets the full context data passed in, so it can access locally defined state variables, parse location (column/line/etc.) and a lot more information. The above [_a = _1] is essentially just defining an inline lambda function of [[](auto context, auto val) {context.putState<_a>(val)}] or something like that (that’s not a Spirit feature, rather a Phoenix library feature, which is very often used with Spirit due to the succinctness of its lambdas). A semantic action can be any ‘callable’ type that fits any of the proper signatures (or it fails to compile).

The basic set of components included with Spirit directly (either to significantly reduce repeated code or for efficiency reasons, because the basic PEG constructs are way too inefficient) is mostly defined here. The usual ‘efficient’ ones like character, string, and number parsers are of course included, among others, but the especially interesting ones are the parser directives, which include things like:

  • lexeme: disables the skip parser but pre-skips
  • no_skip: disables the skip parser with no pre-skip
  • raw: Parses however but returns the entire range of source text instead, regardless of the returned values. Without this, parsing twice would be necessary.
  • expect: Throws an exception with the currently set error if there is an error, otherwise a no-op. Without this, errors would not be accurate in many cases.
  • repeat: Min/max-specific repeating call. Without this, input can be repeatedly parsed when it was not necessary.
  • Among many others.

Spirit also defines the concept of a skip parser: a parser passed into the parse call to act as an omitted skipper that is auto-placed between every terminal component (a component that parses a base thing like int_; something like repeat would run the skip parser between each iteration and before the first). This is often used with spaces, for example: parse(input.begin(), input.end(), int_ % ',', blank) would parse 1 , 2 , 3, 4,5,6 , 7 the same as 1,2,3,4,5,6,7, but without the blank skip parser only 1,2,3,4,5,6,7 would parse and any spaces would fail. This is generically useful for a lot of things, and implementing it in user code otherwise means adding a whole LOT of needless noise in every single parser that is created. The skip parser can be prevented from running for a given component via lexeme/no_skip in C++ Spirit.
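
The idea in plain Elixir terms (a generic sketch, not any library’s actual API) is just wrapping every terminal so a skipper runs first:

defmodule Skipping do
  # parser and skipper are functions of input -> {:ok, result, rest} | :error.
  # Returns a new parser that first lets the skipper consume what it can,
  # then runs the real parser on what remains.
  def with_skipper(parser, skipper) do
    fn input ->
      input =
        case skipper.(input) do
          {:ok, _ignored, rest} -> rest
          :error -> input
        end

      parser.(input)
    end
  end
end

A combinator compiler applies this wrapping to every terminal (and between loop iterations) automatically, which is what keeps that noise out of user code.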

So yes, in general the PEG ‘spec’ is simple and doesn’t have much, but it is also crazy inefficient; modern PEG libraries have found patterns that work around it to get a more useful base set of components.

Another thing Spirit can do is let you define a ‘parser’ that runs the other way: you feed it the data structure of whatever internal format you have and it emits a binary stream. That is not a feature I have added to ExSpirit yet, but it is open to it.

Basically yeah, there is a difference between PEG the spec and PEG as implemented in modern libraries; I tend to use what is done in modern libraries, as that is the part that is actually useful considering how dreadfully inefficient the base spec is.

Correct, but it is such a universal ‘convenience’ that it is often baked into about every library as it can both substantially improve error messages while also improving performance.

Eh, the first (state handling) is not so much a convenience; it is what allows PEGs to go from context-free parsers to context-sensitive. There are many ways to handle state, some more efficient than others of course.

Efficient XML parsing is the traditionally ‘simple’ use case: it is context-sensitive to match the closing tag with the opening tag, otherwise you need a second pass over the complete data. Now the second pass (like using a semantic action in C++ Spirit) can of course be used to enforce this specific case, but in actual programming languages it is often common for the parser to ‘change’ based on state; the XML example is a simplified form of that, where the parser changes (what the ending tag matches) based on prior state (what the opening tag was parsed as). Being able to do that without the ending pass, i.e. being able to change the actual parser functionality based on runtime state, is highly important for parsing most languages efficiently (otherwise you tend to have to go to a multi-stage parser where you first tokenize, then fix up the tokens, then generate the AST, then convert the AST to the internal format, whereas modern PEG libraries can generate the internal format directly without needing to generate the huge amount of intermediate data).

Correct, you’ll have to verify the end tag via a function after the parsing of the (anything) end tag is complete; simple in the XML case, but not so in more complex cases like parsing branches differently based on the types of parsed data.

Super popular with interpreters, like implementing a mini language that calculates the values as it is parsed, so the returned parsed value is the result of the calculation, regardless of any variables set, scopes, etc. Local state is being able to hold data within a specific rule, like what the value of the parsed opening tag was in XML, so you can reuse it to adjust the parser for the closing tag.

To expand on this, say a parser for the Elixir language was made in C++ Spirit or ExSpirit; you could create a new ‘tag’ function or component that wraps the result in a 3-tuple like Elixir’s AST, with the middle context metadata including the line and column parsing information, perhaps the byte location in the input stream, etc. This is something not currently possible in NimbleParsec as it doesn’t have such context information.

This is possible in NimbleParsec! You can add line and column metadata with no problems; it has been there since about the second version came out. Are you sure you’ve looked at the most recent docs? You can also add information from the context if you wish. Tagging the generated AST with NimbleParsec, although not as convenient as with ExSpirit (IMO), is perfectly possible.

Here is an example: nimble_parsec/examples/simple_xml.exs at master · dashbitco/nimble_parsec · GitHub


The easiest way to explain what’s missing is to look at the spec of the post_traverse function (the new name for the old traverse function): post_traverse(combinator \\ empty(), to_post_traverse, call). The call argument is a function which takes the parser state as arguments (including the context, of course) and must return a 2-tuple of the form {acc, context}, where context is the new context and acc is the result you append to the list of results. This does not allow you to consume any input. You can do things to the result depending on the input that has already been consumed and on the input you haven’t yet consumed, but you can’t consume any new input.
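
To make that contract concrete, a minimal sketch (the callback shape is as described above; the combinator itself is just an illustration):

defmodule Doubler do
  import NimbleParsec

  defparsec :int, integer(min: 1) |> post_traverse(:double)

  # Receives the full parser state, but may only rewrite the accumulated
  # results and the context; the remaining input cannot be touched.
  defp double(_rest, [n], context, _line, _offset), do: {[n * 2], context}
end

# Doubler.int("21abc") #=> {:ok, [42], "abc", %{}, {1, 0}, 2}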

This means you can’t change the input you’re consuming based on the context. For NimbleParsec to allow context-sensitive combinators, the call function would have to build a new combinator at runtime and apply it to the rest of the input. Let’s say you want to parse a grammar of the form <n>:<n integers separated by commas>. An example is this: 5:1,4,55,6,7. To parse this, you have to consume the character ?5 and then build a new parser that recognizes exactly 5 integers (it could fetch the number 5 dynamically from the context). This is an example of a context-sensitive language, and one which I think NimbleParsec can’t handle currently.

Now, implementing a state system like this sounds like a lot of work, because you’d have to have “dynamic” versions of the combinators (besides the “static” ones you already have). This complicates the design a lot, and stateful combinators are much harder to optimize into the kinds of pattern matches that NimbleParsec is using now.


Yeaaaaaaaaah… doing this would be really tricky.

We could have a dynamic choice, where you could choose at runtime which combinator to dispatch to:

dynamic([
  foo: combinator1,
  bar: combinator2
], mfa)

And the mfa returns either :foo or :bar, but it is still limited because the combinators pointed at by :foo and :bar still aren’t dynamic, so you can’t say something should happen exactly 5 times. You would still have to write the code in a way that parses n entries and then checks afterwards (using post_traverse) that you had exactly 5 entries, as in the sketch below. I have no idea if this is good enough though.
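
For example, the check-afterwards version of the 5:1,4,55,6,7 grammar from above might look something like this under today’s API (an untested sketch, names are mine):

defmodule CountedInts do
  import NimbleParsec

  defparsec :counted,
            integer(min: 1)
            |> ignore(string(":"))
            |> wrap(integer(min: 1) |> repeat(ignore(string(",")) |> integer(min: 1)))
            |> post_traverse(:check_count)

  # The count check can only happen after all the entries were consumed.
  defp check_count(_rest, [ints, count], context, _line, _offset) when length(ints) == count,
    do: {[ints], context}

  defp check_count(_rest, [ints, count], _context, _line, _offset),
    do: {:error, "expected #{count} integers, got #{length(ints)}"}
end

Note the downside: for an input like 5:1,4,55,6,7,8,9 this consumes all seven integers before failing, instead of stopping after five.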

Thanks @tmbb!

Exactly, only it needs to be a little more general:

dynamic(fn context -> my_dynamic_combinator(context) end)

And somehow the returned combinator must be interpreted at runtime by the parser (instead of compiled into something much more efficient). With a slight change to post_traverse (or with the addition of a new “primitive” combinator) I believe I could implement context-sensitive parsers as “user-level” code.

I might work on it if I get some free time…

Apparently I have not; when I wrote my benchmark, where NimbleParsec bested ExSpirit handily, it did not exist yet! Woot!

To be honest I find that file pretty long and unreadable; I’m not sure that’s a good example. Lots of seemingly random functions doing things that the parser should do, and so forth. I’m pretty sure I could write a set of matcher heads to parse the same in significantly shorter space? Hmm, let me write a parser the way I usually did in Erlang:

defmodule SimpleXML do
  def parse(inp) do
    # The context tuple is {remaining_input, line, column, byte_offset}.
    {rest, [xml]} = xml({inp, 1, 1, 0})
    {:ok, xml, rest}
  catch s when is_binary(s) -> {:error, s}
  end

  def xml({"<" <> inp, l, c, o}) do
    {inp, name} = tag({inp, l, c+1, o+1})
    xml_head(inp, name)
  end
  def xml(inp), do: {inp, []}

  def xml_head({">" <> inp, l, c, o}, name), do: xml_rest({inp, l, c+1, o+1}, name)

  def xml_rest(inp, name, children \\ [])
  def xml_rest({"</" <> inp, l, c, o}=ct, name, children) do
    {inp, ^name} = tag({inp, l, c+2, o+2})
    xml_done(inp, name, children)
  rescue MatchError -> err(ct, "Invalid closing tag does not match start tag of #{name}")
  end
  def xml_rest({"<" <> _inp, _l, _c, _o} = ct, name, children) do
    {ct, xml} = xml(ct)
    xml_rest(ct, name, xml ++ children)
  end
  def xml_rest({<<_::integer-size(8), _::binary>>, _l, _c, _o} = ct, name, children) do
    {ct, txt} = text(ct)
    xml_rest(ct, name, txt ++ children)
  end
  def xml_rest(ct, name, _children), do: err(ct, "XML Parsing failure at #{name}")
  
  def xml_done({">" <> inp, l, c, o}, name, children) do
    {{inp, l, c+1, o+1}, [{name, [line: l, column: c, offset: o], :lists.reverse(children)}]}
  end
  
  def tag(inp, t \\ "")
  def tag({<<ch::integer-size(8), inp::binary>>, l, c, o}, t) when ch in ?a..?z or ch in ?A..?Z, do: tag({inp, l, c+1, o+1}, <<t::binary, ch::utf8>>)
  def tag(inp, t) when byte_size(t)>0, do: {inp, String.to_atom(t)}
  def tag(ct, _t), do: err(ct, "Empty tag not allowed")
  
  def text({inp, l, c, o}), do: text(inp, inp, l, c, o)
  def text(orig, "\n" <> inp, l, _c, o), do: text(orig, inp, l+1, 1, o+1)
  def text(orig, "<" <> _ = inp, l, c, o), do: text_done(orig, inp, l, c, o)
  def text(orig, "", l, c, o), do: text_done(orig, "", l, c, o)
  def text(orig, <<_::integer-size(8), inp::binary>>, l, c, o), do: text(orig, inp, l, c+1, o+1)
  
  def text_done(oinp, inp, l, c, o) do
    offset = byte_size(oinp) - byte_size(inp)
    <<result::binary-size(offset), _::binary>> = oinp
    {{inp, l, c, o}, [result]}
  end
  
  def err({inp, l, c, _o}, msg), do: throw "ERROR at #{l}:#{c}: #{msg}\nTrailing Data: `#{String.slice(inp, 0, 24)}`"
end
inputs = [
  "<foo></foo>",
  "<foo><bar></bar></foo>",
  "<foo>bar</foo>",
  "<foo><bar>baz</bar></foo>",
  "<foo><bar>one</bar><bar>two</bar></foo>",
  "<>bar</>",
  "<foo>bar</baz>",
  "<foo>bar</foo>oops",
  "<foo>bar"
]
for input <- inputs do
  IO.puts(input)
  IO.inspect(SimpleXML.parse(input))
  IO.puts("")
end

Running it gives:

<foo></foo>
{:ok, {:foo, [line: 1, column: 11, offset: 10], []}, {"", 1, 12, 11}}

<foo><bar></bar></foo>
{:ok,
 {:foo, [line: 1, column: 22, offset: 21],
  [{:bar, [line: 1, column: 16, offset: 15], []}]}, {"", 1, 23, 22}}

<foo>bar</foo>
{:ok, {:foo, [line: 1, column: 14, offset: 13], ["bar"]}, {"", 1, 15, 14}}

<foo><bar>baz</bar></foo>
{:ok,
 {:foo, [line: 1, column: 25, offset: 24],
  [{:bar, [line: 1, column: 19, offset: 18], ["baz"]}]}, {"", 1, 26, 25}}

<foo><bar>one</bar><bar>two</bar></foo>
{:ok,
 {:foo, [line: 1, column: 39, offset: 38],
  [
    {:bar, [line: 1, column: 19, offset: 18], ["one"]},
    {:bar, [line: 1, column: 33, offset: 32], ["two"]}
  ]}, {"", 1, 40, 39}}

<>bar</>
{:error, "ERROR at 1:2: Empty tag not allowed\nTrailing Data: `>bar</>`"}

<foo>bar</baz>
{:error,
 "ERROR at 1:9: Invalid closing tag does not match start tag of foo\nTrailing Data: `</baz>`"}

<foo>bar</foo>oops
{:ok, {:foo, [line: 1, column: 14, offset: 13], ["bar"]}, {"oops", 1, 15, 14}}

<foo>bar
{:error, "ERROR at 1:9: XML Parsing failure at foo\nTrailing Data: ``"}

And I find it both a lot shorter and more readable. Here is the ExSpirit version in comparison:

defmodule SimpleXML do
  use ExSpirit.Parser, text: true

  defrule text( chars(-?<) )

  defrule tag_name( chars([?a..?z, ?A..?Z, ?0..?9, ?_, ?-]) )

  defrule tag(
    lit(?<) |> tag_name() |> put_state(:tagname, :result) |> lit(?>) |> expect(seq([
      get_state_into(:tagname, tag(&1, repeat(node_()))),
      lit("</"), get_state_into(:tagname, lit(&1)), lit(?>)
    ]))
  )

  defrule node_(
    alt([
      tag(),
      text(),
    ])
  )

  def from_string(input) do
    parse(input, node_())
  end
end

Which I find even shorter still and significantly more readable than both of the prior. If the parser doesn’t gain you anything over doing it just manually via normal pattern matching, then why use the library?

Hmm…


defmodule Testering do
  use ExSpirit.Parser, text: true

  def parse(input) do
    parse(input,
      uint()
      |> put_state(:count, :result)
      |> lit(?:)
      |> seq([uint(), get_state_into(:count, repeat(lit(?,) |> uint(), &1-1, &1-1))])
    )
  end
end
Testering.parse("5:1,4,55,6,7")
Testering.parse("5:1,4,55,6,7,8,9")

Result being:

iex(21)> Testering.parse("5:1,4,55,6,7")
%ExSpirit.Parser.Context{
  column: 13,
  error: nil,
  filename: "<unknown>",
  line: 1,
  position: 12,
  rest: "",
  result: [1, 4, 55, 6, 7],
  rulestack: [],
  skipper: nil,
  state: %{count: 5},
  userdata: nil
}
iex(22)> Testering.parse("5:1,4,55,6,7,8,9")
%ExSpirit.Parser.Context{
  column: 13,
  error: nil,
  filename: "<unknown>",
  line: 1,
  position: 12,
  rest: ",8,9",
  result: [1, 4, 55, 6, 7],
  rulestack: [],
  skipper: nil,
  state: %{count: 5},
  userdata: nil
}

As you can see in the rest, it didn’t parse anything beyond the five counted numbers, leaving the remainder for later parsers. And with a simple alt I can make it handle the 0 case as well (assuming that should even be a valid parse).

  iex(1)> import ExSpirit.Parser
  iex(2)> import ExSpirit.Tests.Parser
  iex(3)> alias ExSpirit.TreeMap, as: TreeMap
  iex(4)> symbol_TreeMap = TreeMap.new() |> TreeMap.add_text("int", &uint(&1)) |> TreeMap.add_text("char", &char(&1))
  iex(5)> context = parse("int42", symbols(symbol_TreeMap))
  iex(6)> {context.error, context.result, context.rest}
  {nil, 42, ""}
  iex(7)> context = parse("charT", symbols(symbol_TreeMap))
  iex(8)> {context.error, context.result, context.rest}
  {nil, ?T, ""}

That’s the simple form of dynamic dispatch, you can even add and remove things from the treemap during parse time, including generating new parsers.

Although if you just want to generate a parser you can do that in a normal defrule anyway (I should add Spirit’s lazy component to simplify that…).

That specific function is called lazy in C++'s Spirit. :slight_smile:

Eh, it should just be an anonymous function call in terms of cost I’d think?

Yes, I’m perfectly aware of how trivial it is to do this in ExSpirit. After all, I’ve used the context-sensitive features of ExSpirit to write a pretty cool HTML lexer which was capable of highlighting matching opening and closing tags, something which seems impossible with NimbleParsec. The question is how to implement these features in NimbleParsec’s architecture.

I guess so. But the main problem is not performance. The way NimbleParsec works is by generating an AST for a parser and then compiling the whole thing. It makes use of the fact that some things are static and others are dynamic to optimize code generation (last time I looked; it might have changed). With something like this, you have to have an anonymous function that generates a combinator at runtime, which is something completely different from what NimbleParsec is doing.

It might require you to have two classes of components, some static and some dynamic. But then again, it might not. I could see something like:

integer()
|> convert_to_integer()
|> put_state(:count, :result)
|> ...
|> get_state_into(:count, something)

where something is a combinator that (statically) returns an expression, which the get_state_into macro would compile into something different. Basically get_state_into would be an alternative compiler for the same kind of expressions which are compiled by defparsec.

If you’re smart when coding get_state_into, you can optimize things so that most of the work is done at compile time instead of at runtime. I haven’t seen how ExSpirit does it though (I’ll look at it ASAP, the code seems very simple).

I’ll admit that I haven’t thought this through completely, but would it not be possible to implement something like the n:x,x,x,x,x case by first reading n, putting the parsed integer into the parser state (the context), and then simply having a construct which looked something like this: put_0_into_context_as_x() |> repeat(assert_x_less_than_n() |> integer() |> increment_x_in_context())? That would use dynamic state to fail or succeed a parser dynamically, without actually generating any new parsing trees at runtime. In general, for more dynamic things, you could also generate a choice, some of whose clauses would fail immediately depending on the current context. It seems at least like a lot of common cases could be implemented like this, but perhaps I’m missing some limitation.
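
In fact, NimbleParsec’s repeat_while already lets a function halt a loop based on the context between iterations, so an untested sketch of the n-integers case might look like this (helper names are mine):

defmodule NInts do
  import NimbleParsec

  defparsec :counted,
            integer(min: 1)
            |> post_traverse(:store_count)
            |> ignore(string(":"))
            |> repeat_while(
              choice([ignore(string(",")) |> integer(min: 1), integer(min: 1)])
              |> post_traverse(:decrement),
              {:still_counting, []}
            )

  # Stash n in the context instead of emitting it as a result.
  defp store_count(_rest, [n], context, _line, _offset),
    do: {[], Map.put(context, :left, n)}

  defp decrement(_rest, acc, context, _line, _offset),
    do: {acc, Map.update!(context, :left, &(&1 - 1))}

  # Checked before each iteration: stop once n integers were consumed.
  defp still_counting(_rest, %{left: n} = context, _line, _offset) when n > 0,
    do: {:cont, context}

  defp still_counting(_rest, context, _line, _offset),
    do: {:halt, context}
end

# NInts.counted("5:1,4,55,6,7,8,9")
# #=> {:ok, [1, 4, 55, 6, 7], ",8,9", %{left: 0}, {1, 0}, 12}

The set of combinators is still fixed at compile time, of course, so this only covers the “fail or succeed based on state” cases, not generating new parsers at runtime.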

You don’t have to generate new parse trees at runtime. You just have to return a combinator at runtime. I really hope you can do that without generating and compiling an entire parsing tree! My idea (maybe I didn’t explain myself properly) didn’t generate a tree at runtime. It would generate a tree at compile time and compile that into a higher order function which returns a combinator. Or something like that.

I’d like to avoid solutions of the form “a lot of common cases”, because if you’re working only on the common cases you might be locking yourself into a place you can’t escape later. I’m a little short on time, but my basic idea is the following:

  1. Write a version of post_traverse, which returns a “full” result tuple, not only a {acc, context} pair. That is, turn post_traverse into something closer to a normal combinator

  2. That means it could consume input based on the function passed as an argument (which it can’t do now!)

  3. Because the function you pass into post_traverse can be literally anything, you can then implement this functionality as “user-level” code outside of nimble_parsec.

The way you implement that is not very relevant. The most important thing is that you’d have an avenue to explore compilation of context-sensitive parsers without any changes to nimble_parsec’s core. At least that’s the basic idea; I don’t have the time to code it right now. EDIT: I’m actually a little sleep deprived, so just because this seems like a good idea it doesn’t mean it’s actually a good idea.

Yeah… that’s a no-no because it pretty much defeats the point of NimbleParsec. :slight_smile:

I am pretty sure I said this before, but I don’t mind it. But let’s be honest here: your hand-written example is not really shorter than NimbleParsec’s. The NimbleParsec one was written with readability in mind. If we inline NimbleParsec’s code (using a similar style to yours) and run the formatter on both, NimbleParsec is definitely shorter. And if we update NimbleParsec to also use the new eos stuff and whatnot, then it comes down to this:

defmodule SimpleXML do
  import NimbleParsec

  def parse(xml, opts \\ []) do
    opts = Keyword.put(opts, :context, %{tags: []})
    xml(xml, opts)
  end

  tag = ascii_string([?a..?z, ?A..?Z], min: 1)
  text = ascii_string([not: ?<], min: 1)

  opening_tag = ignore(string("<")) |> concat(tag) |> ignore(string(">"))
  closing_tag = ignore(string("</")) |> concat(tag) |> ignore(string(">"))

  defcombinatorp :node,
                 opening_tag
                 |> post_traverse(:store_tag_in_context)
                 |> repeat(lookahead_not(string("</")) |> choice([parsec(:node), text]))
                 |> wrap()
                 |> concat(closing_tag)
                 |> post_traverse(:check_close_tag_and_emit_tag)

  defparsec :xml, parsec(:node) |> eos()

  defp store_tag_in_context(_rest, [tag], %{tags: tags} = context, _line, _offset),
    do: {[tag], %{context | tags: [tag | tags]}}

  defp check_close_tag_and_emit_tag(_rest, [tag, [tag | contents]], context, _line, _offset) do
    context = update_in(context.tags, &tl/1)
    text_or_nodes = if match?([_], contents), do: hd(contents), else: contents
    {[{String.to_atom(tag), [], text_or_nodes}], context}
  end

  defp check_close_tag_and_emit_tag(_rest, [opening, [closing | _]], _context, _line, _offset),
    do: {:error, "closing tag #{inspect(closing)} did not match opening tag #{inspect(opening)}"}
end

Although I wouldn’t personally write in the style above.

But of course, the ExSpirit one is remarkably concise, which is expected as the libs have different goals.

Maybe, BUT this is just for the context-sensitive combinators. I believe that this can be made to work without any changes to the “normal” combinators. It makes sense to me that stateful parsing will be less efficient than the highly optimized things that NimbleParsec does now. That’s the cost of having a more powerful parser.

As long as this doesn’t make the rest of NimbleParsec less efficient, I think it would make sense to include it. But my bet is that you wouldn’t need to. I think I can add a new combinator to NimbleParsec which will allow users to implement context-sensitive parsers without any changes to NimbleParsec.

Also, don’t think of “interpreting the parser” as something very slow and inefficient. You don’t need to interpret the parsing tree at runtime (I hope…). As @OvermindDL1 said, it might just cost the same as a function call, which is perfectly acceptable.


Maybe I am completely missing the approach you have in mind but NimbleParsec emits function clauses. To evaluate them, you would need to either transform those clauses into cases and control the state machine yourself OR write a hand-written evaluator for each combinator.

If we are interested in going in this direction, then I would rather try the approach I mentioned earlier, where we teach each existing combinator how to read things from context, because that would definitely be less work. So you could do: times(..., min: :context_key) and we get the min from the context, which would mix well with the dynamic combinator proposed earlier.

EDIT: You can also always define your custom functions too, that don’t know about nimble_parsec at all, and hook them in with the parsec(:function_name) combinator.

I’m aware of that. That’s why you’d have to expand the AST inside a macro so that the macro could compile the AST into something else, instead of function clauses; remember that anything given to a macro can be expanded and compiled into something else. Although there might be some problems with variable contexts, so maybe this is not a good idea at all…
You’re probably right. Although I suggest a more “typesafe” approach in which you write something like this:

times(..., min: from_context(:key))

And even something like:

times(..., min: from_context(:key, fn k -> ... end))

The from_context/2 function would just return a tuple which the parsec compiler would know how to handle and compile; the function argument in the second form transforms the results prior to using them. This is very similar to what I do in vampyre! In both cases one has to try to optimize the static parts as much as possible and defer the evaluation of the dynamic ones so that they happen at runtime.


Yeah, that’s what I was trying to think of; I couldn’t think of a way without needing to write a second-stage function to post-process the parsed data, but that could easily parse ‘too much’ and waste time…
I actually find the pattern matching style extremely readable. It’s not always the easiest to follow because of how much it jumps around, but it is quite readable; the heads are encoding essentially all of the parsing. :slight_smile:

Ah, that is much more readable, but the check_close_tag_and_emit_tag function still seems very odd, like it itself is throwing the error instead of the parsing failing to parse. I’m curious: if it returns that error tuple, would the parser then try other alternate paths if they were defined?

It would be even more concise in C++ by a significant margin (ExSpirit’s is easily 3 to 4 times the size on average), but then again the backend of Spirit in C++ is exceptionally large (it’s essentially a compiler written in the C++ template meta-language, so that’s a place Elixir can be MUCH more readable!). ^.^

I had considered doing compile-time string parsing in ExSpirit at first but I thought I’d try to stick to the AST at least for version 1.

For parser libraries, though, I’ve always seen the purpose of all of them as having a very concise front-end for defining grammars, since grammars tend to be exceptionally hairy overall, so minimizing everything related to the grammar has always been well worth it. I don’t reach for ExSpirit to parse a simple integer list, but rather to parse complex language definitions. For shorter things, simple pattern matching is more effective and more readable than bringing in another dependency just to do some parsing in 3 lines instead of 15, but when the difference is 300 lines compared to 20000 then it is a significant difference.

For a stateful parser, this closing_tag would need to be able to take in an argument at parsing time and replace the concat(tag) part with a lit(&1), where &1 is the passed-in argument. With that simple change in the user-side definition, the great majority of the defined functions there vanish, from store_tag_in_context to all of check_close_tag_and_emit_tag except for its tuple-building part (and whatever text_or_nodes = if match?([_], contents), do: hd(contents), else: contents is doing; wherever the list version of contents comes from, it seems like the parser should handle returning it in a normalized format with the rest of the children). That’s a significant amount of user code that no longer needs to exist.

It makes sense to me that stateful parsing will be less efficient than the highly optimized things that NimbleParsec does now. That’s the cost of having a more powerful parser.

No reason for it not to be, except for just taking the time to build the ‘compiler’ to convert from the DSEL to a set of functions to handle it directly (instead of the macro-munging that ExSpirit does currently, i.e. what Parsex was going to do).

What C++'s Spirit ends up doing is taking the DSEL used in the type system (in Elixir that’s just grabbing the AST at compile time), using templates to walk the DSEL’s AST to generate a couple of sets of data of lifetime information for the ‘arguments’ used, holding the types, normalizing the return types, etc. (in Elixir that’s just walking the AST at compile time to generate data for making the functions). Then it runs optimization passes (basically a pattern matcher over the processed AST and values to find patterns to reduce to more efficient patterns), then lastly the templates build a single inlineable function per rule that passes the data via direct argument passing (in Elixir that’s basically what NimbleParsec is doing, but with the added information of arguments and more stuff to generate specific code for each type of thing). This goes right down to even something as simple as parsing characters, where the char_ parser gets any part of it replaced with specific values (say matching “a-z”) and any variables are injected in place; the C++ compiler then does its normal optimization of folding constants (since the template ‘processor’ injects constants in every place it can), which lets a call like parse(input.begin(), input.end(), char_ >> int_) end up taking literally a few dozen assembly instructions, all inline (if the input type is held in order in memory; otherwise it generates streaming calls and so forth, which is more, but still efficient).

This same pattern is entirely doable in Elixir (and was planned in ExSpirit, I just wanted something workable at first for my own work), it just needs to be done. :slight_smile:

As a side note, those ‘optimization passes’ mentioned above are why C++'s Spirit is and will always remain the fastest parser: anything any other parser does better, they can see how it generates the output assembly and make a new optimization pass for that pattern, to at least meet if not exceed the original. This is all while it is possible for users to create their own (even root terminal) constructs (a bit of a template horror, but thankfully it only ever needs to be done once). It’s entirely doable in Elixir much more simply. :slight_smile:

Which is why I didn’t enter the merit of readability (it is personal) and focused on code size. :slight_smile: And I just noticed we actually don’t use the tags stored in the context, so now the whole thing can be written as:

defmodule SimpleXML do
  import NimbleParsec

  defparsec :parse, parsec(:node) |> eos()

  tag = ascii_string([?a..?z, ?A..?Z], min: 1)
  text = ascii_string([not: ?<], min: 1)

  opening_tag = ignore(string("<")) |> concat(tag) |> ignore(string(">"))
  closing_tag = ignore(string("</")) |> concat(tag) |> ignore(string(">"))

  defcombinatorp :node,
                 opening_tag
                 |> repeat(lookahead_not(string("</")) |> choice([parsec(:node), text]))
                 |> wrap()
                 |> concat(closing_tag)
                 |> post_traverse(:match_and_emit_tag)

  defp match_and_emit_tag(_rest, [tag, [tag, text]], context, _line, _offset),
    do: {[{String.to_atom(tag), [], text}], context}

  defp match_and_emit_tag(_rest, [tag, [tag | nodes]], context, _line, _offset),
    do: {[{String.to_atom(tag), [], nodes}], context}

  defp match_and_emit_tag(_rest, [opening, [closing | _]], _context, _line, _offset),
    do: {:error, "closing tag #{inspect(closing)} did not match opening tag #{inspect(opening)}"}
end

Ah much nicer!

I’m curious: I’m guessing line is the line number, but what is offset? Is it the column number or the position in the overall file? Both values are useful for various purposes.