XPeg - Powerful Elixir PEG parser generator library

Hereby I’d like to announce a new library I’ve been working on over the last few months: XPeg.

XPeg is a pure Elixir pattern matching library. It provides macros to compile PEG grammars to an Elixir function which will parse a string and capture selected parts of the input. PEGs are not unlike regular expressions, but offer more power and flexibility, and have less ambiguities. More about PEGs on wikipedia.

Some use cases where XPeg is useful are configuration or data file parsers, robust protocol implementations, input validation, lexing of programming languages or domain specific languages.

XPeg allows mixing of the grammar definition with Elixir functions that act on the matched data, allowing for powerful and concise AST generating parsers.

Example

Below is a simple grammar definition that parses a comma separated list of key/value pairs into a list of tuples:

p = Xpeg.peg :dict do
  :dict <- :pair * star("," * :pair) * !1
  :pair <- :word * "=" * :number * fn [a,b | cs] -> [{b,a} | cs] end
  :word <- cap(+{'a'..'z'})
  :number <- cap(+{'0'..'9'}) * fn [v | cs] -> [String.to_integer(v) | cs] end
end

This grammar consists of the following rules:

  • The top level rule :dict matches one :pair, followed by zero-or-more instances of a , followed by a :pair
  • The :pair rule matches a :word followed by an = and a :number
  • The :word rule matches one-or-more characters from the set {'a'..'z'}
  • The :number rule matches one-or-more characters from the set {'0'..'9'}

Some rules are followed by elixir functions that convert or transform the captured data at parse time, resulting in the required AST syntax.

The grammar can be matched against the subject string using the Xpeg.match() function:

Xpeg.match(p, "grass=4,horse=1,star=2")

resulting in the following output:

[{"star", 2}, {"horse", 1}, {"grass", 4}]

Below are some links to more elaborate examples from the GitHub repository:

12 Likes

This looks really nice! I’m mostly just asking for conversation, but can you compare this with NimbleParsec? When would one prefer one over the other? Performance?

Good luck!

Honestly, I’m not at all acquainted with NimbleParsec, or any other available parsers for Elixir - I would have to dive into the alternatives to see how they would compare. One of the strengths of XPeg would be the concise way to build ASTs directly from the grammar, but other parsers might as well offer the same functionality. Performance is currently not great, as there is a number of possible optimizations that I have not yet implemented.

I initially wrote XPeg as an exercise to acquainted with Elixir metaprogramming, the implementation itself is basically a Elixir port of a (pretty much mature) Nim parser I made a a few years ago: NPeg, which is itself based on Lua’s LPeg.

Apologies for bumping my own thread, but I’d like to mention that Xpeg has learned some nice new tricks over the last few weeks. Most important changes:

  • Performance has been improved drastically; my typical benchmark is parsing JSON into Elixir maps and lists, which now runs at about half as fast as the highly optimized and blazing fast Poison parser.
  • Xpeg can now draw cool railroad graphs for the grammar that it is compiling, which is nice and helpful for understanding and debugging your grammars. for example:

This grammar fragment:

  Obj_pair <- S * String * S * ":" * Value                                                       
  Object <- "{" * (Obj_pair * star("," * Obj_pair) | S) * "}" 

Will be dumped like this railroad diagram at compile time:

                               ╭───────────»──────────╮                                            
Object o──'{'─»─┬─[Obj_pair]─»─┴─┬─","─»─[Obj_pair]─┬─┴──┬─»─"}"──o                                
                │                ╰─────────«────────╯    │                                         
                ╰─[S]────────────────────────────────────╯  

For more info, check the README on the Xpeg github repo

8 Likes

It looks like the latest release in hex has errors and warning with Elixir 1.16. It looks like the code is fixed in GitHub but not released in to hex yet.

Thanks for the heads up, 0.9.0 has just been released.

1 Like

Nice! And thanks. Xpeg looks pretty cool. I think I’m going to try it for my library Mozart - a somewhat new take on BPM. I want to create a very readable textual syntax for my current struct based syntax. It might take me a while.

The last time I did any serious parsing was using Bison and Flex. Since Flex took care of tokenization, you didn’t have to account for spaces in your parse rules. It seems like this is not the case for Xpeg, NimbleParsec, etc. Do I have that right?

Hello @zevv

I just tried to take 0.9.0 for a spin, but ran into a problem. I submitted an issue:

https://github.com/zevv/xpeg/issues/10

True. When parsing a document at the character level with a PEG parser, you need to explicitly handle the white space in your grammar. For some grammars this might sound like a bad thing (more work), but the advantage is that you can handle white space any way you want when it makes sense, for example when parsing indent-based languages.

In practice, I usually define a short symbol like S that matches one or more spaces, tabs or newlines and sprinkle that through my grammar.

That said: PEGs can be more flexible, if the implementation allows. XPeg is basically a port of my original nPeg parser for the Nim language [GitHub - zevv/npeg: PEGs for Nim, another take]. nPeg is able to parse grammars for sequences of any types, not only strings. I use this to create more classical lexer/parser stages: a first PEG grammar consumes characters and delivers a sequence of tokens, a second PEG grammar parses the tokens into an AST. I think it should be possible to extend XPeg to do something similar, if the need arises.