This library converts PEG grammars into NimbleParsec parsers. You can “hook” extra functions onto the combinators generated from the PEG language.
Unlike yecc/leex, the PEG grammar is extremely easy to read and write given the ABNF descriptions found in most RFCs and other standards documents (e.g. ECMA). It also leverages the extremely effective compile-time nature of the NimbleParsec library.
Wow, I didn’t realize I needed this library. Much, much easier to use than regular expressions. PEG seems like it would be an excellent addition to the core library as an alternative to Regex.
The following code is me just messing around trying to get things to work, but I hope it helps someone.
I don’t know what to tell you, I definitely don’t think I’ve gotten smarter, but I tried using these parsers (BNF/ABNF/lex) and I just couldn’t get them to do what I wanted. I remember my frustration very clearly because I had to spend a few days writing a custom parser.
However, for some reason, I seem to be able to use PEG just fine, so thank you for the library. Damn, I’d give you five likes for making the library, but I only have one to give. And nimble_parsec is also brilliant.
To amuse myself further with this discovery of PEG, I did some benchmarking on parsing emails. Turns out they are quite comparable in speed: PEG is a tiny bit faster (5-10%) and a lot more consistent at the P99. Hopefully there’s nothing really funky happening in the benchmark. I’ve put the code and benchmark way down below.
The ridiculous thing I found in this experiment is that the regex is not really capturing what I would expect. The PEG is a bit longer, but it’s fairly straightforward to understand and returns what I would expect. I grabbed the regex from Ultimate Regex Cheat Sheet - KeyCDN Support, but wrote the PEG for emails myself.
Looks like Python is changing its internal parser to PEG. From some cursory research, https://janet-lang.org/ has PEG by default instead of PCRE/regex in the standard library.
After all this exploration, I’m surprised Pegasus isn’t more popular. I’m guessing it might be because people don’t quite understand how to use it despite its awesomeness (or they don’t need to write grammars). Or it might be because it’s not so easy to do a mental-model replacement of Regex. It looks like a ton of people use nimble_parsec, though, so it’s probably just an ergonomics thing. Here are some of my first impressions from trying to get it working.
From the docs:
parser options [:collect, :token, :tag, :post_traverse, :ignore] work on elements of a PEG, but I didn’t figure that out until I read the code from the other repos.
parser options [:start_position, :export, :parser, :alias] seem to operate on the data and I’m not sure what they are for.
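For what it’s worth, here’s a minimal sketch of how the first group of options seems to attach per rule, based on the parser_from_string/2 shape used later in this thread (DigitsParser and the Number rule are illustrative names I made up, not anything from Pegasus itself):

```elixir
defmodule DigitsParser do
  require Pegasus
  import NimbleParsec

  Pegasus.parser_from_string(
    """
    Number <- [0-9]+
    """,
    # Options are keyed by rule name: :collect turns the matched
    # charlist into a binary, :tag wraps the result in {:number, [...]}.
    Number: [collect: true, tag: :number]
  )

  defparsec :parse_number, parsec(:Number)
end
```

With that sketch, parsing "42" should yield a tagged result like [number: ["42"]] rather than a bare charlist of character codes.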
I was just trying to figure out how to map the functions in Pegasus onto something I understood from other languages, i.e.:
Regex.capture
Regex.match
So I had to go through your other packages to understand that I actually wanted defparsec, because I first tried using [parser: true, export: true].
Regarding captures, I think the issues I had are just documentation-related. I ended up going through your other codebases to figure out that I needed to use [:tag, :collect, :post_traverse]. There’s a little more boilerplate than regex, but I don’t think that can be gotten rid of. Perhaps the default should be [collect: true], though? It was very confusing to see a series of character codes.
Regarding match, I added a bit of extra boilerplate to get the equivalent functionality of Regex.match.
It would definitely be nice to have both a match and a capture setup that work cleanly after putting in a grammar string.
defmodule EmailParser do
  # Module wrapper and imports added so the Pegasus/NimbleParsec macros
  # below have a home.
  require Pegasus
  import NimbleParsec

  @email_options [
    Name: [tag: :name],
    Domain: [tag: :domain],
    At: [ignore: true],
    TLD: [tag: :tld],
    Dot: [ignore: true]
  ]

  Pegasus.parser_from_string(
    """
    Email <- Name At (Domain)+ TLD
    Name <- ([a-zA-Z0-9_\\.\\-]+)
    Domain <- ([A-Za-z0-9\\-]+ Dot)+
    TLD <- ([A-Za-z\\.]) ([A-Za-z\\.])+
    Dot <- '.'
    At <- '@'
    """,
    @email_options
  )

  defparsec :parse_email, parsec(:Email)

  def peg_match_email(email) do
    case parse_email(email) do
      # require the whole input to be consumed (rest == "")
      {:ok, _result, "", _, _, _} -> :ok
      _ -> :error
    end
  end
end
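To show the Regex.match-style ergonomics, here is a hedged sketch of how the helper above might be called, assuming the definitions are wrapped in a module (EmailParser is an illustrative name, not from the library):

```elixir
# Assuming the grammar above compiled into a module named EmailParser:
EmailParser.peg_match_email("jane_doe@mail.example.com")
# the Email rule consumes the whole input, so this should return :ok

EmailParser.peg_match_email("not-an-email")
# there is no At/Domain/TLD to consume, so this should return :error
```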
Anyway, thank you and great work. I’m halfway done writing a parser for semi-structured text; if it’s useful, I could take some notes on what was tough to figure out.
I’ve literally read about two books on how to design an interpreter/compiler and built a primitive language in lex/yacc. If you asked me nowadays how I did it, I would have no answer; I don’t remember, the material just didn’t entirely click for me.
My interest got sparked when I understood how macros work in Elixir, and how powerful a concept that is. I’ve used those concepts in some interesting binary parsing libraries. The limitation, of course, is clear: the syntax must abide by Elixir parser rules, but in general this covers 99% of my DSL needs.
Taking into consideration that I also had courses at university covering the topic of grammars, I think the topic is either too academic or complex for my liking, or there is a steep learning curve until you correctly understand the practical application.
I am absolutely not a fan of regex; even to this day I try to avoid it whenever possible, and thank God for platforms that let you try regexes against inputs.
Just from the praise above, I will try PEG when I have some free time; it seems much more friendly compared to the alternatives.
Thanks to @ityonemo for the library and @tj0 for bringing this up on the forum. It sure looks like a better way than long, indecipherable regular expressions. I have two questions now that I’ve started exploring:
are there examples of how post_traverse works? (My attempts have only yielded an obscure CaseClauseError.)
I noted that there is a Pegasus.Components module. How is this supposed to be used?
Sounds like a bug if you’re getting a CaseClauseError. Can you put up an issue with a repro?
Pegasus.Components is an internal tool used to help parse PEG grammars; its functions generate NimbleParsec definitions that are used internally by Pegasus.
I was messing around a bit more; here’s an incomplete example of JSON parsing with PEG (a few post_traverse functions are missing). If you don’t tag or collect, the data comes in reverse order. When collected/tagged, it is a charlist. There are quite a few post_traverse functions in here.
Also, for the curious, it doesn’t come close to the performance of Jason. I haven’t had a chance to do any profiling, so I’m not sure where the bottlenecks are, but let’s be honest: having a human-readable grammar for JSON is pretty incredible.
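Since the reversed/charlist behaviour mentioned above tripped me up, here is a minimal sketch of the callback shape (JsonHelpers and collect_string are made-up names): NimbleParsec hands a :post_traverse function the rest of the input, the accumulated tokens (most recent first), a context, and position info, and expects {rest, acc, context} back.

```elixir
defmodule JsonHelpers do
  # A :post_traverse callback receives (rest, acc, context, line, offset).
  # acc arrives most-recent-first, so reverse it; when the tokens are
  # character codes, List.to_string/1 turns the charlist into a binary.
  def collect_string(rest, acc, context, _line, _offset) do
    {rest, [acc |> Enum.reverse() |> List.to_string()], context}
  end
end
```

So for the charlist ~c"olleh" accumulated while matching "hello", the callback yields ["hello"] as its single token.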
Thanks for the examples. Having seen this, I went back to the NimbleParsec docs; I hadn’t picked up that post_traverse() needs to return {rest, List.t(), context}, and had it return a map instead. Maybe some Livebook tutorials would be helpful; I’ll see if I can build something after I’ve gotten more familiar.
Edit: I had been parsing SVG path drawing instructions with very ugly regexes, with a series of
I expected a return for :cjk_char of ["我"], ["三"], and ["a"], but instead received:
[<<230>>]
[<<228>>]
["a"]
I think this is matching the character but collecting only the first byte. Is this expected behaviour (and if so, is it possible to work with non-alphanumeric, multi-byte ranges)?
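This looks like per-byte rather than per-codepoint matching. A quick pure-Elixir check of the bytes involved (the utf8_char suggestion at the end is my assumption about how the underlying NimbleParsec combinator could be switched, not something from Pegasus docs):

```elixir
# "我" is three bytes in UTF-8; a byte-oriented match emits only the first one.
:binary.bin_to_list("我")   # [230, 136, 145], and 230 is the <<230>> seen above
:binary.bin_to_list("三")   # [228, 184, 137], likewise 228
:binary.bin_to_list("a")    # [97], a single byte, so "a" survives intact

# <<230>> on its own is not valid UTF-8, hence the raw-binary printing:
String.valid?(<<230>>)      # false

# In NimbleParsec terms, matching whole codepoints with utf8_char/1,
# e.g. utf8_char([0x4E00..0x9FFF, ?a..?z]), would yield the full
# codepoint for 我 (0x6211) instead of its first byte.
```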