This library converts PEG grammars into NimbleParsec parsers. You can “hook” extra functions onto the combinators generated from the PEG language.
Unlike yecc/leex, the PEG grammar is extremely easy to read and write given the ABNF descriptions found in most RFCs and other standards documents (e.g. ECMA). It also leverages the extremely effective compile-time nature of the NimbleParsec library.
Wow, I didn’t realize I needed this library. Much, much easier to use than regular expressions. PEG seems like it would be an excellent addition to the core library as an alternative to Regex.
The following code is me just messing around trying to get things to work, but I hope it helps someone.
I don’t know what to tell you, I definitely don’t think I’ve gotten smarter, but I tried using these parsers (BNF/ABNF/lex) and I just couldn’t get them to do what I wanted. I remember my frustration very clearly because I had to spend a few days writing a custom parser.
However, for some reason, I seem to be able to use PEG just fine, so thank you for the library. Damn, I’d give you five likes for making the library, but I only have one to give. And nimble_parsec is also brilliant.
To amuse myself further with this discovery of PEG, I did some benchmarking on parsing emails. Turns out they are quite comparable in speed: PEG is a tiny bit faster (5-10%) and a lot more consistent at the P99. Hopefully there’s nothing really funky happening in the benchmark. I’ve put the code and benchmark way down below.
The ridiculous thing I found in this experiment is that the regex is not really capturing what I would expect. The PEG is a bit longer, but it’s fairly straightforward to understand and returns what I would expect. I grabbed the regex from Ultimate Regex Cheat Sheet - KeyCDN Support, but wrote the PEG for emails myself.
Looks like Python is changing its internal parser to PEG. From some cursory research, https://janet-lang.org/ has PEG by default instead of PCRE/regex in the standard library.
After all this exploration, I’m surprised Pegasus isn’t more popular. I’m guessing it might be because people don’t quite understand how to use it despite its awesomeness (or they don’t need to write grammars). Or it might be because it’s not so easy to do a mental-model replacement of Regex. It looks like a ton of people use nimble_parsec, though, so it’s probably just an ergonomics thing. Here are some of my first impressions from trying to get it working.
From the docs:
parser options [:collect, :token, :tag, :post_traverse, :ignore] work on elements of a PEG, but I didn’t figure that out until I read the code from the other repos.
parser options [:start_position, :export, :parser, :alias] seem to operate on the data and I’m not sure what they are for.
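For what it’s worth, here’s a minimal sketch of how the first group of options seems to attach per rule, based on the parser_from_string/2 shape used later in this thread (DigitsParser and the Number rule are illustrative names I made up, not anything from Pegasus itself):

```elixir
defmodule DigitsParser do
  require Pegasus
  import NimbleParsec

  Pegasus.parser_from_string(
    """
    Number <- [0-9]+
    """,
    # Options are keyed by rule name: :collect turns the matched
    # charlist into a binary, :tag wraps the result in {:number, [...]}.
    Number: [collect: true, tag: :number]
  )

  defparsec :parse_number, parsec(:Number)
end
```

With that sketch, parsing "42" should yield a tagged result like [number: ["42"]] rather than a bare charlist of character codes.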
I was just trying to figure out how to map the functions in Pegasus onto something I understood from other languages, i.e.:
Regex.capture
Regex.match
So I had to go through your other packages to understand that I actually wanted defparsec, because I first tried using [parser: true, export: true].
Regarding captures, I think the issues I had are just documentation-related. I ended up going through your other codebases to figure out that I needed to use [:tag, :collect, :post_traverse]. There’s a little more boilerplate than regex, but I don’t think that can be gotten rid of. Perhaps the default should be [collect: true], though? It was very confusing to see a series of character codes.
Regarding match, I added a bit of extra boilerplate to get the equivalent functionality of Regex.match.
It would definitely be nice to have both a match and a capture setup that work cleanly after putting in a grammar string.
defmodule EmailParser do
  # Module wrapper and imports added so the Pegasus/NimbleParsec macros
  # below have a home.
  require Pegasus
  import NimbleParsec

  @email_options [
    Name: [tag: :name],
    Domain: [tag: :domain],
    At: [ignore: true],
    TLD: [tag: :tld],
    Dot: [ignore: true]
  ]

  Pegasus.parser_from_string(
    """
    Email <- Name At (Domain)+ TLD
    Name <- ([a-zA-Z0-9_\\.\\-]+)
    Domain <- ([A-Za-z0-9\\-]+ Dot)+
    TLD <- ([A-Za-z\\.]) ([A-Za-z\\.])+
    Dot <- '.'
    At <- '@'
    """,
    @email_options
  )

  defparsec :parse_email, parsec(:Email)

  def peg_match_email(email) do
    case parse_email(email) do
      # require the whole input to be consumed (rest == "")
      {:ok, _result, "", _, _, _} -> :ok
      _ -> :error
    end
  end
end
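To show the Regex.match-style ergonomics, here is a hedged sketch of how the helper above might be called, assuming the definitions are wrapped in a module (EmailParser is an illustrative name, not from the library):

```elixir
# Assuming the grammar above compiled into a module named EmailParser:
EmailParser.peg_match_email("jane_doe@mail.example.com")
# the Email rule consumes the whole input, so this should return :ok

EmailParser.peg_match_email("not-an-email")
# there is no At/Domain/TLD to consume, so this should return :error
```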
Anyway, thank you and great work. I’m halfway done writing a parser for semi-structured text; if it’s useful, I could take some notes on what was tough to figure out.
I’ve literally read about two books on how to design an interpreter/compiler and built a primitive language in lex/yacc. If you asked me nowadays how I did it, I would have no answer; I don’t remember, the material just didn’t entirely click for me.
My interest got sparked when I understood how macros work in Elixir, and how powerful a concept that is. I’ve used those concepts in some interesting binary parsing libraries. The limitation, of course, is clear: the syntax must abide by Elixir parser rules, but in general this covers 99% of my DSL needs.
Taking into consideration that I also had courses at university covering the topic of grammars, I think the topic is either too academic or complex for my liking, or there is a steep learning curve until you correctly understand the practical application.
I am absolutely not a fan of regex; even to this day I try to avoid it whenever possible, and thank God for platforms that let you try regexes against inputs.
Just from the praise above, I will try PEG when I have some free time; it seems much more friendly compared to the alternatives.
Thanks to @ityonemo for the library and @tj0 for bringing this up on the forum. It sure looks like a better way than long, indecipherable regular expressions. I have two questions now that I’ve started exploring:
are there examples of how post_traverse works? (My attempts have only yielded an obscure CaseClauseError.)
I noted that there is a Pegasus.Components module. How is this supposed to be used?
Sounds like a bug if you’re getting a CaseClauseError. Can you put up an issue with a repro?
Pegasus.Components is an internal tool used to help parse PEG grammars; its functions generate NimbleParsec definitions that are used internally by Pegasus.
I was messing around a bit more; here’s an incomplete example of JSON parsing with PEG (a few post_traverse functions are missing). If you don’t tag or collect, the data comes in reverse order. When collected/tagged, it is a charlist. There are quite a few post_traverse functions in here.
Also, for the curious, it doesn’t come close to the performance of Jason. I haven’t had a chance to do any profiling, so I’m not sure where the bottlenecks are, but let’s be honest: having a human-readable grammar for JSON is pretty incredible.
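Since the reversed/charlist behaviour mentioned above tripped me up, here is a minimal sketch of the callback shape (JsonHelpers and collect_string are made-up names): NimbleParsec hands a :post_traverse function the rest of the input, the accumulated tokens (most recent first), a context, and position info, and expects {rest, acc, context} back.

```elixir
defmodule JsonHelpers do
  # A :post_traverse callback receives (rest, acc, context, line, offset).
  # acc arrives most-recent-first, so reverse it; when the tokens are
  # character codes, List.to_string/1 turns the charlist into a binary.
  def collect_string(rest, acc, context, _line, _offset) do
    {rest, [acc |> Enum.reverse() |> List.to_string()], context}
  end
end
```

So for the charlist ~c"olleh" accumulated while matching "hello", the callback yields ["hello"] as its single token.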
Thanks for the examples. Having seen this, I went back to the NimbleParsec docs; I hadn’t picked up that post_traverse() needs to return {rest, List.t(), context}, and had it return a map instead. Maybe some Livebook tutorials would be helpful; I’ll see if I can build something after I’ve gotten more familiar.
Edit: I had been parsing SVG path drawing instructions with very ugly regexes, with a series of
I expected a return for :cjk_char of ["我"], ["三"], and ["a"], but instead received:
[<<230>>]
[<<228>>]
["a"]
I think this is matching the character but collecting only the first byte. Is this expected behaviour (and if so, is it possible to work with non-alphanumeric, multi-byte ranges)?
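This looks like per-byte rather than per-codepoint matching. A quick pure-Elixir check of the bytes involved (the utf8_char suggestion at the end is my assumption about how the underlying NimbleParsec combinator could be switched, not something from Pegasus docs):

```elixir
# "我" is three bytes in UTF-8; a byte-oriented match emits only the first one.
:binary.bin_to_list("我")   # [230, 136, 145], and 230 is the <<230>> seen above
:binary.bin_to_list("三")   # [228, 184, 137], likewise 228
:binary.bin_to_list("a")    # [97], a single byte, so "a" survives intact

# <<230>> on its own is not valid UTF-8, hence the raw-binary printing:
String.valid?(<<230>>)      # false

# In NimbleParsec terms, matching whole codepoints with utf8_char/1,
# e.g. utf8_char([0x4E00..0x9FFF, ?a..?z]), would yield the full
# codepoint for 我 (0x6211) instead of its first byte.
```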