NimbleParsec: confused about vars vs functions, can't figure out how to create reusable combinators

dmitriid · September 4, 2021, 7:32am

Hi all. I’m no stranger to parser combinators (see my long-abandoned pegjs for erlang). But there’s an issue I can’t seem to understand with NimbleParsec.

A very common thing to do in nearly every parser is to define skippable whitespace/blankspace (among many other common reusable combinators).

Example (from my own pegjs parser definition):

Blankspace
  = (WhiteSpace / LineTerminatorSequence / Comment)*

WhiteSpace
  = "\t"
  / "\v"
  / "\f"
  / " "
  / "\u00A0"
  / "\uFEFF"
  / Zs

// https://www.compart.com/en/unicode/category/Zs
Zs = [\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]

// LineTerminatorSequence and Comment omitted for brevity

And then this would be used, well, everywhere:

// A rule is identifier=value
// There can be any number of whitespace in between
Rule
  = IdentifierName 
    Blankspace
    (StringLiteral Skippable)?
    "=" 
    Blankspace
    Expression
    EOS

Now, the trouble starts when converting this to NimbleParsec.

The first part is easy:

zs = utf8_char([0x0020, 0x00A0, 0x1680, 0x2000..0x200A, 0x202F, 0x205F, 0x3000])

whitespace_character =
  choice([
    ascii_char([?\t, ?\v, 32, ?\f]),
    utf8_char([0x00A0, 0xFEFF]),
    zs
  ])

blankspace = choice([whitespace_character, line_terminator_sequence]) |> repeat()

But then using it… how?

This will not work:

rule = repeat(ascii_char(not: 32)) |> blankspace

** (CompileError) undefined function blankspace/1

You can wrap it in an additional repeat or optional, but this is redundant and code readability suffers:

## We have already defined blankspace as optional in its own definition
rule = repeat(ascii_char(not: 32)) |> optional(blankspace)

I’ve tried to convert it to a function:

def zs do
  utf8_char([0x0020, 0x00A0, 0x1680, 0x2000..0x200A, 0x202F, 0x205F, 0x3000])
end

def whitespace_character do
  choice([
    # space
    ascii_char([?\t, ?\v, 32, ?\f]),
    utf8_char([0x00A0, 0xFEFF]),
    zs()
  ])
  |> label("whitespace")
end

def blankspace do
  choice([whitespace_character()]) |> repeat()
end

rule = repeat(ascii_char(not: 32)) |> blankspace()

** (CompileError) undefined function blankspace/1

I’ve tried converting rule to a function as well, but nothing works.

So now I’m scratching my head and hoping that the collective wisdom of Elixir Forum will help me.

kip · September 4, 2021, 7:36am

When combining rules that aren’t defparsec you have two choices:

One approach, using your example, is concat(blankspace()).
I prefer (and the docs recommend) creating a module that holds the combinators. Then you can do:

def blankspace(combinator \\ empty()) do
  # combinator is the argument (whatever was built so far), not a function call
  combinator
  |> choice([
    ascii_char([?\t, ?\v, 32, ?\f]),
    utf8_char([0x00A0, 0xFEFF]),
    zs()
  ])
end

and then

repeat(ascii_char(not: 32)) |> blankspace()

will work.

Note that the module holding the combinators needs to be imported into the main module.

kip · September 4, 2021, 7:43am

That was a messy post because I hit send too fast. Sorry for the zillion quick edits. Basically:

If you’ve defined combinators as function/0, then calling them requires wrapping them in concat/1. Hence concat(blankspace()).
You can alternatively define them with a default argument of empty(), which is the empty combinator, and apply combinators to that argument (my example above).
Combinators need to be in a separate module from the main defparsec and imported there, because they are evaluated at compile time, not runtime.
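
The three points above can be sketched roughly as follows. This is only an illustration: the module names ParserHelpers and MyParser, the parser names rule_concat and rule_piped, and the simplified identifier rule are hypothetical, and the Mix.install line assumes the code is run as a standalone script rather than inside a Mix project.

```elixir
# Assumption: run as a standalone .exs script. In a Mix project, add
# nimble_parsec to deps and put each module in its own file instead.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule ParserHelpers do
  # Hypothetical helper module holding the reusable combinators.
  import NimbleParsec

  # function/0 style: returns a standalone combinator.
  def whitespace_character do
    choice([
      ascii_char([?\t, ?\v, ?\f, ?\s]),
      utf8_char([0x00A0, 0xFEFF, 0x1680, 0x2000..0x200A, 0x202F, 0x205F, 0x3000])
    ])
  end

  # Default-argument style: receives whatever was built so far
  # (the empty combinator by default) and extends it.
  def blankspace(combinator \\ empty()) do
    combinator |> repeat(whitespace_character())
  end
end

defmodule MyParser do
  import NimbleParsec
  # The helpers are called while this module is being compiled,
  # so they must already be compiled and imported here.
  import ParserHelpers

  # Simplified stand-in for IdentifierName.
  identifier = ascii_string([?a..?z], min: 1)

  # blankspace/0 used as a standalone combinator, joined with concat/2.
  defparsec :rule_concat, identifier |> concat(blankspace()) |> string("=")

  # blankspace/1 used directly in the pipe.
  defparsec :rule_piped, identifier |> blankspace() |> string("=")
end
```

Both parsers should accept the same input, for example MyParser.rule_piped("foo  ="), and produce the same tokens; the only difference is how blankspace is attached to the preceding combinator.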

I have a pretty straightforward example which might help.

dmitriid · September 4, 2021, 7:51am

Ah. Now I got it!

(It didn’t help that the compiler error was pointing at the wrong thing. The function was defined and another symbol was undefined, but the compiler didn’t point to that.)

kip · September 4, 2021, 7:52am

I hear you. Because it’s meta-programming all the way down, with code generation at compile time, the errors can be a bit … difficult to interpret sometimes.

kip · September 4, 2021, 7:59am

BTW, just in case it’s helpful (noticing you are working with unicode character classes), you may find ex_unicode_set useful. It can generate nimble_parsec lists from unicode sets that you can insert directly into utf8_char/1. See Unicode.Set.to_utf8_char/1. I’ve just noticed the docs need some work (there are none) but there is an example here.

Example

# Codepoints in the unicode Zs class (whitespace)
iex> Unicode.Set.to_utf8_char "\\p{Zs}"
{:ok, [32, 160, 5760, 8192..8202, 8239, 8287, 12288]}

# Codepoints NOT in the unicode Zs class
iex> Unicode.Set.to_utf8_char "\\P{Zs}"
{:ok,
 [
   not: 32,
   not: 160,
   not: 5760,
   not: 8192..8202,
   not: 8239,
   not: 8287,
   not: 12288
 ]}
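
As a rough sketch of the "insert directly" step, the lists shown above can be handed to utf8_char/1 or utf8_string/2 as they are. To keep the example free of the extra dependency, the Zs list is hard-coded from the iex output rather than generated; the module name ZsDemo and the parser names are hypothetical, and the Mix.install line assumes a standalone script.

```elixir
# Assumption: run as a standalone .exs script.
Mix.install([{:nimble_parsec, "~> 1.0"}])

defmodule ZsDemo do
  import NimbleParsec

  # Copied from the iex output above. With ex_unicode_set available, this list
  # would come from Unicode.Set.to_utf8_char/1 instead of being hard-coded.
  zs = [32, 160, 5760, 8192..8202, 8239, 8287, 12288]

  # Same shape as the negated output: [not: 32, not: 160, ...]
  not_zs = Enum.map(zs, &{:not, &1})

  # One or more codepoints from the Zs class.
  defparsec :space_separators, times(utf8_char(zs), min: 1)

  # One or more codepoints that are NOT in the Zs class.
  defparsec :word, utf8_string(not_zs, min: 1)
end
```

With these definitions, ZsDemo.word("foo bar") should stop at the space and leave " bar" as the unparsed rest.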

dmitriid · September 4, 2021, 8:00am

In the end I had to do this in my helper module:

  def blankspace(combinator \\ empty()) do
    combinator |> repeat(choice([whitespace_character(), line_terminator_sequence()]))
  end

Otherwise you can’t use it in pipes:

rule = some_combinator() |> blankspace()

kip · September 4, 2021, 8:01am

Yes, correct - you need to wrap it in concat/1 or define it with the default parameter empty() as you have done. Which makes sense if you think about it - the result of one combinator needs to be passed to the next in some manner …
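
The "passed to the next" point can be illustrated without NimbleParsec at all. In the sketch below, plain lists stand in for combinators; this is an analogy made up for illustration, not the library's API, and all names are hypothetical.

```elixir
defmodule PipeSketch do
  # Plain lists stand in for combinators here; this is NOT NimbleParsec.

  # Zero-arity: returns a standalone piece, unaware of what came before it.
  def blank, do: [:blank]

  # Explicit join of two pieces, playing the role of concat/2.
  def join(left, right), do: left ++ right

  # Pipe-friendly: takes what was built so far (nothing by default) and extends it.
  def blank_after(so_far \\ []), do: join(so_far, blank())
end

# `x |> f()` is just `f(x)`, so the piped value must be accepted as an argument.
via_join = [:ident] |> PipeSketch.join(PipeSketch.blank())
via_default = [:ident] |> PipeSketch.blank_after()

IO.inspect(via_join == via_default)

# `[:ident] |> PipeSketch.blank()` would call blank/1, which does not exist.
# That is the same shape as the "undefined function blankspace/1" error earlier.
```

Here join/2 corresponds to wrapping in concat, and blank_after/1 corresponds to defining the helper with a default argument.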