How do you write nested and recursive NimbleParsec parsers?

I’m having trouble implementing a parser. Either it enters an infinite recursive loop, or it only parses the first part of the input and drops the rest.

For context, I’m writing a parser for a query language, and it contains the ability to nest boolean expressions. Here’s a stripped-down version of what I’m working with:

defmodule CESQL.Parsec do
  import NimbleParsec

  value_identifier = ascii_string([?a..?z], min: 1, max: 20)

  boolean_operation =
    parsec(:expr)
    |> ignore(string(" "))
    |> choice([
      string("AND"),
      string("OR"),
      string("XOR")
    ])
    |> ignore(string(" "))
    |> parsec(:expr)

  expr =
    choice([
      value_identifier,
      boolean_operation
    ])

  defparsec :expr, expr
end

My end goal is to parse a boolean expression, and I intend to allow for complex nested expressions like (a AND b) OR c. But for starters, this is what I want out of the above code:

iex> CESQL.Parsec.expr("a AND b")
{:ok, ["a", "AND", "b"], "", _, _, _}

However, when I place value_identifier first in expr’s argument to choice/2, the parser takes the first value and drops the rest of the string, which looks like this:

iex> CESQL.Parsec.expr("a AND b")
{:ok, ["a"], " AND b", _, _, _}

Alternatively, when I place boolean_operation as the first choice, I believe the parser enters an infinite loop trying to find the start of an expression, because my test times out.

How can I get this working the way I want it to? I’ve tried using NimbleParsec.lookahead/2 every way I can think of, but I might be misunderstanding how it works, because I’ve had no luck.

I did a lot of parsing of SQL in this project: GitHub - cogini/ecto_extract_migrations: Elixir library to generate Ecto migrations from a PostgreSQL schema SQL file. Uses NimbleParsec and macro-style code generation.

It converts a Postgres database schema dump file into Ecto migrations.

3 Likes

I think I’m actually going to try leex and yecc instead of NimbleParsec. NimbleParsec is awesome, but since I’m trying to implement a query language (the CloudEvents query language), I think it’s more suited to the job. And I only recently discovered this additional feature of Erlang and I’m really excited to use it! For anyone else who wants to parse a language using traditional language-parsing tools, check out Tokenizing and parsing in Elixir with yecc and leex – Andrea Leopardi

Since there are ABNF grammars for SQL around, you might also look at ex_abnf

Using Leex and Yecc is very workable (and what I used for the fundamentals of ex_cldr). But because parsers are fun, here’s a reasonable attempt at parsing your logical expressions in nimble_parsec using a common approach to de-structuring such parsers:

defmodule CESQL.Parsec do
  @moduledoc """
  Based upon the simple grammar of:

    Expression ⇒ Term {AND Term}
    Term ⇒ Factor {OR Factor}
    Factor ⇒ Item | "-" Factor
    Item ⇒ Identifier | "(" Expression ")"

  """
  import NimbleParsec

  whitespace = times(ascii_char([?\s, ?\t]), min: 1)

  # An expression
  defparsec(
    :expr,
    ignore(optional(whitespace))
    |> choice([
      parsec(:term) |> parsec(:op_and) |> parsec(:expr) |> reduce(:postfix),
      parsec(:term)
    ])
  )

  # A term
  defparsec(
    :term,
    choice([
      parsec(:factor) |> parsec(:op_or) |> parsec(:term) |> reduce(:postfix),
      parsec(:factor)
    ])
  )

  # A factor
  defparsec(
    :factor,
    choice([
      parsec(:identifier),
      ignore(ascii_char([?(]))
      |> ignore(optional(whitespace))
      |> parsec(:expr)
      |> ignore(optional(whitespace))
      |> ignore(ascii_char([?)]))
    ])
  )

  # OR operation
  defparsec(
    :op_or,
    ignore(whitespace) |> string("OR") |> ignore(whitespace)
  )

  # AND operation
  defparsec(
    :op_and,
    ignore(whitespace) |> string("AND") |> ignore(whitespace)
  )

  # An identifier (lower case letters)
  defparsec(
    :identifier,
    times(ascii_char([?a..?z]), min: 1) |> reduce({List, :to_string, []})
  )

  # Convert infix list to postfix for more regular "AST"
  def postfix([term_1, op, term_2]) do
    [op, term_1, term_2]
  end

  # Just pattern matching some examples
  def test do
    {:ok, [["AND", "a", "b"]], "", %{}, _, _} = expr("a AND b")
    {:ok, [["OR", "a", "b"]], "", %{}, _, _} = expr("a OR b")

    # precedence
    {:ok, [["AND", "a", ["OR", "b", "c"]]], "", %{}, _, _} = expr("a AND b OR c")
    {:ok, [["AND", ["OR", "a", "b"], "c"]], "", %{}, _, _} = expr("a OR b AND c")

    # nesting
    {:ok, [["OR", "a", ["AND", "b", "c"]]], "", %{}, _, _} = expr("a OR (b AND c)")

    :ok
  end
end
2 Likes

/Self-promotion but I also have pegasus: Pegasus — pegasus v0.2.2 if you like grammars that actually look like grammars

2 Likes

@ityonemo wow, that’s very cool. Will definitely be taking that for a spin!

This post was so useful !
Writing recursive parsers here!

Thanks for such insight!