ExSpirit - Spirit-style PEG-like parsing library for Elixir

Elixir has, well, way too many parsing libraries, and I tried a lot of them to parse a little experiment of mine, however I kept running into issues, everything from error reporting being near useless (leex/yecc) to missing a lot of primitives (Combine) to all of them seeming to have a bit of overhead in a programming perspective. Well after going through no less than 6 parsing libraries to parse my little experiment from both the Elixir and Erlang worlds, I decided to port an old project I used to work on (rightfully proclaimed fastest parser in the world, for good reason, this new one is not being built for speed though).

So Iā€™m making ExSpirit, thus far only having the parsing portion (lexā€™ing is less useful with this style parser though I plan to support that later if I see a need or anyone gives me a need), with the name of ExSpirit.Parser.

Iā€™ll describe it after this paragraph, but first, is it worth putting this on hex.pm considering how many other parsing libraries there are? I really hate lots of competing libraries where none is really dominate and mine would just add to that problemā€¦

However, to use mine you have to define a module where your parsers will belong, here is mine form one of my tests:

defmodule ExSpirit.Tests.Parser do
  use ExSpirit.Parser

  defrule testrule(
    seq([ uint(), lit(?\s), uint() ])
    )

  defrule testrule_map(
    seq([ uint(), lit(?\s), uint() ])
    ), map: Enum.map(fn i -> i-40 end)

  defrule testrule_fun(
    seq([ uint(), lit(?\s), uint() ])
    ), fun: (fn context -> %{context | result: {"altered", context.result}} end).()

  defrule testrule_context(context) do
    %{context | result: "always success"}
  end

This is not how it would traditionally be written, this is showing the advanced usage features, however the use ExSpirit.Parser at the top sets up everything (yeesh the syntax coloring on these forums need helpā€¦). The doctests that use this specific test module will show you most of the currently existing capabilities:


    # `lit` matches a specific string or character
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test 42", lit("Test"))
    iex> {context.error, context.result, context.rest}
    {nil, nil, " 42"}

    # `lit` matches a specific string or character
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test 42", lit(?T))
    iex> {context.error, context.result, context.rest}
    {nil, nil, "est 42"}

    # `uint` parses out an unsigned integer, default radix of 10 with a min size of 1 and max of unlimited
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("42", uint())
    iex> {context.error, context.result, context.rest}
    {nil, 42, ""}

    # `|>` Returns the result of the last parser in the pipe chain,
    # `lit` always returns nil for example
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("42Test", uint() |> lit("Test"))
    iex> {context.error, context.result, context.rest}
    {nil, nil, ""}

    # `|>` Returns the result of the last parser in the pipe chain
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("42Test64", uint() |> lit("Test") |> uint())
    iex> {context.error, context.result, context.rest}
    {nil, 64, ""}

    # `uint` parsing out base-2
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("101", uint(2))
    iex> {context.error, context.result, context.rest}
    {nil, 5, ""}

    # `uint` parsing out base-16 lower-case, can be mixed too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("ff", uint(16))
    iex> {context.error, context.result, context.rest}
    {nil, 255, ""}

    # `uint` parsing out base-16 upper-case, can be mixed too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("FF", uint(16))
    iex> {context.error, context.result, context.rest}
    {nil, 255, ""}

    # `seq` parses a sequence returning the return of all of them, removing nils,
    # as a list if more than one or the raw value if only one, if any fail then
    # all fail.
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", seq([uint(), lit(" "), uint()]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, [42, 64], ""}

    # `seq` Here is sequence only returning a single value
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42Test", seq([uint(), lit("Test")]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, 42, ""}

    # `alt` parses a set of alternatives in order and returns the first success
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("FF", alt([uint(16), lit("Test")]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, 255, ""}

    # `alt` parses a set of alternatives in order and returns the first success
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("Test", alt([uint(16), lit("Test")]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, nil, ""}

    # You can use `defrule`s as any other terminal parser
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, [42, 64], ""}

    # `defrule`'s also set up a stack of calls down a context so you know
    # 'where' an error occured, so name the rules descriptively
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 fail", testrule())
    iex> {contexts.error.context.rulestack, contexts.result, contexts.rest}
    {[:testrule], nil, "fail"}

    # `defrule`s can map the result to return a different one:
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule_map())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, [2, 24], ""}

    # `defrule`s can also operate over the context itself to do anything
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule_fun())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, {"altered", [42, 64]}, ""}

    # `defrule`s can also be a context function by only passing in `context`
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule_context())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, "always success", "42 64"}

    # `tag` can tag the output from a parser
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("ff", tag(:integer, uint(16)))
    iex> {context.error, context.result, context.rest}
    {nil, {:integer, 255}, ""}

    # You can have a skipper too, skippers should be run at the start of any
    # terminal parser, it runs only once per pass, if you want it to repeat then
    # set the skipper up so it repeats, a good one is `repeat(lit(?\\s))` for
    # example
    iex> import ExSpirit.Tests.Parser
    iex> context = parse(" 42 ", uint(), skipper: lit(?\\s))
    iex> {context.error, context.result, context.rest}
    {nil, 42, " "}

    # You can turn off a skipper for a parser as well with `no_skip`
    iex> import ExSpirit.Tests.Parser
    iex> context = parse(" Test:42 ", lit("Test:") |> no_skip(uint()), skipper: lit(?\\s))
    iex> {context.error, context.result, context.rest}
    {nil, 42, " "}
    {nil, 42, " "}

    # You can change a skipper for a parser as well with `skip`
    iex> import ExSpirit.Tests.Parser
    iex> context = parse(" Test:\t42 ", lit("Test:") |> skip(uint(), lit(?\\t)), skipper: lit(?\\s))
    iex> {context.error, context.result, context.rest}
    {nil, 42, " "}

    # `char` can parse out any single character
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char())
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

    # `char` can parse out any 'specific' single character as well
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char(?T))
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

    # `char` can parse out any 'specific' single character from a range too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char(?A..?Z))
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

    # `char` can parse out any 'specific' single character from a list of
    # characters or ranges too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char([?a..?z, ?T]))
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

So unlike the old C++ library that I worked on years ago, this one is not operator-explosion, though that really would help make it readable I think, I decided I wanted descriptive names for everything.

I made a benchmark of simple integer parsing and of datetime parsing (the datetime parsing example from the Combine parsing library specifically), and benched with them benchee in a quick setup, the results:

##### With input parse_datetime #####
Name                ips        average  deviation         median
ex_spirit      204.55 K        4.89 Ī¼s    Ā±12.16%        4.70 Ī¼s
combine         76.59 K       13.06 Ī¼s    Ā±44.50%       16.00 Ī¼s

Comparison:
ex_spirit      204.55 K
combine         76.59 K - 2.67x slower

##### With input parse_int_10 #####
Name                ips        average  deviation         median
ex_spirit      626.87 K        1.60 Ī¼s    Ā±31.93%        1.60 Ī¼s
combine        162.52 K        6.15 Ī¼s    Ā±12.90%        6.20 Ī¼s

Comparison:
ex_spirit      626.87 K
combine        162.52 K - 3.86x slower

So my style does not seem slower at least, and seems to be a little bit faster, plus it is more built for the style that I expect, which is not necessarily the style everyone expects, but rather it works well for me. ^.^

13 Likes

Oh, an example of an error message, for the above contexts = parse("42 fail", testrule()) example in the doctests, the full error message is output as:

<unknown>:1:4: Parsing uint with radix of 10 had 0 digits but 1 minimum digits were required
        RuleStack: [testrule]
        Input: fail

<unknown> is the filename, which is that because it is just a raw string parse, the 1 is the line number, and the 4 is the character it tried to parse but failed on, after that is the parser-specific error message, then on a newline the stack of rules, and on a newline after that the specific input it failed on (capped). :slight_smile:

I can easily change the format as well if it fits in better with things too, suggestions are as always useful. :slight_smile:

EDIT: And the error message is not output to stdout or something irritating like that, it is held in the context (return value from parse) error key (which is also a map of data with little useful bits), so access the error via context.error assuming your output variable is context. :slight_smile:

The context map/struct holds the result, the unparsed rest of the input (good to match on if you want to verify all the input was taken, or not), the filename, byte position, line number, column it stopped on (in UTF8 codepoints for text), and a few other minor things you will probably never need but are useful if you do.

2 Likes

The doctests have grown to contain more parsers:


    # `lit` matches a specific string or character
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test 42", lit("Test"))
    iex> {context.error, context.result, context.rest}
    {nil, nil, " 42"}

    # `lit` matches a specific string or character
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test 42", lit(?T))
    iex> {context.error, context.result, context.rest}
    {nil, nil, "est 42"}

    # `uint` parses out an unsigned integer, default radix of 10 with a min size of 1 and max of unlimited
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("42", uint())
    iex> {context.error, context.result, context.rest}
    {nil, 42, ""}

    # `|>` Returns the result of the last parser in the pipe chain,
    # `lit` always returns nil for example
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("42Test", uint() |> lit("Test"))
    iex> {context.error, context.result, context.rest}
    {nil, nil, ""}

    # `|>` Returns the result of the last parser in the pipe chain
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("42Test64", uint() |> lit("Test") |> uint())
    iex> {context.error, context.result, context.rest}
    {nil, 64, ""}

    # `uint` parsing out base-2
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("101", uint(2))
    iex> {context.error, context.result, context.rest}
    {nil, 5, ""}

    # `uint` parsing out base-16 lower-case, can be mixed too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("ff", uint(16))
    iex> {context.error, context.result, context.rest}
    {nil, 255, ""}

    # `uint` parsing out base-16 upper-case, can be mixed too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("FFF", uint(16))
    iex> {context.error, context.result, context.rest}
    {nil, 4095, ""}

    # `seq` parses a sequence returning the return of all of them, removing nils,
    # as a list if more than one or the raw value if only one, if any fail then
    # all fail.
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", seq([uint(), lit(" "), uint()]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, [42, 64], ""}

    # `seq` Here is sequence only returning a single value
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42Test", seq([uint(), lit("Test")]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, 42, ""}

    # `alt` parses a set of alternatives in order and returns the first success
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("FF", alt([uint(16), lit("Test")]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, 255, ""}

    # `alt` parses a set of alternatives in order and returns the first success
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("Test", alt([uint(16), lit("Test")]))
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, nil, ""}

    # You can use `defrule`s as any other terminal parser
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, [42, 64], ""}

    # `defrule`'s also set up a stack of calls down a context so you know
    # 'where' an error occured, so name the rules descriptively
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 fail", testrule())
    iex> {contexts.error.context.rulestack, contexts.result, contexts.rest}
    {[:testrule], nil, "fail"}

    # `defrule`s can map the result to return a different one:
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule_map())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, [2, 24], ""}

    # `defrule`s can also operate over the context itself to do anything
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule_fun())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, {"altered", [42, 64]}, ""}

    # `defrule`s can also be a context function by only passing in `context`
    iex> import ExSpirit.Tests.Parser
    iex> contexts = parse("42 64", testrule_context())
    iex> {contexts.error, contexts.result, contexts.rest}
    {nil, "always success", "42 64"}

    # `tag` can tag the output from a parser
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("ff", tag(:integer, uint(16)))
    iex> {context.error, context.result, context.rest}
    {nil, {:integer, 255}, ""}

    # You can have a skipper too, skippers should be run at the start of any
    # terminal parser, it runs only once per pass, if you want it to repeat then
    # set the skipper up so it repeats, a good one is `repeat(lit(?\\s))` for
    # example
    iex> import ExSpirit.Tests.Parser
    iex> context = parse(" 42 ", uint(), skipper: lit(?\\s))
    iex> {context.error, context.result, context.rest}
    {nil, 42, " "}

    # You can turn off a skipper for a parser as well with `no_skip`
    iex> import ExSpirit.Tests.Parser
    iex> context = parse(" Test:42 ", lit("Test:") |> no_skip(uint()), skipper: lit(?\\s))
    iex> {context.error, context.result, context.rest}
    {nil, 42, " "}
    {nil, 42, " "}

    # You can change a skipper for a parser as well with `skip`
    iex> import ExSpirit.Tests.Parser
    iex> context = parse(" Test:\t42 ", lit("Test:") |> skip(uint(), lit(?\\t)), skipper: lit(?\\s))
    iex> {context.error, context.result, context.rest}
    {nil, 42, " "}

    # `char` can parse out any single character
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char())
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

    # `char` can parse out any 'specific' single character as well
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char(?T))
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

    # `char` can parse out any 'specific' single character from a range too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char(?A..?Z))
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

    # `char` can parse out any 'specific' single character from a list of
    # characters or ranges too
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", char([?a..?z, ?T]))
    iex> {context.error, context.result, context.rest}
    {nil, ?T, "est"}

    # `ignore` will run the parser but return no result
    iex> import ExSpirit.Tests.Parser
    iex> context = parse("Test", ignore(char([?a..?z, ?T])))
    iex> {context.error, context.result, context.rest}
    {nil, nil, "est"}

    # `branch` is mostly for efficiency, it is useful for when you have a single
    # parser that is then followed by a specific parser.  It takes two arguments,
    # the first of which is the initial 'symbol' parser, the second is a map of
    # %{value => parserFuns}, the output of the symbol parser is used to look up
    # the value in the map, then the found parser is run.  If the symbol parser
    # fails, or the key is not in the map, or the post-parser fails then this
    # fails.  Because of the anonymous functions this has a slight overhead so
    # only use this above `choice` when the amount is at least more than a few.
    # This returns the output of the post-parser only.
    iex> import ExSpirit.Tests.Parser
    iex> symbol_map = %{?b => &uint(&1, 2), ?d => &uint(&1, 10), ?x => &uint(&1, 16)}
    iex> context = parse("b101010", branch(char(), symbol_map))
    iex> {context.error, context.result, context.rest}
    {nil, 42, ""}
    iex> context = parse("d213478", branch(char(), symbol_map))
    iex> {context.error, context.result, context.rest}
    {nil, 213478, ""}
    iex> context = parse("xe1DCf", branch(char(), symbol_map))
    iex> {context.error, context.result, context.rest}
    {nil, 925135, ""}

    # `symbols` takes a ExSpirit.TST, which is a structure designed for fast
    # lookups, though slow insertions, so please cache the data structure at
    # compile-time if possible.  This `symbols` parser will take the text input
    # stream and match it on the TST to find the longest-matching string, then
    # it will run the parser on it like it is done in `branch`.  Similar
    # semantics to branch otherwise.
    iex> import ExSpirit.Tests.Parser
    iex> alias ExSpirit.TST, as: TST
    iex> symbol_tst = TST.new() |> TST.add_text("int", &uint(&1)) |> TST.add_text("char", &char(&1))
    iex> context = parse("int42", symbols(symbol_tst))
    iex> {context.error, context.result, context.rest}
    {nil, 42, ""}
    iex> context = parse("charT", symbols(symbol_tst))
    iex> {context.error, context.result, context.rest}
    {nil, ?T, ""}


1 Like

I ended up using it more than I expected so I published it: https://hex.pm/packages/ex_spirit

2 Likes

For some reason I missed it when you posted the first posts two days ago.

This sounds really great! Parser Combinators (re-usable LL(*) parser libraries) are lovely tools. When I have some more time, Iā€™ll dissect it and see what impact your design choices vs. Combineā€™s have.

Something else that would be really great, is to have an implementation of a Generalized LL(*) parser, which I think that Elixir, with its concurrent properties, might be a extremely good fit for.

1 Like

This looks superb. Congratulations! :raised_hands:

1 Like

I added 3 full examples in the source, with example runs on the readme page (roman numerals, integer adding, and a full blown simple xml parser): ExSpirit ā€“ ex_spirit v0.4.0 :slight_smile:

Combineā€™s seemed quite decent, it just did not have a couple of features that I really needed to be able to do in a different way for what I was doing. :slight_smile:

My benchmarking (feel free to check GitHub - OvermindDL1/ex_parsing_benchmark as Iā€™m curious if the same results happen for other people too, just run mix bench after a mix deps.get) shows mine to be a couple times faster than Combineā€™s, but that is probably because I am minimizing anonymous function usage (which changes my required design a bit compared to Combineā€™s, but similar still), however the benchmarking is so fast for both that benchee is reporting that it may be inaccurate for the times (though the difference in times between both should still be fairly accurate).

It is indeed a PEG-style LR parser though as you surmised, so no left recursive calls, that would just be infinite recursion like it would be in elixir itself, though non-left with alternatives would no doubt be fine. ^.^

Let me drop in the simple xml example I made here (ignore the forumā€™s bad syntax color coding of it of course) for an example of some non-left recursion (and I know this example could be made better, but I was in a hurry ^.^):

defmodule SimpleXML do
  use ExSpirit.Parser, text: true

  defrule text( repeat(char(-?<), 1) ), map: :erlang.iolist_to_binary()

  defrule tag_name( repeat(char([?a..?z, ?A..?Z, ?0..?9, ?_, ?-]), 1) )

  defrule tag(
    branch(seq([ lit(?<), tag_name, lit(?>) ]),
      fn name ->
        name = :erlang.iolist_to_binary(name)
        fn context ->
          context |> expect(tag(name, seq([
            repeat(node_()),
            lit("</"), lit(name), lit(?>),
          ])))
        end
      end)
  )

  defrule node_(
    alt([
      tag(),
      text(),
    ])
  )

  def from_string(input) do
    parse(input, node_())
  end

end
1 Like

Looks interesting, Iā€™ll have to dig a bit deeper!

One thing that Iā€™ve been missing (or at least not aware of) is a backtracking parser that can give me a nice parse tree, pairing each token / set of tokens with the rule it ended up matching.

1 Like

Iā€™d love a pseudo-code example of something you are wanting to do as Iā€™d like to give it a try. :slight_smile:

Update Time

I updated ExSpirit again, added in one of the major functionalities that I was wanting from the original Spirit C++ library, the state system, and I changed the simple_xml example to use it, which simplified its rules nicely. I also added a lot more meta-parsers, need to finish filling out the little parsers sometime (two I really want to get done is a floating point parser and a regex parser).

But the new simple_xml example:

defmodule SimpleXML do
  use ExSpirit.Parser, text: true

  defrule text( chars(-?<) )

  defrule tag_name( chars([?a..?z, ?A..?Z, ?0..?9, ?_, ?-]) )

  defrule tag(
    lit(?<) |> tag_name |> put_state(:tagname, :result) |> lit(?>) |> expect(seq([
      get_state_into(:tagname, tag(&1, repeat(node_()))),
      lit("</"), get_state_into(:tagname, lit(&1)), lit(?>)
    ]))
  )

  defrule node_(
    alt([
      tag(),
      text(),
    ])
  )

  def from_string(input) do
    parse(input, node_())
  end

end

Especially note this part:

    lit(?<) |> tag_name |> put_state(:tagname, :result) |> lit(?>) |> expect(seq([
      get_state_into(:tagname, tag(&1, repeat(node_()))),
      lit("</"), get_state_into(:tagname, lit(&1)), lit(?>)
    ]))

The first line has a put_state, it is taking the :result of whatever was just parsed and keyā€™ing it with :tagname, you can then pull it out and such later, of which I do twice using a very useful helper I made called get_state_into, you can have it output an old stored value into the result by going get_state_into(:tagname, :result) as one example, but that was not useful to me here, rather I wanted to pass it into the lit parser as an argument, but I obviously cannot do it ā€˜right nowā€™ as parsing has not started yet, so Iā€™m doing get_state_into(:tagname, lit(&1)), where it will pass in the value that is in the key into the labeled position &1. It also supports more as well, so you could do something like get_state_into([:blah, :blorp, {:bleep, 42}], some_parser(&1, &3, &2)), which will call some_parser with whatever is in the state keyā€™s of :blah, :bleep, and :blorp, in that order, but :bleep has a default of 42 if there is nothing in that key. This now makes it trivial to do something like XML parsing and have the ending tags actually be a parse error instead of some error at a later stage if you so wish. ^.^

But with this update (which is a major version bump because I slightly yet trivially changed the chars API necessarily, still not 1.0.0 yet though) I have enough done to do what I need. ^.^

1 Like

I just wanted to add some of my experience using ExSpirit. Iā€™m using it to lex programming languages for my syntax highlighting library. Itā€™s extremely expressive and very natural to use for this purpose.

This is to be expected, and not at all surprising, as most programming languages can be lexed (not parsed!) with a PEG parser, and PEG parsers are a solved problem. ExSpirit can of course be used to program a PEG parser, and is excellent for that task.

But in my opinion, what sets ExSpirit apart is what it provides beyond the tools for parsing context-free grammars. With ExSpirit you have permanent access to the entire context of the parser. This includes not only the remainder of the input and the current result, but also line and column information, a local state you can populate with whatever you want and a global state thatā€™s threaded through the parsing. This gives you power to go beyond what a PEG parser can do.

For example, did you know that a PEG parser canā€™t parse XML (which hs matching tags) or any kind of language that depends on indentation, like Python ot Haskell? Currently the Haskell and Python compilers use stateful lexers to litter the token list with special tokens so that a context-free parser can make sense of the source.

On the other hand, ExSpirit can do this quite simply, because at any moment you have access to the state and can incorpotate it in you parsing strategy. Itā€™s amazing to be able to define parsers named indent and dedent that detect indents or dedents in the source as if you didnā€™t need the context. Actually, these parsers access the context behind the scenes, but the user can just treat them as if they were normal delimiters :slight_smile: And all of this without messing with the line and column position of your tokens! No need to insert special tokens anywhere.

Iā€™m the first to admit that the documentation is somewhat lacking (I should contribute to it myself) and compile times can be slow for complex grammars, but the source is quite readable once you ignore the optimizations and focus on the essentials.

The main medsage here is that if you ever need to parse something, please do take a look at ExSpirit. You might be surprised.

I reallyvshould folow up ok this comment with a blog post and some documentation PRs to ExSpirit, because thisnlibrary deserves to be more widely known.

4 Likes

Yeah I need to work on that more. ^.^;
(More defrule usage can help it compile faster, though it becomes less optimized with those calls, still great overall though, not the fastest out, but is the fastest for its feature-set)

Thanks much for the kind words!!! :slight_smile:

3 Likes

You should try my suggestion of writing a SlowParser module with the same API but only with defrules and anonymous functions instead of cases. Even if itā€™s just for prototyping

I could probably write a generator that auto-wraps all the calls in a defrule too as an option on the use, hmmā€¦

This seems a little overkillā€¦ Maybe just import the macros and reexport as defrules?

That is basically what it would do, just export them as defrules instead. ^.^

Some more feedback about ExSpirit:

Despite what Iā€™ve once told you in private, I am now against changing the API into something closer to a DSL. (alt([...]) to alt(...), seq([...]) to seq(...) or anything else that used infix operators)

Why my change of heart? Iā€™m now using the context fully (to access state, get line and column numbers, etc). When you need access to the context, pretending youā€™re just doing ā€œnormal PEG stuffā€ is disingenious and useless. Better embrace normal elixir macros and function calls, because thatā€™s what youā€™ll be using with things like pipe_context_around (which by the way is the best parsing primitve ever). The parsing context should still be piped implicitly, of course, no need to litter the parsers woth lambdas and stuff xD

If you want to provide a different API, do so in another module and keep this one working (you ccould compile from the new API from this one), because the current API, even if it is more verbose than it has to be, is easy to understand, easy to use in macros and almost ā€œLispyā€. In fact, itā€™s a little more ā€œtypesafeā€ than Lisp because it uses actual lists as arguments instead of mixing a list of arguments with keyword prameters to disambiguate things.

If youā€™re interested, maybe you could add my indent/dedent/same_indent primitives (as soon as I publish them on github, of course) - which youā€™re probably able to optimize much further, of course, Iā€™m a newby to this whole binary pattern matching on the BEAM. It would make a great ā€œselling pointā€ when comparing to other parsers.

I donā€™t know if youā€™d gain any performance from defining those inside the __using__ macro, and if not I could publish my own package (ex_spirit_indent, for example)

Some ideas for documentation (maybe more suited for a github issue? I can repost it there too):

I think we should copy what the Kernel.SpecialForms does. There should be dummy macro declarations just for documentation purposes, together with a stern warning somewhere in the docs never to import anything from that module under the pain of accidentally summoning Cthulhu from the elder dimensions beyond the endless void where the monotonic time does not exist, the order of events is not preserved and all scream in pain and are forced into using javascript for ever and ever.

There should be documentation on how to write macros that expand into ExSpirit parsers. This is useful not only to be able to reuse parsers among packages, but also to have parametric parsers in which the arguments are known at compile time. In the Makeup package, this cuts the lines of my ElixirLexer in half or more (and more important, Iā€™d have quit in frustration if I had had to copy-paste everything 8-10 times(!) changing some constants in 6 places).

This does pose a problem, though: writing your own macros requires being aware of the context_ast parameter thatā€™s being passed around. If we put it in the dummy parsersā€™ type signatures, weā€™ll have to explain that the comtext is passed around. If we donā€™t we will have to explain this anyway when documenting how to write macros.

I vote on having the context_ast parameter and explaining why itā€™s there.

The aim of all of this is a little bit selfish, of course. If my ex_doc_makeup package (not on hex yet) is meant to have any significant degree of adoption, or even incorporated into ex_doc, people have to trust ExSpirit and understand how it works :stuck_out_tongue: If I want people to contribute lexers to makeup, I also need people to be able to write their own ExSpirit parsers without diving into the source xD

Lol, yeah I use it quite a bit as well. ^.^

Plus I use defrule blah(context, ...) quite a lot actuallyā€¦ ^.^;

Yeah the main one will keep the same API for sure, Iā€™m using it in too many things to ever change it too much at this point. :slight_smile:

Thatā€™s be a great primitive to add that could be quite generically useful. :slight_smile:

If yourā€™s are macroā€™s then no performance difference at all. If inline functions then youā€™d want to ā€˜useā€™ them into the main context so as to not incur the cross-module-call speed hit (which is not too big an issue unless in a tight loop, which some parsing things definitely are).

Yeah the context struct I do want to get fully documented. It is the first-class full-out plug for the parser and any non-trivial parser will inevitably touching it (easy to do anyway).

Lol, true true!

I can send a PR with partial documentation, as I donā€™t understand the parse system well yet. Youā€™d have to complete the parts I leave blank.

1 Like

You could even leave in a marker like <TODO> or something easily greppable like that. ^.^