It might be worthwhile to attempt to switch the order of the parsers, such that extlang
will only be used when there really are only three characters.
I appreciate the suggestion - and Iâve tried all manner of ordering without success. I canât seen a way to avoid needing some kind of forcing backtracking (fail
ing a parser in my post above).
A very quick suggestion is to unfold the ABNF grammar into complete combinators and then refactor it. IIRC ABNF is able of tracking back multiple levels but nimble isnât, so you need to restructure the problem. The math example shows how the solutions end-up a bit different.
The other thing I noticed is that you are excessively relying on parsec
. You should compose on the helpers as much as possible and fallback to combinators only if compilation times are high.
I second @josevalim here. If you use regular combinators, the parser will be measurably faster, although it will be slower to compile. By using parsec
excessively you WILL decrease runtime performance (although it might be worh it if compile times are too high).
Appreciate the feedback @josevalim and @tmbb. Moved everything to helpers except the entry point. Shows about 30% performance improvement (in a non scientific timer test). Iâll try some grammar unrolling this weekend to see if I can work around my original issue.
I have written a small library to make such tests as scientific as possible and as easy as possible. Itâs called schism, and itâs available on Github here (no hex package yet): GitHub - tmbb/schism: A library that makes it easy to have alternative implementation of functions for benchmarks
There is a post discussing it here in the forum: Schism - a library to make benchmarking easy (and heretical!)
Basically, it uses some macros to define alternative versions of code fragments (function definitions, parts of function definitions, etc). It then provides some utilities to make it easier to benchmark the changes with Benchee.
You use it like this:
defmodule YourModule do
# Some code, don't need to touch here
# The code here is the same for both versions
schism "inline vs parsec" do
dogma "parsec" do
# single combinator, coded in such a way that it uses parsec
end
heresy "inline" do
# same combinator, but avoiding parsec
end
end
# some more code..
# The code here is the same for both versions
end
With benchee you can then compare the dogma and the heresy and determine objectively what is faster. This is obviously something you can do yourself, but the beauty of it is that it provides a very simple API with almost no friction to test the impact of small changes in single functions on the performance of the whole application, without having to maintain two copies of the application.
EDIT: are you afraid the two implementations of the function are not compatible? Donât worry, the defsnippet
macro in the Schism package handles that situation pretty well. For what itâs worth, Schism was originally written to benchmark alternative implementations of nimble_parsec parsers.
Thats very cool and I canât believe I missed it when you posted in April. Thanks for the assist! And now back to grammar unrolling âŠ
I ran a benchee example of my old parser and the new one using nimble_parsec
with only one parsec/1
call. The old one used an ABNF parser (and the even older one used leex
and yecc
). Really positive results:
Old version: Average of 163.87 ÎŒs
Nimble_parsec version: Average of 28.40 ÎŒs
I think ~6 times faster was worth the evenings work . Thanks again @josevalim and @tmbb
By the way, did you end up using Schism?
Not this time - the dependencies are different (abnf2 which I maintain and nimble_parsec). But for sure its going to be a standard part of my performance improvement activities from now on.
Note: I initially wrote this as an issue on the repo, but I thought it might be better served by being a forum post, as it turned out quite long, and may invite discussion.
@josevalim, let me first start by saying thanks for making this library. I really like the combination of just creating a data structure describing the parser using functions (thus avoiding excessive macro magic and making things as composable as possible), but thenâinstead of interpreting that structure at runtimeâcompiling it into efficient code.
While this isnât explicitly mentioned anywhere in the project docs, the overall style of the library plus the reference to Parsec suggests that the underlying idea (whether originally intentional or not) is to do PEG-style parsing. Even if itâs not an exact implementation of PEG, it would be reasonable to expect that NimbleParsec should be equal in âparsing powerâ, if you will, to a PEG parserâthat is, admit any languages which a PEG parser does.
If you look, for instance, at Wikipediaâs definition of a PEG, you will see the following combinators mentioned as fundamental operators (you can technically get away with fewer, but it starts to get impractical to implement), and most of these do resemble constructs NimbleParsec has:
-
Sequenceâeither
concat/2
or the first argument of a combinator -
Ordered choiceâ
choice/2
implements this -
Zero-or-moreâ
repeat/2
-
One-or-moreâ
times/3
withmin: 1
However, then you come to these:
-
And-predicate (
&
) -
Not-predicate (
!
)
These are the two predicates which denote positive and negative lookahead. Put simply, &
asserts that the parser it wraps matches, and !
asserts that the parser it wraps does not match. Crucially, they donât consume any input. Say that we call them require
and deny
for the purposes of NimbleParsec. That means that require(string("foo")) |> utf8_string([], min: 3)
would parse in any string starting in âfooâ, and deny(utf8_char([?0..?9])) |> utf8_string([], min: 1)
would parse any string not starting with a digit.
N.B. technically you only need !
, since for any parser p
, !!p == &p
in behaviour, but explicitly implementing &
is likely to be more efficient.
Why is proper lookahead support needed?
I know most requests of the form âwe should implement a new combinator in NimbleParsecâ are met with the response of âthis library is supposed to be barebones, and you should make your own combinatorsâ, as well as âwhat is the particular usecaseâ? Therefore, I aim to show a couple of grammars which I believe NimbleParsec cannot currently parse properly, but which would be useful to support, and how they could be supported if require
and unless
are implemented.
N.B. I am not tied to the names require/2
and deny/2
âthey were the first I though of which seemed to fit in with the existing naming theme.
Keywords and identifiers
Letâs imagine weâre parsing a super simple programming language. For now, we only need to care about keywords (letâs say if
and while
) and identifiers (letâs say any string at all in the range of [a-zA-Z0-9]
).
Great, letâs get started! We go ahead and write some code like this:
defmodule NPL.Helpers do
import NimbleParsec
def keyword(c \\ empty()),
do:
c
|> choice([
string("if") |> replace(:if),
string("while") |> replace(:while)
])
|> tag(:kw)
def identifier(c \\ empty()), do: c |> utf8_string([?a..?z, ?A..?Z, ?0..?9], min: 1) |> tag(:id)
def expr_(),
do:
choice([
keyword(),
identifier()
])
|> optional(string(" ") |> ignore() |> expr())
def expr(c \\ empty()), do: c |> parsec(:expr)
end
defmodule NPL do
import NimbleParsec
import NPL.Helpers
defparsec(
:expr,
expr_()
)
end
Letâs test it:
iex(1)> import NPL
NPL
iex(2)> expr "if"
{:ok, [kw: [:if]], "", %{}, {1, 0}, 2}
iex(3)> expr "foo"
{:ok, [id: ["foo"]], "", %{}, {1, 0}, 3}
iex(4)> expr "foo if"
{:ok, [id: ["foo"], kw: [:if]], "", %{}, {1, 0}, 6}
Seems like it works fine!
iex(5)> expr "iffy"
{:ok, [kw: [:if]], "fy", %{}, {1, 0}, 2}
Uh⊠no, we wanted that to just be identifier. Now we have a keyboard and a failed parse for the rest of the input. Maybe if we switch the order of the choice
around?
def expr_(),
do:
choice([
identifier(),
keyword()
])
|> optional(string(" ") |> ignore() |> expr())
iex(2)> expr "iffy"
{:ok, [id: ["iffy"]], "", %{}, {1, 0}, 4}
iex(3)> expr "if"
{:ok, [id: ["if"]], "", %{}, {1, 0}, 2}
Now identifiers which have a keyword prefix are parsed fine, but we no longer recognise keywords as keywords!
Units
You might say âNimbleParsec was really designed for much simpler parsingâlanguage codes, CSV maybe, simple formats, not full on languages!â Fine, letâs tackle a much simpler problem: parsing measurement units. Letâs define a parser like this:
defmodule NPT.Helpers do
import NimbleParsec
def metric(c \\ empty()),
do:
c
|> choice([
string("mol"),
string("l"),
string("m"),
string("g")
])
|> label("metric base unit")
def datetime(c \\ empty()),
do:
c
|> choice([
string("yr"),
string("mo"),
string("wk"),
string("d"),
string("h")
])
|> label("temporal unit")
def prefix(c \\ empty()),
do:
c
|> choice([
string("m"),
string("k"),
string("c")
])
|> label("metric prefix")
def unit(c \\ empty()),
do:
choice([
datetime() |> tag(:datetime),
metric() |> tag(:metric),
prefix() |> tag(:prefix) |> metric() |> tag(:metric)
])
end
defmodule NPT do
import NimbleParsec
import NPT.Helpers
defparsec(
:parse_unit,
unit()
)
end
iex(1)> parse_unit "m"
{:ok, [metric: ["m"]], "", %{}, {1, 0}, 1}
iex(2)> parse_unit "mol"
{:ok, [datetime: ["mo"]], "l", %{}, {1, 0}, 2}
iex(3)> parse_unit "ml"
{:ok, [metric: ["m"]], "l", %{}, {1, 0}, 1}
Itâs the same issue!
It turns out that PEG parsers (which is how NimbleParsec is acting), in contrast to, for instance, regular expressions, are greedy and donât backtrack. If we have a choice between several underlying parsers, and one of them parses a prefix of the other, we will always have issues. PEG gives us a great deal of control over how exactly parsing takes place, but this also requires us to be very specific where the parser needs to be carefulâthat is, where it needs to check ahead without consuming any input, so it can make a decision on how to consume further input (PEGs are linear, so once the output is consumed, thereâs no going back! In NimbleParsec, once the output is consumed, we have tail-recursed and there is no way to go back!)
Why is lookahead/2
not enough?
This issue has been mentioned on the repo, and as a response, the lookahead/2
function has been implemented. Letâs try fixing our keyword parser using the function, and letâs add a couple more parsecs just for the individual types of objects we have in our grammar, to help with debugging.
Our goal here is to try to say âwhen weâre trying to parse an identifier, first check if the next token up to a space is a keyword. If so, fail the parse (with the expectation that when in a choice block, this would cause the next choice to be taken).â
...
def identifier(c \\ empty()),
do:
c
|> lookahead(:fail_on_keyword)
|> utf8_string([?a..?z, ?A..?Z, ?0..?9], min: 1)
|> tag(:id)
...
def fail_on_keyword("if " <> _, _, _, _), do: {:error, "if found"}
def fail_on_keyword("while " <> _, _, _, _), do: {:error, "if found"}
def fail_on_keyword(_, ctx, _, _), do: {[], ctx}
...
defparsec(
:keyword,
keyword()
)
defparsec(
:identifier,
identifier()
)
Right off the bat, we see one immediate issue with lookahead/2
: itâs not a composable combinator, but rather some sort of halfway measure, where we need to define a separate function for each little thing we want to check. We also need to ensure this extra function is accessible in the module which defines the parsec! We canât just easily re-use our existing combinators, weâd need to have a separate function call and explicitly call another parsec within it, and return :error
or :ok
based on whether it succeeds. Thatâs quite unergonomic, but perhaps could be solved to some degree with a quoted_lookahead
. Nevertheless, letâs try our new solution out:
iex(2)> keyword "if"
{:ok, [kw: [:if]], "", %{}, {1, 0}, 2}
iex(3)> identifier "iffy"
{:ok, [id: ["iffy"]], "", %{}, {1, 0}, 4}
iex(4)> identifier "if "
{:error, "if found", "if ", %{}, {1, 0}, 0}
Seems like our individual parsers are working ok. Now, letâs try their combination:
iex(6)> expr "if foo"
{:error, "if found", "if foo", %{}, {1, 0}, 0}
And no, this doesnât work. Even if you ignore the ergonomic issues around lookahead
, its fundamental shortcoming is the fact that it doesnât composeâwhen it fails inside a choice
or repeat
block, the failure leaks to the top level of the parser, instead of causing the parent parser to stop looking at that branch and continue.
Proposed solution
Revisiting the keyword example one more time, I think that you should be able to write something like this:
def id_char(c \\ empty()), do: c |> utf8_char([?a..?z, ?A..?Z, ?0..?9])
def keyword(c \\ empty()),
do:
c
|> choice([
string("if") |> replace(:if),
string("while") |> replace(:while)
])
|> deny(id_char())
|> tag(:kw)
# ignore things like actually joining this into a string
def identifier(c \\ empty()),
do:
c
|> times(id_char(), min: 1)
|> tag(:id)
In fact, adding deny
(and having repeat
handle it properly) would allow repeat_until
to be removed, since it could be written like this:
def repeat_until(c \\ empty(), to_repeat, until),
do:
c
|> repeat(
deny(until())
|> to_repeat()
)
As for implementation, it seems to me reading through the compiler code that failed parsers are already specially handled in some cases, but not all (Iâm referring to the catch_all
piece of config which is passed around. I believe that for all parsers (bound and unbound), both repeat
and choice
should respect the failure or success of them. Implementing require
and deny
will require making a stack frame at the point they are invoked (since we need to be able to get back to the state of the binary, accumulator etc right before we called them), which is unavoidable, but we still retain great efficiency and tail recursion for the majority of the code (we only make extra stack frames in specific places, unlike a full backtracking parser, which needs to keep the full stack around).
Please let me know what you think about thisâIâm pretty certain that what I described cannot currently be achieved generally with NimbleParsec, but if I misunderstood how it works somewhere, please let me know. I think implementing this feature would greatly aid the library on its way to be a powerful set of primitives with which fully-featured parsers can be built.
Side note: failures vs errors
While not necessary for more parsing power, it would be useful to distinguish between failures (this parser failed, but it itâs in a choice
or repeat
then try the next choice / stop the repeat) and errors (something unrecoverable was seen, fail right away and show the user an error) in order to allow better error messages to be reported back to the users. Of course, this would need to come after all errors are actually handled consistently as proposed above, and is probably a separate discussion. This paper gives a good overview of how some other parsers handle this and what advantages it has.
I agree, and although NimbleParsec has replaced a little bit of my ExSpirit library usages, but I still use ExSpirit on occasion just because NimbleParsec canât handle some things as well (especially around state information).
The issues in your post are for NimbleParsec supposed to be done by a secondary pass, itâs not a single-pass parsing library (which loses a lot of itâs capabilities).
Iâm sure Jose would be up for accepting PRâs to integrate the features that my ExSpirit library has, the state handling capabilities would be a HUGE boon for me and I could drop ExSpirit if itâs features were ported over.
For note, that same example in ExSpirit (give or take, should actually be parsing out whatever comes after if
/while
):
iex(1)> defmodule Testing do
...(1)> use ExSpirit.Parser, text: true
...(1)>
...(1)> defrule identifier(tag(:id, chars([?a..?z, ?A..?Z, ?0..?9], 1)))
...(1)>
...(1)> defrule keywords(context) do
...(1)> symbols_ =
...(1)> ExSpirit.TreeMap.new()
...(1)> |> ExSpirit.TreeMap.add_text("if", :if) # Value can be a function to return a parser
...(1)> |> ExSpirit.TreeMap.add_text("while", :while) # Value can be a function to return a parser
...(1)> context |> symbols(symbols_) |> lookahead_not(identifier()) # Should actually parse the if/while instead...
...(1)> end
...(1)>
...(1)> defrule expr(alt([
...(1)> keywords(),
...(1)> identifier(),
...(1)> ]))
...(1)> end
{:module, Testing,
<<70, 79, 82, 49, 0, 0, 127, 188, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 3, 7,
0, 0, 0, 84, 14, 69, 108, 105, 120, 105, 114, 46, 84, 101, 115, 116, 105,
110, 103, 8, 95, 95, 105, 110, 102, 111, 95, ...>>, {:expr, 1}}
iex(2)> import ExSpirit.Parser
ExSpirit.Parser
iex(3)> parse("if", Testing.identifier)
%ExSpirit.Parser.Context{
column: 3,
error: nil,
filename: "<unknown>",
line: 1,
position: 2,
rest: "",
result: {:id, "if"},
rulestack: [],
skipper: nil,
state: %{},
userdata: nil
}
iex(4)> parse("if", Testing.identifier).result
{:id, "if"}
iex(5)> parse("iffy", Testing.identifier).result
{:id, "iffy"}
iex(6)> parse("if", Testing.expr).result
:if
iex(7)> parse("iffy", Testing.expr).result
{:id, "iffy"}
I should probably just finish my ExSpirit.Parserx
rewrite (which started just before NimbleParsec came out) as it is faster.
Thank you @mjadczak. This was a very interesting and entertaining read!
NimbleParsec was never meant to be a complete parsec but it doesnât mean it canât be one as long as it fits within its goals.
You are correct that today lookahead can only âerrorâ and I would probably keep the error semantics. It is also worth saying that lookahead is implemented on top of traverse
. Maybe we can introduce {:deny, ...}
to traverse, at least as a starting point, and that should make certain designs possible.
Here is a diff that allows just that:
diff --git a/lib/nimble_parsec/compiler.ex b/lib/nimble_parsec/compiler.ex
index 842ab6f..c0c0233 100644
--- a/lib/nimble_parsec/compiler.ex
+++ b/lib/nimble_parsec/compiler.ex
@@ -277,12 +277,24 @@ defmodule NimbleParsec.Compiler do
quote(do: unquote(next)(rest, acc, stack, context, line, offset))
end
- defp traverse(traversal, next, rest, user_acc, context, line, offset, _) do
+ defp traverse(traversal, next, rest, user_acc, context, line, offset, config) do
case apply_traverse(traversal, rest, user_acc, context, line, offset) do
- {user_acc, ^context} when user_acc != :error ->
+ {user_acc, ^context} when not is_atom(user_acc) ->
quote(do: unquote(next)(rest, unquote(user_acc) ++ acc, stack, context, line, offset))
quoted ->
+ deny =
+ case config do
+ %{catch_all: nil} ->
+ quote do: {:error, reason, rest, context, line, offset}
+
+ %{catch_all: next, acc_depth: n} ->
+ {_, _, _, body} = build_proxy_to(:unused, next, n)
+ body
+ end
+
quote do
case unquote(quoted) do
{user_acc, context} when is_list(user_acc) ->
@@ -290,6 +302,9 @@ defmodule NimbleParsec.Compiler do
{:error, reason} ->
{:error, reason, rest, context, line, offset}
+
+ {:deny, reason} ->
+ unquote(deny)
end
end
end
@@ -299,7 +314,8 @@ defmodule NimbleParsec.Compiler do
apply_traverse(Enum.reverse(mfargs), rest, {acc, context}, line, offset)
end
- defp apply_traverse([mfargs | tail], rest, {acc, context}, line, offset) when acc != :error do
+ defp apply_traverse([mfargs | tail], rest, {acc, context}, line, offset)
+ when not is_atom(acc) do
acc_context = apply_mfa(mfargs, [rest, acc, context, line, offset])
apply_traverse(tail, rest, acc_context, line, offset)
end
@@ -674,7 +690,7 @@ defmodule NimbleParsec.Compiler do
{traverse_line, traverse_offset} = pre_post_traverse(kind, pre_metadata, post_metadata)
case apply_traverse(mfargs, rest, outputs, context, traverse_line, traverse_offset) do
- {outputs, ^context} when outputs != :error ->
+ {outputs, ^context} when not is_atom(outputs) ->
{:ok, inputs, guards, outputs, post_metadata}
_ ->
The diff above will allow you to return {:deny, "not an identifier"}
from lookahead/traverse. The reason
is only used fi you call deny outside of something you cannot escape. Once I do it, this works as expected:
defmodule NPL.Helpers do
import NimbleParsec
def keyword(c \\ empty()),
do:
c
|> choice([
string("if") |> replace(:if),
string("while") |> replace(:while)
])
|> lookahead(:fail_on_identifier)
|> tag(:kw)
def identifier(c \\ empty()),
do:
c
|> utf8_string([?a..?z, ?A..?Z, ?0..?9], min: 1)
|> tag(:id)
def fail_on_identifier(<<a, _::binary>>, _, _, _) when a in ?a..?z, do: {:deny, "identifier"}
def fail_on_identifier(_, ctx, _, _), do: {[], ctx}
def expr_(),
do:
choice([
keyword(),
identifier()
])
|> optional(string(" ") |> ignore() |> expr())
def expr(c \\ empty()), do: c |> parsec(:expr)
end
defmodule NPL do
import NimbleParsec
import NPL.Helpers
defparsec(
:expr,
expr_()
)
end
IO.inspect NPL.expr "iffy"
I agree there are still âlogisticsâ issues. Having to write lookahead for everything is not nice but it should provide a starting point so we can generalize it into a proper combinator.
Iâm not certain that you could do this at all with the (as of writing) NimbleParsec, even with multiple passes. Do you have a quick example of how you could do it in two passes? It seems to me that things like supporting state (which are really just âwhat kind of side computation can we attach to the parsing logicâ) are different from lookahead (which is a fundamental block of the parser). ExSpirit may have both, but the fundamental difference here is that you have lookahead_not
IMO.
Thanks! Iâm going to play around based off that diff, but good to see it can fit somewhat nicely in the existing infrastucture.
I do feel quite strongly that lookahead should be a fundamental combinator rather than a âcall to user codeâ function. It seems like even with your patch, in order to use a pre-existing parser (and I mean just a value or a function, not a parsec
/combinator
you would have to wrap that parser into a defparsec
so that you could call it from the auxiliary traverse
function.
I also think that having this :deny
form be a special case just for this lookahead usecase also hurts composability, which I understood was a core goal of the libraryâit would be nice if I could take any parser at all, which returned an :error
when run by itself, and stick it into :deny
, turning it into a success. Once again Iâll reiterate that from a CS theory point of view, this is a fundamental combinator in a strict sense of âthere are grammars you cannot parse without itâ, which is not the case even for things like repeat
!
Thanks again though, Iâm looking forward to playing around some more with this idea and seeing if I can come up with a nice way to integrate it into the existing structure.
You could be treating it basically as a tokanizer, which does relegate itâs functionality a bit but thatâs mostly how itâs been built to beâŠ
Yep, that is just the biggest feature that I routinely use from ExSpirit that NimbleParsec doesnât have, I can mostly work around the rest, mostly (mine contains a lot more result information that I use quite a bit that NimbleParsec just doesnât have).
Actually everything mine has is built it from a base set of things, and lookahead
/lookahead_not
you can implement entirely in user-code in ExSpirit. NimbleParsec generates code in comparison so what it can do is limited to what itâs code generators know about. Perhaps with some good injection points then people could write their own generatorsâŠ?
That all comes back to how NimbleParsec generates itâs fast code. PRâs can extend it though!
Despite being the instigator of the lookadhead/2
combinator, I have to say I was unaware of how it had been implemented. Like you, I donât like the fact that this combinator doesnât take a parser as argument. This is very important information for my future uses of NimbleParsec.
I think that NimbleParsec should be very clear on what it supports and what it doesnât support. That is important for users to decide whether the library fits their needs. Maybe someone with experience with grammars could spell out how powerful NimbleParsec is exactly? As weâve seen, itâs not a PEG parser, but it might be something slightly weaker but still very powerful.
I very disagree very strongly with this idea. For a parser, itâs very important to decide ahead of time what itâs capabilities are. Adding features by accretion doesnât seem like a good way to do this. (IMO) You shouldnât add features one by one. You should pick a parsing style (Regex, PEG, Context-sensitive [spirit-like]) and then implement it. In stages, if necessary, but with the model picked ahead of time.
Itâs totally OK if NimbleParsec doesnât become any more powerful than it is because that would compromise the main design goal (which is to be crazy fast), but the capabilities of a possible new version should be decided as soon as possible.
Look at the elixir lexer in my makeup project. I parse everything as a keyword and then in a second pass highlight some keywords in a different way. But the initial step treats âifâ and âwhileâ just like âaâ and âbâ. On my phone so no links, sorry.
Wait what? Iâve not seen it yet in NimbleParsec, how does it work without a combinator to give it a success/fail result?! o.O
You have to supply a user-defined function (which needs to be a real def
in the module) which takes the binary and has to return a success or error.