For your case, per my comment you linked, I would go with Leex and Yecc. From your description, what you want to parse will have a rigid and well-defined grammar so Leex and Yecc should give you a nice descrptive and concise result.
First of all, thank you for sharing these relevant resources!
I also worked in a similar “toy” web project. However, since I had a client solution in Javascript already working, I kept it even in the followup Liveview version as a hook.
This solution employs nearley js on the client to transform the query to (typically, a plurality of) ASTs in JSON-like format, which I handle at the server and issue Elasticsearch queries via snap.
However, if your project is not web-based, then this would not help you at all.
I am not arguing of course that a javascript solution is preferable over an elixir one, for me the decisive factor was that I already had implemented it and just reused it.
Ah well, it’s about optimising Elasticsearch results. Basically it needs to do some smart tokenization and convert multi words, e.g. Good restaurant into a combo of Good, restaurant and Good restaurant but combine them in different ways depending on followup boolean expressions. So I got that part already working. Once I know the exact operators I can build expected combinations. But, now I need to take care to respect (nested) parenthesis and quoted expressions.
Here’s a few examples:
SAP ABAP OR SAP S/4 HANA should yield ~s{(("SAP"|"ABAP")|"SAP ABAP")|(("SAP"|"S/4"|"HANA")|"SAP S/4 HANA")}
SAP BI AND Java NOT SQL OR Ruby AND NOT Consulting OR Administration results in ~s{(("SAP"+"BI")|"SAP BI")+"Java"-"SQL"|"Ruby"+-"Consulting"|"Administration"}
SAP BI AND Java NOT SQL NOT ( Ruby AND Consulting OR Administration) AND Power BI becomes ~s{(("SAP"+"BI")|"SAP BI")+"Java" -"SQL" -("Ruby"+"Consulting"|"Administration")+(("Power"+"BI")|"Power BI"}
And there are more contrived examples plus additional permutations that derive from above. Nothing fancy but I could use real grammar vs trying to scan input to generate the tree I can then parse (logic already in place).
Hmm, idea has merit definitely. Since there is already more than one solution for JS and we indeed use it for frontend/web part it would actually be simple to just move this to client side. However, we need to support other API clients as well and so it does not make sense to enforce this at client side. Thanks for the links!
Any one of your proposed solutions should work well and require a few dozen lines to tokenize input into an AST ready to be manipulated into ES query syntax.
I would definitely formalize the yet-unstated requirements of how you need to handle parse errors before proceeding to make a choice. Many of these tools produce nice AST on valid input but are ill-suited to providing rich feedback about what went wrong, where, on invalid input.
Do you need to tell users where they are missing a parenthesis, or that something must follow a NOT before a closing paren, and how well do these libraries let you do that? Is there a strategy you can implement for silently ignoring grammar errors in part of the input but accepting the rest, and which of these libraries give you tools to do that? etc.
You can get quite a long way just using the Elixir lexer. Especially now that in Elixir 1.16 its lexing errors are better for user experience.
Example
iex> Code.string_to_quoted "(\"Modesty Blase\" or \"Willie Garvin\")"
{:ok, {:or, [line: 1], ["Modesty Blase", "Willie Garvin"]}}
iex> Code.string_to_quoted "a or (b and c)"
{:ok,
{:or, [line: 1],
[
{:a, [line: 1], nil},
{:and, [line: 1], [{:b, [line: 1], nil}, {:c, [line: 1], nil}]}
]}}
iex> Code.string_to_quoted "a or (b and c"
{:error,
{[opening_delimiter: :"(", line: 1, column: 14],
"missing terminator: ) (for \"(\" starting at line 1)", ""}}
You would then need a parser to ensure you have a valid expression for your query language but I think thats really quite easy compared to the lexical part given your reasonably straight forward requirements.
This idea to just implement binary pattern matching got me thinking. I couldn’t find a way to actually do that. But, during the study of the issue, decided to go with simple scan approach. E.g. just go over each character, split on spaces and count parenthesis and quotes. I have the working solution at Boolean query lexer (not AST unfortunately) · GitHub if you care to take a look. Not simple though and likely hard to debug too. But this bought time so I can try and implement this via some other mechanism.
Good point about silently ignoring grammar errors! Haven’t considered that one and it does make sense. E.g. given search for Modesty Blase OR Willie Garvin AND it makes sense to drop the last AND since it does not serve any purpose.
Would love to formalize requirements but unfortunately this will go the “let’s see how it works and then we’ll improve” way. Not that this is bad, just that it does not yield to formalization well
So for the time being I used a plain approach, scanning the input. See the gist link in reply above. This solution does that, ignores certain input errors (drops tailing operators, uses only the last operator if more than one are supplied, …). It does not say at which character e.g. unclosed parenthesis started, but it should be fairly simple to add that too (it is scanning so counting should not be an issue).
I do plan to try some/all of those solutions and see if I can get a more maintainable solution after all. Hope it will be possible to also ignore certain “invalid” input and have nice enough error reporting.
Yep, replacing would work for operators. Mind you, this is user input so we could expect also things like aNd, And etc. and need to support another language too. But it could be done.
Unfortunately, apart from quoted terms, it needs to support non-quoted multi-word ones, e.g.:
This case breaks it because those terms are indeed user input and not part of any dictionary, so have to take it in and parse as-is (no transformation/downcasing).
If strings can be unquoted then it will become ambiguous grammer right? For example how should FOO AND BAR be parsed? It can be both a boolean expression and a single multi word string.