Recommendation for building search parser

elvanja · November 17, 2023, 10:21pm

Need to build a “natural” language search query parser, for later conversion to appropriate Elasticsearch query.

Basically need something like GitHub - Financial-Times/n-search-parser: 🔎 A sane, fast, not too smart, search expression parser., which can do:

Conjunction operators: AND, OR, NOT
Quoted phrases: "Modesty Blase"
Grouping with parentheses: ("Modesty Blase" OR "Willie Garvin")

I’m trying to decide which approach to take, e.g.:

As per Best way to build a parser - #3 by david_ex it would seem that Nimble Parsec is the best candidate. There is even Pegasus — pegasus v0.2.4 which could generate Nimble Parsec parsers (can’t find any boolean search PEG definitions out there though!).

But, as far as I can tell, the Leex and Yecc solution would be more concise.

dimitarvp · November 18, 2023, 12:07am

Sounds like an interesting project! Can you give a few examples of what should be parsed?

cmo · November 18, 2023, 2:00am

NimbleParsec is the only one I’ve used and I quite like it.

There is also:

david_ex · November 18, 2023, 8:31am

For your case, per my comment you linked, I would go with Leex and Yecc. From your description, what you want to parse will have a rigid and well-defined grammar so Leex and Yecc should give you a nice descrptive and concise result.

bdarla · November 18, 2023, 9:08am

First of all, thank you for sharing these relevant resources!
I also worked in a similar “toy” web project. However, since I had a client solution in Javascript already working, I kept it even in the followup Liveview version as a hook.
This solution employs nearley js on the client to transform the query to (typically, a plurality of) ASTs in JSON-like format, which I handle at the server and issue Elasticsearch queries via snap.
However, if your project is not web-based, then this would not help you at all.
I am not arguing of course that a javascript solution is preferable over an elixir one, for me the decisive factor was that I already had implemented it and just reused it.

elvanja · November 18, 2023, 6:36pm

Ah well, it’s about optimising Elasticsearch results. Basically it needs to do some smart tokenization and convert multi words, e.g. Good restaurant into a combo of Good, restaurant and Good restaurant but combine them in different ways depending on followup boolean expressions. So I got that part already working. Once I know the exact operators I can build expected combinations. But, now I need to take care to respect (nested) parenthesis and quoted expressions.

Here’s a few examples:

SAP ABAP OR SAP S/4 HANA should yield ~s{(("SAP"|"ABAP")|"SAP ABAP")|(("SAP"|"S/4"|"HANA")|"SAP S/4 HANA")}
SAP BI AND Java NOT SQL OR Ruby AND NOT Consulting OR Administration results in ~s{(("SAP"+"BI")|"SAP BI")+"Java"-"SQL"|"Ruby"+-"Consulting"|"Administration"}
SAP BI AND Java NOT SQL NOT ( Ruby AND Consulting OR Administration) AND Power BI becomes ~s{(("SAP"+"BI")|"SAP BI")+"Java" -"SQL" -("Ruby"+"Consulting"|"Administration")+(("Power"+"BI")|"Power BI"}

And there are more contrived examples plus additional permutations that derive from above. Nothing fancy but I could use real grammar vs trying to scan input to generate the tree I can then parse (logic already in place).

elvanja · November 18, 2023, 6:37pm

Yeah, saw Pegasus project too. Would be nice if there was a PEG definition of boolean search grammar to reuse here

elvanja · November 18, 2023, 6:38pm

Yeah, that is what I am leaning towards too. Thanks!

elvanja · November 18, 2023, 6:40pm

Hmm, idea has merit definitely. Since there is already more than one solution for JS and we indeed use it for frontend/web part it would actually be simple to just move this to client side. However, we need to support other API clients as well and so it does not make sense to enforce this at client side. Thanks for the links!

christhekeele · November 19, 2023, 10:35pm

Based on your stated requirements

Any one of your proposed solutions should work well and require a few dozen lines to tokenize input into an AST ready to be manipulated into ES query syntax.

I would definitely formalize the yet-unstated requirements of how you need to handle parse errors before proceeding to make a choice. Many of these tools produce nice AST on valid input but are ill-suited to providing rich feedback about what went wrong, where, on invalid input.

Do you need to tell users where they are missing a parenthesis, or that something must follow a NOT before a closing paren, and how well do these libraries let you do that? Is there a strategy you can implement for silently ignoring grammar errors in part of the input but accepting the rest, and which of these libraries give you tools to do that? etc.

Adzz · November 20, 2023, 2:05am

Do you need a general purpose parser? Or can you implement what you need with binary pattern matching yourself?

kip · November 20, 2023, 4:31am

You can get quite a long way just using the Elixir lexer. Especially now that in Elixir 1.16 its lexing errors are better for user experience.

Example

iex> Code.string_to_quoted "(\"Modesty Blase\" or \"Willie Garvin\")"
{:ok, {:or, [line: 1], ["Modesty Blase", "Willie Garvin"]}}

iex> Code.string_to_quoted "a or (b and c)"
{:ok,
 {:or, [line: 1],
  [
    {:a, [line: 1], nil},
    {:and, [line: 1], [{:b, [line: 1], nil}, {:c, [line: 1], nil}]}
  ]}}

iex> Code.string_to_quoted "a or (b and c"
{:error,
 {[opening_delimiter: :"(", line: 1, column: 14],
  "missing terminator: ) (for \"(\" starting at line 1)", ""}}

You would then need a parser to ensure you have a valid expression for your query language but I think thats really quite easy compared to the lexical part given your reasonably straight forward requirements.

elvanja · November 27, 2023, 2:36pm

Hmmm, interesting idea! Unfortunately, it does not solve certain edge cases out of the box, e.g.:

iex> Code.string_to_quoted(~s{"Modesty Blaze" AND "Willie Garvin"})
{:error, {[line: 1, column: 17], "syntax error before: ", "'AND'"}}

Didn’t have time to inspect further since I ended up with a solution that just scans the input, see next answers.

elvanja · November 27, 2023, 2:52pm

This idea to just implement binary pattern matching got me thinking. I couldn’t find a way to actually do that. But, during the study of the issue, decided to go with simple scan approach. E.g. just go over each character, split on spaces and count parenthesis and quotes. I have the working solution at Boolean query lexer (not AST unfortunately) · GitHub if you care to take a look. Not simple though and likely hard to debug too. But this bought time so I can try and implement this via some other mechanism.

elvanja · November 27, 2023, 2:58pm

Good point about silently ignoring grammar errors! Haven’t considered that one and it does make sense. E.g. given search for Modesty Blase OR Willie Garvin AND it makes sense to drop the last AND since it does not serve any purpose.

Would love to formalize requirements but unfortunately this will go the “let’s see how it works and then we’ll improve” way. Not that this is bad, just that it does not yield to formalization well

So for the time being I used a plain approach, scanning the input. See the gist link in reply above. This solution does that, ignores certain input errors (drops tailing operators, uses only the last operator if more than one are supplied, …). It does not say at which character e.g. unclosed parenthesis started, but it should be fairly simple to add that too (it is scanning so counting should not be an issue).

I do plan to try some/all of those solutions and see if I can get a more maintainable solution after all. Hope it will be possible to also ignore certain “invalid” input and have nice enough error reporting.

Thanks for the idea!

Adzz · November 27, 2023, 6:25pm

That’s because the AND is uppercased, might be a simple fix with a String.replace ?

elvanja · November 30, 2023, 7:46am

Yep, replacing would work for operators. Mind you, this is user input so we could expect also things like aNd, And etc. and need to support another language too. But it could be done.

Unfortunately, apart from quoted terms, it needs to support non-quoted multi-word ones, e.g.:

iex(1)> Code.string_to_quoted(~s{SAP ABAP})
{:error, {[line: 1, column: 5], "syntax error before: ", "'ABAP'"}}

This case breaks it because those terms are indeed user input and not part of any dictionary, so have to take it in and parse as-is (no transformation/downcasing).

akash-akya · November 30, 2023, 10:16am

If strings can be unquoted then it will become ambiguous grammer right? For example how should FOO AND BAR be parsed? It can be both a boolean expression and a single multi word string.

th3mus1cman · December 4, 2023, 12:54am

@elvanja You can take a look at what I have done here https://github.com/themusicman/predicated/blob/main/lib/predicated/query/parser.ex.

Here are some examples in the limited tests: https://github.com/themusicman/predicated/blob/main/test/predicated/query/parser_test.exs

I am stuck at the moment on supporting parans and nesting.

th3mus1cman · December 4, 2023, 1:07pm

Well, I had a break through last night and added support for grouping and nested grouping. Might be some bugs in there but I think it generally works.