Recommendation for building search parser

Need to build a “natural” language search query parser, for later conversion to appropriate Elasticsearch query.

Basically need something like GitHub - Financial-Times/n-search-parser: 🔎 A sane, fast, not too smart, search expression parser., which can do:

  • Conjunction operators: AND, OR, NOT
  • Quoted phrases: "Modesty Blase"
  • Grouping with parentheses: ("Modesty Blase" OR "Willie Garvin")

I’m trying to decide which approach to take, e.g.:

As per Best way to build a parser - #3 by david_ex it would seem that Nimble Parsec is the best candidate. There is even Pegasus — pegasus v0.2.4 which could generate Nimble Parsec parsers (can’t find any boolean search PEG definitions out there though!).

But, as far as I can tell, the Leex and Yecc solution would be more concise.

1 Like

Sounds like an interesting project! Can you give a few examples of what should be parsed?

NimbleParsec is the only one I’ve used and I quite like it.

There is also:

1 Like

For your case, per my comment you linked, I would go with Leex and Yecc. From your description, what you want to parse will have a rigid and well-defined grammar so Leex and Yecc should give you a nice descrptive and concise result.

First of all, thank you for sharing these relevant resources!
I also worked in a similar “toy” web project. However, since I had a client solution in Javascript already working, I kept it even in the followup Liveview version as a hook.
This solution employs nearley js on the client to transform the query to (typically, a plurality of) ASTs in JSON-like format, which I handle at the server and issue Elasticsearch queries via snap.
However, if your project is not web-based, then this would not help you at all.
I am not arguing of course that a javascript solution is preferable over an elixir one, for me the decisive factor was that I already had implemented it and just reused it.

Ah well, it’s about optimising Elasticsearch results. Basically it needs to do some smart tokenization and convert multi words, e.g. Good restaurant into a combo of Good, restaurant and Good restaurant but combine them in different ways depending on followup boolean expressions. So I got that part already working. Once I know the exact operators I can build expected combinations. But, now I need to take care to respect (nested) parenthesis and quoted expressions.

Here’s a few examples:

  • SAP ABAP OR SAP S/4 HANA should yield ~s{(("SAP"|"ABAP")|"SAP ABAP")|(("SAP"|"S/4"|"HANA")|"SAP S/4 HANA")}
  • SAP BI AND Java NOT SQL OR Ruby AND NOT Consulting OR Administration results in ~s{(("SAP"+"BI")|"SAP BI")+"Java"-"SQL"|"Ruby"+-"Consulting"|"Administration"}
  • SAP BI AND Java NOT SQL NOT ( Ruby AND Consulting OR Administration) AND Power BI becomes ~s{(("SAP"+"BI")|"SAP BI")+"Java" -"SQL" -("Ruby"+"Consulting"|"Administration")+(("Power"+"BI")|"Power BI"}

And there are more contrived examples plus additional permutations that derive from above. Nothing fancy but I could use real grammar vs trying to scan input to generate the tree I can then parse (logic already in place).

1 Like

Yeah, saw Pegasus project too. Would be nice if there was a PEG definition of boolean search grammar to reuse here :+1:

Yeah, that is what I am leaning towards too. Thanks!

Hmm, idea has merit definitely. Since there is already more than one solution for JS and we indeed use it for frontend/web part it would actually be simple to just move this to client side. However, we need to support other API clients as well and so it does not make sense to enforce this at client side. Thanks for the links!

Based on your stated requirements

Any one of your proposed solutions should work well and require a few dozen lines to tokenize input into an AST ready to be manipulated into ES query syntax.

I would definitely formalize the yet-unstated requirements of how you need to handle parse errors before proceeding to make a choice. Many of these tools produce nice AST on valid input but are ill-suited to providing rich feedback about what went wrong, where, on invalid input.

Do you need to tell users where they are missing a parenthesis, or that something must follow a NOT before a closing paren, and how well do these libraries let you do that? Is there a strategy you can implement for silently ignoring grammar errors in part of the input but accepting the rest, and which of these libraries give you tools to do that? etc.

2 Likes

Do you need a general purpose parser? Or can you implement what you need with binary pattern matching yourself?

You can get quite a long way just using the Elixir lexer. Especially now that in Elixir 1.16 its lexing errors are better for user experience.

Example

iex> Code.string_to_quoted "(\"Modesty Blase\" or \"Willie Garvin\")"
{:ok, {:or, [line: 1], ["Modesty Blase", "Willie Garvin"]}}

iex> Code.string_to_quoted "a or (b and c)"
{:ok,
 {:or, [line: 1],
  [
    {:a, [line: 1], nil},
    {:and, [line: 1], [{:b, [line: 1], nil}, {:c, [line: 1], nil}]}
  ]}}

iex> Code.string_to_quoted "a or (b and c"
{:error,
 {[opening_delimiter: :"(", line: 1, column: 14],
  "missing terminator: ) (for \"(\" starting at line 1)", ""}}

You would then need a parser to ensure you have a valid expression for your query language but I think thats really quite easy compared to the lexical part given your reasonably straight forward requirements.

5 Likes

Hmmm, interesting idea! Unfortunately, it does not solve certain edge cases out of the box, e.g.:

iex> Code.string_to_quoted(~s{"Modesty Blaze" AND "Willie Garvin"})
{:error, {[line: 1, column: 17], "syntax error before: ", "'AND'"}}

Didn’t have time to inspect further since I ended up with a solution that just scans the input, see next answers.

This idea to just implement binary pattern matching got me thinking. I couldn’t find a way to actually do that. But, during the study of the issue, decided to go with simple scan approach. E.g. just go over each character, split on spaces and count parenthesis and quotes. I have the working solution at Boolean query lexer (not AST unfortunately) · GitHub if you care to take a look. Not simple though and likely hard to debug too. But this bought time so I can try and implement this via some other mechanism.

Good point about silently ignoring grammar errors! Haven’t considered that one and it does make sense. E.g. given search for Modesty Blase OR Willie Garvin AND it makes sense to drop the last AND since it does not serve any purpose.

Would love to formalize requirements but unfortunately this will go the “let’s see how it works and then we’ll improve” way. Not that this is bad, just that it does not yield to formalization well :smile:

So for the time being I used a plain approach, scanning the input. See the gist link in reply above. This solution does that, ignores certain input errors (drops tailing operators, uses only the last operator if more than one are supplied, …). It does not say at which character e.g. unclosed parenthesis started, but it should be fairly simple to add that too (it is scanning so counting should not be an issue).

I do plan to try some/all of those solutions and see if I can get a more maintainable solution after all. Hope it will be possible to also ignore certain “invalid” input and have nice enough error reporting.

Thanks for the idea! :bowing_man:

3 Likes

That’s because the AND is uppercased, might be a simple fix with a String.replace ?

Yep, replacing would work for operators. Mind you, this is user input so we could expect also things like aNd, And etc. and need to support another language too. But it could be done.

Unfortunately, apart from quoted terms, it needs to support non-quoted multi-word ones, e.g.:

iex(1)> Code.string_to_quoted(~s{SAP ABAP})
{:error, {[line: 1, column: 5], "syntax error before: ", "'ABAP'"}}

This case breaks it because those terms are indeed user input and not part of any dictionary, so have to take it in and parse as-is (no transformation/downcasing).

If strings can be unquoted then it will become ambiguous grammer right? For example how should FOO AND BAR be parsed? It can be both a boolean expression and a single multi word string.

@elvanja You can take a look at what I have done here https://github.com/themusicman/predicated/blob/main/lib/predicated/query/parser.ex.

Here are some examples in the limited tests: https://github.com/themusicman/predicated/blob/main/test/predicated/query/parser_test.exs

I am stuck at the moment on supporting parans and nesting.

1 Like

Well, I had a break through last night and added support for grouping and nested grouping. Might be some bugs in there but I think it generally works.

1 Like