Fuzzy search and return the rows with the computed distance score

keslert · November 12, 2021, 10:52pm

I’m using levenshtein_less_equal to do a fuzzy search for some tables. It’s working well, but I’d like to also return the levenshtein distance (using a select_merge?) so that I can compare the results against a different table’s results.

@threshold 5
def fuzzy_slug_query(query, slug) do
  slug_length = String.length(slug) + 1

  from(q in query,
    where:
      fragment(
        "levenshtein_less_equal(SUBSTRING(?, 0, ?), ?, 1, 1, 2, ?)",
        q.slug,
        ^slug_length,
        ^slug,
        @threshold
      ) < 5,
    or_where: ilike(q.slug, ^"%#{slug}%"),
    order_by: [
      fragment(
        "levenshtein_less_equal(SUBSTRING(?, 0, ?), ?, 1, 1, 2, ?)",
        q.slug,
        ^slug_length,
        ^slug,
        @threshold
      ),
      fragment("LENGTH(?)", q.slug)
    ]
  )
end

keslert · November 15, 2021, 10:58pm

Any thoughts on this? Is the question clear?

Basically, I want to query two separate tables (books and series), combine the results, and do a final sort by the levenshtein distance.

kip · November 15, 2021, 11:55pm

I think you should be able to use just the levenshtein function (not the _less_equal version) to return the distance as a number. Then filter in a where clause with the distance < @threshold, then join/sort whatever you need. Does that make sense?

keslert · November 16, 2021, 1:52am

Thanks @kip. I’m using levenshtein_less_equal for efficiency reasons, but I’ll simplify things for a second because I realize it’s not important to my question. Do you mean something like this? Do I have to compute the levenshtein three separate times (where, order, and select)?

from(q in query,
  where: fragment("levenshtein(?, ?)", q.slug, ^slug) < @threshold,
  order_by: fragment("levenshtein(?, ?)", q.slug, ^slug),
  select_merge: %{distance: fragment("levenshtein(?, ?)", q.slug, ^slug)}
)

kip · November 16, 2021, 3:09am

I think you should be able to do something like:

from(q in query,
  where: q.distance < @threshold,
  order_by: q.distance,
  select_merge: %{distance: fragment("levenshtein(?, ?)", q.slug, ^slug)}
)

But I haven’t tested it. There should be no need to calculate more than once.

keslert · November 16, 2021, 4:54pm

That gives me the following error:

(Ecto.QueryError) field `distance` in `where` does not exist in schema MyApp.Book in query:

from b0 in MyApp.Book,
  where: b0.distance < 5,
  order_by: [asc: b0.distance],
  select: merge(b0, %{distance: fragment("levenshtein(?, ?)", b0.slug, ^...)})

kip · November 16, 2021, 11:32pm

Ah, yes, not possible to reuse an expression from a SELECT clause in a WHERE clause. This because in SQL the order of evaluation is join -> projection -> selection -> order and therefore calculated values in a SELECT are not available to the prior projection phase.

That said, Postgres does a good job of optimising multiple function calls with the same arguments to one call only. Per the documentation:

An IMMUTABLE function cannot modify the database and is guaranteed to return the same results given the same arguments forever. This category allows the optimizer to pre-evaluate the function when a query calls it with constant arguments. For example, a query like SELECT ... WHERE x = 2 + 2 can be simplified on sight to SELECT ... WHERE x = 4, because the function underlying the integer addition operator is marked IMMUTABLE.

And for the levenshtein function we can see it is defined as:

ecto_playground=# select pg_get_functiondef('levenshtein(text,text)'::regprocedure);
                     pg_get_functiondef                      
-------------------------------------------------------------
 CREATE OR REPLACE FUNCTION public.levenshtein(text, text)  +
  RETURNS integer                                           +
  LANGUAGE c                                                +
  IMMUTABLE PARALLEL SAFE STRICT                            +
 AS '$libdir/fuzzystrmatch', $function$levenshtein$function$+
 
(1 row)

Therefore I would go with your original query (in the first post of the thread). Sorry for the distraction.

keslert · November 17, 2021, 3:26pm

Okay, that makes sense. I didn’t know about the order of evaluation or the optimization of multiple function calls with the same arguments. That’s very helpful—thanks!