I realize that this isn't strictly a Phoenix/Elixir question, but I am asking in the context of Phoenix/Elixir and the libraries available for it. This is out of my lane, but I am guessing some libraries already use solid tech to solve this problem.
We have an app that is currently in production. In the end, it simply presents PDFs for clients to download. Each record has some metadata and some classifying data about the PDF.
There is a VERY rudimentary search; think LIKE '%<term>%' on a few columns.
We want to expand this search ability to do something like:
Search more thoroughly across several columns.
Index the PDFs and search them too.
For example, the search for “a flock of blue herons” would yield results for records where indexed fields or text in indexed PDFs had text like the following:
Blue herons often flock on the edge of cliffs.
Herons are sometimes blue and often flock together.
A blue heron by himself is not a flock.
I am not looking for anything more than suggestions for existing Elixir/Phoenix libraries to use.
Have you taken a look at Postgres full text search? I keep meaning to try it in my application but have not used it yet, though I used a similar feature in SQL Server many years ago. You can build a type of index that includes multiple columns, it can do partial/fuzzy searches, and it ranks results. This article seems decent: Understanding PostgreSQL Full Text Search: 10 Critical Aspects
I'm not sure about searching binary PDFs; you might have to extract the text and store it to make it searchable by Postgres.
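To make the idea a bit more concrete, here is a toy sketch of what full text search does conceptually: reduce both the query and the stored text to normalized word stems, drop filler words, and check whether the query's stems appear in the text. It is written in Python only to keep it self-contained, and it is not how Postgres implements this. The stop-word list and the "strip a trailing s" stemmer are my own simplifications, and the function names lexemes and matches are made up for the example.

```python
import re

# Tiny illustrative stop-word list (an assumption for this sketch);
# a real text search configuration ships a much longer one.
STOP_WORDS = {"a", "an", "and", "are", "by", "is", "not", "of", "on", "the"}


def lexemes(text):
    """Lowercase, split into words, drop stop words, crudely strip a plural 's'."""
    result = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        # Naive stand-in for a real stemmer: "herons" -> "heron".
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        result.add(word)
    return result


def matches(query, document):
    """True when every query lexeme also occurs in the document."""
    return lexemes(query) <= lexemes(document)


query = "a flock of blue herons"
documents = [
    "Blue herons often flock on the edge of cliffs.",
    "Herons are sometimes blue and often flock together.",
    "A blue heron by himself is not a flock.",
]

for doc in documents:
    print(matches(query, doc), "-", doc)
```

With this sketch the query reduces to the stems flock, blue and heron, and all three example sentences from the question contain them regardless of word order or plural form. A real setup would also need ranking and an index, which this does not attempt.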
A few weeks ago I had to try to refine a similarity search on Nordic names, and it was a disaster.
It seems that it's mostly tailored to English vocabulary, or I didn't find a way to fine-tune it for another language, which is also a legit possibility since I gave it about 10 minutes. We had absurd situations where a looser similarity search skipped a result whose first 2 characters matched while a tighter similarity search included it, and exactly the opposite happened with the first 3 characters.
Train wreck.
But again, let's be fully objective: I was given a very short time to play with it, so I likely missed other possible knobs to turn.
I've also used it in a project; however, as you said, the results sucked, and you mostly have to type complete words for it to work. This was some years ago, and pg_trgm (or something similar) was the only option. Lately I've seen that Postgres has added a lot more options, so it might be worth investigating.
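For anyone wondering why short or partial input scores so badly with trigram matching, here is a rough sketch of the idea behind pg_trgm, again in Python just to keep it self-contained. It follows the documented behavior as I understand it (lowercase the text, pad each word with two leading spaces and one trailing space, compare sets of three-character windows), but it is an approximation and not the extension's actual code. The function names trigrams and similarity are my own for this example, and details such as handling of non-ASCII characters may differ in Postgres.

```python
import re


def trigrams(text):
    """Three-character windows per word, padded with two leading spaces and one trailing space."""
    grams = set()
    # Only letters and digits count as word characters in this sketch.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("cat")))
print(similarity("heron", "herons"))
print(similarity("he", "herons"))
```

In this sketch "heron" against "herons" scores 0.625, while the two-letter prefix "he" against "herons" scores only 0.25, which is below pg_trgm's default similarity threshold of 0.3. That is consistent with the experience that you have to type most of a word before anything matches.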