Search engine without external database

lud · February 27, 2020, 2:45pm

Hello,

I have a small project I will work on with a guy who is not a developer. He wants to be able to work on this tool by himself so I will kickstart the project and give him a dev environment. He will work on the client/javascript part.

I will write a tiny backend with Phoenix, and one important aspect of the project is to be able to type a search string in an input field and get a selection of matching objects. For example, if I type "table blu" I will get these matching names: "Small Blue turntables", "Big table", "Blurry thing", etc. If that is too complicated to implement, it is OK to have to type *table to get "Small Blue turntables".

The search actually returns item IDs, not names.
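
To make the expected behaviour concrete, here is a minimal sketch of the matching rule described above: each query word is tried as a case-insensitive substring of both the English and the French name, and the IDs of items matching at least one word are returned. The sample items, the French names, and the function names (terms, searchIds) are assumptions for illustration only.

```javascript
// Hypothetical sample data; the field names follow the post (name, name_fr).
const items = [
  { id: 1, name: "Small Blue turntables", name_fr: "Petites platines bleues" },
  { id: 2, name: "Big table", name_fr: "Grande table" },
  { id: 3, name: "Blurry thing", name_fr: "Chose floue" },
  { id: 4, name: "Red chair", name_fr: "Chaise rouge" },
];

// Split a query into lowercase terms, dropping empty strings.
function terms(query) {
  return query.toLowerCase().split(/\s+/).filter(Boolean);
}

// Return the IDs of items where at least one query term occurs as a
// substring of either the English or the French name.
function searchIds(items, query) {
  const ts = terms(query);
  return items
    .filter((item) => {
      const haystack = `${item.name} ${item.name_fr}`.toLowerCase();
      return ts.some((t) => haystack.includes(t));
    })
    .map((item) => item.id);
}

console.log(searchIds(items, "table blu")); // → [ 1, 2, 3 ]
```

A linear scan like this visits every item on each keystroke, which should be acceptable for a couple of thousand items per category but does no ranking.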

That would be easy with PostgreSQL, but here are the constraints:

Connectivity may be limited sometimes so we want everything to work on a single laptop (no external tool like Firebase).
His laptop is shared with his family, so I would like to avoid installing Postgres or Docker.
The items table has a name field, but also another field for the name in French. Maybe that would be in another table if we want to add more languages. “table” is the same word in both languages so when I type a string, I have to search each word in both languages.
This setup is only for developing the app. If we make it public, I will install Postgres.
There will be around 10K items, definitely no more than 20K. When searching for an item, we will search within one of approx. 10 categories, so each search covers roughly 1,000 to 2,000 items.

I know this sounds like making things complicated for no reason but I believe there is one way to make it run fast.

I was about to load everything in ETS or Mnesia, building a table per category and just walking over the table with a regex, but before that I would like to know if someone here has a smarter idea, because it will take time to build.
Also, there are no table writes: the item tables are static. Maybe I could extract all unique words from all names, list all items for each of those words, and define static lookup functions with a macro. But regexes are not supported in guards, so that would be limited to full and exact words.
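
The "unique words to items" idea above is essentially an inverted index. Here is a minimal sketch of it in JavaScript rather than Elixir macros, assuming hypothetical sample items and function names (buildIndex, searchByPrefix). Looking terms up as word prefixes instead of exact keys lifts the "full and exact words" limitation, though a term in the middle of a word still does not match.

```javascript
// Hypothetical sample data; the field names follow the post (name, name_fr).
const sampleItems = [
  { id: 1, name: "Small Blue turntables", name_fr: "Petites platines bleues" },
  { id: 2, name: "Big table", name_fr: "Grande table" },
  { id: 3, name: "Blurry thing", name_fr: "Chose floue" },
];

// Build a Map from each lowercase word to the Set of item IDs whose
// English or French name contains that word.
function buildIndex(items) {
  const index = new Map();
  for (const item of items) {
    const words = `${item.name} ${item.name_fr}`
      .toLowerCase()
      .split(/\s+/)
      .filter(Boolean);
    for (const word of words) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(item.id);
    }
  }
  return index;
}

// Treat every query term as a word prefix and union the matching IDs.
function searchByPrefix(index, query) {
  const ids = new Set();
  const queryTerms = query.toLowerCase().split(/\s+/).filter(Boolean);
  for (const term of queryTerms) {
    for (const [word, wordIds] of index) {
      if (word.startsWith(term)) {
        for (const id of wordIds) ids.add(id);
      }
    }
  }
  return [...ids].sort((a, b) => a - b);
}

const wordIndex = buildIndex(sampleItems);
console.log(searchByPrefix(wordIndex, "blu"));   // → [ 1, 3 ]
console.log(searchByPrefix(wordIndex, "table")); // → [ 2 ]
```

Note that "table" does not find "Small Blue turntables" here, because "turntables" only contains the term in the middle of the word. This sketch scans all index keys per term; a sorted key list or a trie would avoid that, but for a few thousand words the scan is likely fast enough.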

Thanks for reading!

lucaong · February 27, 2020, 2:51pm

This sounds like a great use-case for MiniSearch, a client-side full-text search engine written in JavaScript, small enough to run in the browser and with zero dependencies (disclaimer: I am the author of the library).

MiniSearch is routinely used for searching amongst tens of thousands of small items (e.g. all products in the catalog of a supplier), and the data can be indexed upon each page load (it usually takes less than a second, and can be done asynchronously).

Here’s a demo on a database of ~5000 songs.

al2o3cr · February 27, 2020, 2:52pm

Using an online IDE could work around a lot of the “can’t install PG or Docker locally” issues without writing any extra code - something like https://www.gitpod.io for instance.

lud · February 27, 2020, 2:54pm

But that requires downloading all the data into the browser? That could be acceptable since it is just a temporary hack.

lud · February 27, 2020, 2:55pm

Well, as I said, our connectivity could be limited. I’m not sure I’m using the right words, but it just boils down to “no internet” sometimes.

lucaong · February 27, 2020, 3:03pm

Yes, it requires transferring all data to the browser, but if the items are small it can be surprisingly fast (the demo that I sent does the same: it transfers and re-indexes the whole collection at page load). JSON compresses very well, so with server-side caching and compression you can go a long way. Download and indexing can both happen asynchronously, so you don’t have to block the UI even if it takes a couple of seconds.

I know that client-side full-text search might sound strange, but if the size of the data allows for it (and 20,000 small items can easily be within the limits), consider the following advantages:

You would not have to run a search server
No need to set up an indexing pipeline to (re)index new or updated records
No network latency, search can happen as you type
It can work completely offline, once the collection is loaded
You get an auto-completion engine too in case you need that feature
You still get fuzzy search (robust to misspelling), prefix search, etc.

Of course, if the data does not fit in the browser you’ll have to resort to a search server (ElasticSearch, Solr, or even the Postgres full-text capabilities), but my suggestion is to seriously consider client-side search. It has helped several of my projects a lot.
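
To illustrate what "prefix search" and "fuzzy search" mean in the list above, here is a dependency-free sketch that accepts a word when the query term is a prefix of it or within a small edit distance of it. This is not MiniSearch's actual implementation or API; the function names (editDistance, matchesTerm) and the default tolerance of one edit are assumptions chosen for the example.

```javascript
// Classic dynamic-programming Levenshtein distance between two strings,
// keeping only the previous row of the table in memory.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // delete a character from a
        curr[j - 1] + 1,    // insert a character into a
        prev[j - 1] + cost  // substitute, or keep if equal
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// A word matches a term if the term is a prefix of the word, or if the two
// are within maxEdits single-character edits of each other.
function matchesTerm(word, term, maxEdits = 1) {
  const w = word.toLowerCase();
  const t = term.toLowerCase();
  return w.startsWith(t) || editDistance(w, t) <= maxEdits;
}

console.log(matchesTerm("turntables", "turn")); // → true (prefix)
console.log(matchesTerm("table", "tible"));     // → true (one substitution)
console.log(matchesTerm("chair", "table"));     // → false
```

A real search library typically also ranks results, so exact matches come before prefix and fuzzy ones; this sketch only decides whether a word matches at all.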

lud · February 27, 2020, 3:14pm

It does not sound weird at all, it is actually nice. The data will be small since the browser will be fed only the id/name/name_fr subset, as there is no need to download the full items database.

And so instead of loading my data in tables I just have to build static lookup modules with macros.

Thank you!

lucaong · February 27, 2020, 3:20pm

You're welcome! In case your project is using React, you can also use the React wrapper for MiniSearch.

lud · February 27, 2020, 4:06pm

I generally go with Svelte, but that will be a good guideline.