Shopify feature: full-text search or joins to find text in multiple tables

Hi everyone.

I’m building a kind of Shopify clone in Phoenix, and I need to implement a navigation menu. The idea is to take whatever text is entered in the Link section and find any occurrence of it across several schemas/tables, i.e. a full-text search. I’ve read that I can do this with PostgreSQL by creating a materialized view, etc., or alternatively I could just build an Ecto query with joins.

My question is:

Do you know what a good approach would be for building the query for this feature? As you can see in this example, there could be several schemas involved: Product, Collection, and Page.

If anyone has experience building something like this or has a good approach to building the query, I’d really appreciate your help.

Btw: I would like to avoid using Elasticsearch, since Postgres has good features to support indexing (if there is some approach that uses indexing).

Thanks!


You need to do a couple of things to get this working. You could probably do joins, but I personally prefer the materialised view approach with a background process that just refreshes it. This is viable if you don’t need searches to reflect new data immediately, and as long as you can create a unique index on the view, PostgreSQL will be able to refresh it concurrently.

The process can be summarised as:

  1. Create the materialised view:
def change do
  execute("""
  CREATE MATERIALIZED VIEW your_search_table AS
  SELECT
    'product' AS result_type,
    products.id,
    products.title AS display_text,
    to_tsvector('english', products.title) AS tsv_search,
    product_categories.category_id AS category_id,
    product_subcategories.subcategory_id AS subcategory_id
  FROM products
  LEFT JOIN product_categories ON product_categories.product_id = products.id
  LEFT JOIN product_subcategories ON product_subcategories.product_id = products.id
  WHERE products.published
  UNION ALL
  SELECT
    'store' AS result_type,
    stores.id,
    stores.company_name AS display_text,
    to_tsvector('english', stores.company_name) AS tsv_search,
    NULL AS category_id,
    NULL AS subcategory_id
  FROM stores
  WHERE stores.type = 'marketplace'
  """)

  execute("CREATE UNIQUE INDEX id_type_index ON your_search_table (result_type, id);")
  execute("CREATE INDEX tsv_index ON your_search_table USING gin(tsv_search);")
  execute("CREATE INDEX categories_index ON your_search_table (category_id, subcategory_id) WHERE result_type = 'product';")
end

Note this has had things removed, like setting weights on the tsvectors, so it might not work copy-pasted, but it should give you the idea.
Perhaps you should also add a reverse action to the execute statements (execute/2 takes both an up and a down command). The unique index is necessary for the concurrent refresh: by using a type plus an id, where each id is unique within the scope of its type, you can be sure the index is unique. If for some reason you can’t have that, you would instead create new materialised views on the fly and rename & drop them as you go, so you always have a table ready and no queries get blocked waiting for the refresh; see the sketch below.
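A minimal sketch of that swap approach, assuming the Db.Repo used later in this post; rebuild_search_view/0 and create_view_sql/1 are hypothetical names, and in practice you would also need to recreate the indexes on the new view:

# Hypothetical swap-based rebuild: build a fresh view under a temporary name,
# then rename it into place so readers never wait on a refresh.
# `create_view_sql/1` is assumed to return the CREATE MATERIALIZED VIEW
# statement for the given view name.
def rebuild_search_view do
  Db.Repo.query!(create_view_sql("your_search_table_new"))

  Db.Repo.transaction(fn ->
    Db.Repo.query!("DROP MATERIALIZED VIEW IF EXISTS your_search_table")
    Db.Repo.query!("ALTER MATERIALIZED VIEW your_search_table_new RENAME TO your_search_table")
  end)
end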

  2. Define a schema for it:
defmodule SearchTable do
  use Ecto.Schema

  @derive {Jason.Encoder, only: [:result_type, :id, :display_text]}
  @primary_key false
  schema "your_search_table" do
    field :result_type, :string, primary_key: true
    field :id, :id, primary_key: true
    field :display_text, :string
    field :tsv_search, TSVector.Type
    field :category_id, :id
    field :subcategory_id, :id
  end
end
  3. Define a type for the TSVector:
defmodule TSVector.Type do
  # `use Ecto.Type` (Ecto 3) fills in the optional callbacks;
  # on Ecto 2 use `@behaviour Ecto.Type` instead.
  use Ecto.Type

  def type, do: :tsvector
  def cast(tsvector), do: {:ok, tsvector}
  def load(tsvector), do: {:ok, tsvector}
  def dump(tsvector), do: {:ok, tsvector}
end
  4. Now, to query the table you’ll need a fragment to make the tsvector usable in normal queries, so define one somewhere and import it into whichever context runs the queries:
# Expands to a fragment inside Ecto queries; the calling module needs
# `import Ecto.Query` for `where`/`fragment` to work.
defmacro tsquery(field, text) do
  quote do
    fragment("?::tsvector @@ to_tsquery('english', ?)", unquote(field), unquote(text))
  end
end
  5. You’ll also need to prepare the incoming text for the tsquery:
def split_text_for_tsquery(text) do
  text
  |> String.split(" ", trim: true)
  |> Enum.reject(fn token -> Regex.match?(~r/\(|\)|\[|\]|\{|\}/, token) end)
  |> Enum.map(fn token -> token <> ":*" end)
  |> Enum.join(" & ")
end
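For example, the :* suffix makes to_tsquery treat each token as a prefix match:

split_text_for_tsquery("red shoes")
#=> "red:* & shoes:*"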

Then you can use it as:

def search(terms) do
  case split_text_for_tsquery(terms) do
    "" -> []
    search ->
      SearchTable
      |> where([yst], tsquery(yst.tsv_search, ^search))
      |> Db.Repo.all()
  end
end
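Since the weights were stripped out above, these results come back unranked; if you want better matches first, one option (a sketch, not from the steps above; ranked_search/1 is a hypothetical name) is to order by PostgreSQL’s ts_rank through a fragment:

def ranked_search(terms) do
  case split_text_for_tsquery(terms) do
    "" -> []
    search ->
      SearchTable
      |> where([yst], tsquery(yst.tsv_search, ^search))
      |> order_by([yst], desc: fragment("ts_rank(?::tsvector, to_tsquery('english', ?))", yst.tsv_search, ^search))
      |> Db.Repo.all()
  end
end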
  6. Finally, you’ll need to keep the view updated. You can write a process that loops and refreshes it on an interval, e.g. every 5 minutes:
 Db.Repo.query("REFRESH MATERIALIZED VIEW CONCURRENTLY #{@your_search_table_name}", [], [timeout: 360_000])
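A minimal sketch of such a process as a GenServer, assuming the Db.Repo and view name from above; the module name and interval are illustrative:

defmodule SearchViewRefresher do
  use GenServer

  # Hypothetical refresher: re-runs the concurrent refresh on a fixed
  # interval so search results stay reasonably fresh.
  @view_name "your_search_table"
  @refresh_interval :timer.minutes(5)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    schedule_refresh()
    {:ok, nil}
  end

  @impl true
  def handle_info(:refresh, state) do
    Db.Repo.query("REFRESH MATERIALIZED VIEW CONCURRENTLY #{@view_name}", [], timeout: 360_000)
    schedule_refresh()
    {:noreply, state}
  end

  defp schedule_refresh do
    Process.send_after(self(), :refresh, @refresh_interval)
  end
end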

Awesome, thanks so much for such a complete answer. I’ll start with this approach. :sunglasses:
