Ecto not using indexes in SQLite causing slow query

cd-slash · October 5, 2024, 10:31am

I have a relatively large table (~350k rows and ~50 columns) in an existing GADM database that I’m connecting to with Ecto.

I’ve created indexes on the table with the migration below:

defmodule GADM.Migrations.AddIndexes do
  use Ecto.Migration

  def up do
    create(index(:gadm_410, [:gid_0]))
    create(index(:gadm_410, [:gid_1]))
    create(index(:gadm_410, [:gid_2]))
    create(index(:gadm_410, [:gid_3]))
    create(index(:gadm_410, [:gid_4]))
    create(index(:gadm_410, [:gid_5]))
    create(index(:gadm_410, [:country]))
  end

  def down do
    drop(index(:gadm_410, [:gid_0]))
    drop(index(:gadm_410, [:gid_1]))
    drop(index(:gadm_410, [:gid_2]))
    drop(index(:gadm_410, [:gid_3]))
    drop(index(:gadm_410, [:gid_4]))
    drop(index(:gadm_410, [:gid_5]))
    drop(index(:gadm_410, [:country]))
  end
end

Ecto.Migrator.down(GADMRepo, 0, GADM.Migrations.AddIndexes)
Ecto.Migrator.up(GADMRepo, 0, GADM.Migrations.AddIndexes)

I’m then making a relatively simple query using Ecto:

import Ecto.Query

id_field_name = String.to_atom("gid_3")
child_id_field_name = String.to_atom("gid_4")
name_field_name = String.to_atom("name_3")

GADM.Region
|> where([r], field(r, ^id_field_name) != "")
|> where([r], field(r, ^child_id_field_name) == "")
|> select([r], %{
  gid: field(r, ^id_field_name),
  name: field(r, ^name_field_name)
})
|> distinct(true)
|> GADMRepo.all()

This works, but is very slow:


10:26:39.908 [debug] QUERY OK source="gadm_410" db=2820.3ms queue=0.1ms idle=1446.3ms
SELECT DISTINCT g0."gid_3", g0."name_3" FROM "gadm_410" AS g0 WHERE (g0."gid_3" != '') AND (g0."gid_4" = '') []
↳ :elixir.eval_external_handler/3, at: src/elixir.erl:386

In contrast, when I paste exactly the same SQL as the output shows into SQLIte Studio, it executes in ~1ms, which is what I’d have anticipated with indexes available for all of the columns I’m working with.

To diagnose, firstly I’ve tried to use explain but can’t see any straightforward way to get Ecto to explain a query like this. Is there a function for that?

Secondly, since the indexes are clearly being built and then used when I execute the query elsewhere, why would Ecto not be using them? Even without the explain I can intuit that this is what’s happening as deleting the indexes yields almost identical execution times, so it seems they’re not having any effect.

LostKobrakai · October 5, 2024, 10:48am

https://hexdocs.pm/ecto_sql/3.12.0/Ecto.Adapters.SQL.html#explain/3

Ecto has no influence on the db using an index or not. You’d want to make sure the queries are indeed the same (including using parameters and prepared queries or not).

cd-slash · October 5, 2024, 12:18pm

Ecto.Adapters.SQL — Ecto SQL v3.12.0

Thanks, I’d tried that but couldn’t get it working for SQLite, seemed to only be for Postgres / MySQL?

Ecto has no influence on the db using an index or not. You’d want to make sure the queries are indeed the same (including using parameters and prepared queries or not).

Noted, I’ll try to ensure the queries are identical and continue trying to find a way to profile the query from Ecto

LostKobrakai · October 5, 2024, 12:32pm

Use the Repo.explain variant as documented. This exists for the sqlite driver as well.

cd-slash · October 5, 2024, 12:44pm

Thanks, I was actually able to get it working with the code below:

GADMRepo.explain(
  :all,
  GADM.Region
  |> where([r], field(r, ^id_field_name) != "")
  |> where([r], field(r, ^child_id_field_name) == "")
  |> select([r], %{
    gid: field(r, ^id_field_name),
    name: field(r, ^name_field_name)
  })
  |> distinct(true)
)

This yielded the query plan:

"QUERY PLAN
|--SEARCH g0 USING INDEX gadm_410_gid_4_index (GID_4=?)
`--USE TEMP B-TREE FOR DISTINCT"

This is somewhat unexpected since it shows the plan to use the gid_4 index (though no reference to gid_3 index).

I’ll need to keep exploring with some alternative queries to see how I might be able to speed it up / get it closer to what’s happening in SQLite Studio when I run the SQL there.

cd-slash · October 5, 2024, 1:06pm

To test the indexes, I created some compound indexes with both of the columns that I might be using as below and got some unexpected results.

defmodule GADM.Migrations.AddIndexes do
  use Ecto.Migration

  def up do
    create(index(:gadm_410, [:gid_0]))
    create(index(:gadm_410, [:gid_0, :gid_1]))
    create(index(:gadm_410, [:gid_1]))
    create(index(:gadm_410, [:gid_1, :gid_2]))
    create(index(:gadm_410, [:gid_2]))
    create(index(:gadm_410, [:gid_2, :gid_3]))
    create(index(:gadm_410, [:gid_3]))
    create(index(:gadm_410, [:gid_3, :gid_4]))
    create(index(:gadm_410, [:gid_4]))
    create(index(:gadm_410, [:gid_4, :gid_5]))
    create(index(:gadm_410, [:gid_5]))
    create(index(:gadm_410, [:country]))
  end

  def down do
    drop(index(:gadm_410, [:gid_0]))
    drop(index(:gadm_410, [:gid_0, :gid_1]))
    drop(index(:gadm_410, [:gid_1]))
    drop(index(:gadm_410, [:gid_1, :gid_2]))
    drop(index(:gadm_410, [:gid_2]))
    drop(index(:gadm_410, [:gid_2, :gid_3]))
    drop(index(:gadm_410, [:gid_3]))
    drop(index(:gadm_410, [:gid_3, :gid_4]))
    drop(index(:gadm_410, [:gid_4]))
    drop(index(:gadm_410, [:gid_4, :gid_5]))
    drop(index(:gadm_410, [:gid_5]))
    drop(index(:gadm_410, [:country]))
  end
end

Ecto.Migrator.down(GADMRepo, 0, GADM.Migrations.AddIndexes)
Ecto.Migrator.up(GADMRepo, 0, GADM.Migrations.AddIndexes)

This is my schema, for reference:

defmodule GADMRepo do
  use Ecto.Repo, otp_app: :guyde, adapter: Ecto.Adapters.SQLite3
end

defmodule GADM.Region do
  use Ecto.Schema

  @primary_key {:fid, :id, autogenerate: false}
  schema "gadm_410" do
    field(:name_0, :string)
    field(:name_1, :string)
    field(:name_2, :string)
    field(:name_3, :string)
    field(:name_4, :string)
    field(:name_5, :string)
    field(:gid_0, :string)
    field(:gid_1, :string)
    field(:gid_2, :string)
    field(:gid_3, :string)
    field(:gid_4, :string)
    field(:gid_5, :string)
    field(:geom, :binary)
    field(:country, :string)
    field(:subcont, :string)
    field(:continent, :string)
  end
end

The unexpected result is that it’s using a different index than I’d have expected - with the explain code below to find rows where gid_3 is not empty and gid_4 is empty, it seems to be using the index for gid_4 and gid_5 instead?

"QUERY PLAN\n|--SEARCH g0 USING INDEX gadm_410_gid_4_gid_5_index (GID_4=?)\n`--USE TEMP B-TREE FOR DISTINCT"

Obviously unexpected and peculiar; I will continue to investigate, but any insight into why this would be is greatly appreciated.

ruslandoga · October 5, 2024, 1:44pm

@cd-slash

Just a guess: gid_4 might be picked because your query uses the equality operator with it, and on gid_3 it uses != which probably gets ignored.

Possible workarounds:

running ANALYZE (probably won’t work)
compiling with SQLITE_ENABLE_STAT4 (and running ANALYZE) you can use Built-In Scalar SQL Functions to find out if this option was used, seems like Exqlite compiles with that option by default: exqlite/Makefile at main · elixir-sqlite/exqlite · GitHub
The INDEXED BY Clause (can be added via Ecto.Query — Ecto v3.12.3) or Built-In Scalar SQL Functions
rewriting the query to not use !=, this can be done with a subquery that picks all unique values from gid_3

al2o3cr · October 5, 2024, 6:15pm

Like @ruslandoga said, the != operator is likely the root of your struggles.

Some things to consider trying:

reverse the order in the multicolumn indexes - so [:gid_4, :gid_3] instead of [:gid_3, :gid_4]. That way the initial columns in the index are being matched with =.
create an index on expressions like ["gid_4 != ''", :gid_3] that match the exact expressions being used in the query. (that syntax is likely wrong)

joey_the_snake · October 5, 2024, 10:48pm

Is everything exactly the same when you query outside of Ecto? For example, the table has the same data inside of it?

If the query text is 100% the same in both places the only way they can choose different plans is if the estimated amount of work is different. The most obvious ways this can happen is if the data is different or it can change between analyze runs because those are based on samples. Or maybe your SQL CLI is using a different version of SQLite than Ecto is.

cd-slash · October 7, 2024, 10:33am

It seems like this was accurate - swapping the order of the == and != yielded much better performance. I was able to further improve it by chunking on country which is also indexed, and have got it running ~500x faster at this point, which is good enough for an infrequent task.

Thanks all - appreciate this ended up not being an Ecto issue and more of a general database problem, but I’ve learned a lot about how Ecto works from this too.