Help with SQLite reliability (idle time & timeout errors)

(TLDR at the end)

I spent the past two days trying to optimize one of the SQLite DBs and operations on them for what would be a rather extreme scenario for us, though not impossible if we assume bugs (my thinking is if it works under these conditions, prod will be a breeze). Unfortunately my efforts resulted in little to no success…

I searched the forum and found a few good posts and articles (e.g., Ecto & sqlite3 database lock issues - #2 by victorbjorklund, and Optimizing SQLite for servers from Ecto & sqlite3 database lock issues - #7 by BartOtten). I’ve skimmed the SQLite docs (especially the PRAGMAs) plus the ecto_sqlite3 and exqlite docs too. I’ve tried a few different things (configs, DB access patterns, indexes, …), but I can’t get much more out of the DB.

So I’ve resigned from that battle for now.

However, there are still two things bothering me. One is that the idle time is sometimes unreasonably high (e.g. 1608.5ms vs 12.9ms db time). The other is that when a given statement takes more time than the :timeout, an error is logged (suggesting the execution was canceled), but the statement actually runs to completion and even returns, because no error is raised. Here are logs showing this (I used timeout: 100 to force the error):

# First DB call
[info] MachineManager synchronizing... Current version=0, target=999992
[error] Exqlite.Connection (#PID<0.400.0> ("db_conn_2")) disconnected: ** (DBConnection.ConnectionError) client #PID<0.415.0> (Foo.Office.MachineAggregator) timed out because it queued and checked out the connection for longer than 100ms

#PID<0.415.0> (Foo.Office.MachineAggregator) was at location:

    (exqlite 0.35.0) lib/exqlite/sqlite3.ex:274: Exqlite.Sqlite3.multi_step/3
    (exqlite 0.35.0) lib/exqlite/sqlite3.ex:312: Exqlite.Sqlite3.try_fetch_all/3
    (exqlite 0.35.0) lib/exqlite/sqlite3.ex:306: Exqlite.Sqlite3.fetch_all/3
    (exqlite 0.35.0) lib/exqlite/connection.ex:695: Exqlite.Connection.get_rows/2
    (exqlite 0.35.0) lib/exqlite/connection.ex:641: Exqlite.Connection.execute/4
    (db_connection 2.9.0) lib/db_connection/holder.ex:356: DBConnection.Holder.holder_apply/4
    (db_connection 2.9.0) lib/db_connection.ex:1539: DBConnection.run_execute/5
    (db_connection 2.9.0) lib/db_connection.ex:1587: DBConnection.run_with_retries/5
    (db_connection 2.9.0) lib/db_connection.ex:848: DBConnection.execute/4
    (ecto_sqlite3 0.22.0) lib/ecto/adapters/sqlite3/connection.ex:90: Ecto.Adapters.SQLite3.Connection.query/4
    (ecto_sql 3.13.5) lib/ecto/adapters/sql.ex:1073: Ecto.Adapters.SQL.execute!/5
    (ecto_sql 3.13.5) lib/ecto/adapters/sql.ex:1011: Ecto.Adapters.SQL.execute/6
    (ecto 3.13.5) lib/ecto/repo/queryable.ex:241: Ecto.Repo.Queryable.execute/4
    (ecto 3.13.5) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
    (foo 0.1.0) lib/foo/office/machine_aggregator.ex:172: Foo.Office.MachineAggregator.synchronize/2
    (foo 0.1.0) lib/foo/office/machine_aggregator.ex:149: Foo.Office.MachineAggregator.handle_cast/2
    (stdlib 7.3) gen_server.erl:2460: :gen_server.try_handle_cast/3
    (stdlib 7.3) gen_server.erl:2418: :gen_server.handle_msg/3
    (stdlib 7.3) proc_lib.erl:333: :proc_lib.init_p_do_apply/3

# Result of the first DB call
[debug] QUERY OK db=1346.3ms idle=37.3ms
SELECT s0."id", s0."type", s0."arguments", s0."emitted_at", s0."inserted_at", s0."updated_at", s0."device_id", s0."row_number" FROM (SELECT ss0."id" AS "id", ss0."type" AS "type", ss0."arguments" AS "arguments", ss0."emitted_at" AS "emitted_at", ss0."inserted_at" AS "inserted_at", ss0."updated_at" AS "updated_at", ss0."device_id" AS "device_id", row_number() OVER "device_id_window" AS "row_number" FROM (SELECT sse0."id" AS "id", sse0."type" AS "type", sse0."arguments" AS "arguments", sse0."emitted_at" AS "emitted_at", sse0."inserted_at" AS "inserted_at", sse0."updated_at" AS "updated_at", coalesce(json_extract(sse0."arguments", '$.context.device_id'), json_extract(sse0."arguments", '$.arguments.device_id')) AS "device_id" FROM "events" AS sse0 WHERE (sse0."type" IN (?,?,?,?,?)) AND (sse0."id" > ?) ORDER BY sse0."emitted_at") AS ss0 WHERE (NOT (ss0."device_id" IS NULL)) WINDOW "device_id_window" AS (PARTITION BY ss0."device_id" ORDER BY ss0."emitted_at")) AS s0 WHERE (s0."row_number" <= ?) ["v0:machine:pinged", "v0:machine:reported", "v0:machine:added_pending_action", "v0:machine:canceled_pending_action", "v0:machine:sent_package", 0, 50]
↳ Foo.Office.MachineAggregator.synchronize/2, at: lib/foo/office/machine_aggregator.ex:172

# Second DB call (we can see the first really did succeed because "Current version" is now >0)
# Even stranger, this second call doesn't log any timeout error
[info] MachineManager synchronizing... Current version=999866, target=999992
# Result of the second DB call
[debug] QUERY OK db=12.9ms idle=1608.5ms
SELECT s0."id", s0."type", s0."arguments", s0."emitted_at", s0."inserted_at", s0."updated_at", s0."device_id", s0."row_number" FROM (SELECT ss0."id" AS "id", ss0."type" AS "type", ss0."arguments" AS "arguments", ss0."emitted_at" AS "emitted_at", ss0."inserted_at" AS "inserted_at", ss0."updated_at" AS "updated_at", ss0."device_id" AS "device_id", row_number() OVER "device_id_window" AS "row_number" FROM (SELECT sse0."id" AS "id", sse0."type" AS "type", sse0."arguments" AS "arguments", sse0."emitted_at" AS "emitted_at", sse0."inserted_at" AS "inserted_at", sse0."updated_at" AS "updated_at", coalesce(json_extract(sse0."arguments", '$.context.device_id'), json_extract(sse0."arguments", '$.arguments.device_id')) AS "device_id" FROM "events" AS sse0 WHERE (sse0."type" IN (?,?,?,?,?)) AND (sse0."id" > ?) ORDER BY sse0."emitted_at") AS ss0 WHERE (NOT (ss0."device_id" IS NULL)) WINDOW "device_id_window" AS (PARTITION BY ss0."device_id" ORDER BY ss0."emitted_at")) AS s0 WHERE (s0."row_number" <= ?) ["v0:machine:pinged", "v0:machine:reported", "v0:machine:added_pending_action", "v0:machine:canceled_pending_action", "v0:machine:sent_package", 999866, 50]
↳ Foo.Office.MachineAggregator.synchronize/2, at: lib/foo/office/machine_aggregator.ex:172

I wanted to know what exactly the db/idle times mean. I found where the log is produced, and the meaning of each value (db is the :connection_time):

  • :connection_time - The length of time using the connection (if a connection was used)
  • :idle_time - The amount of time the connection was idle before use

If I understand it correctly, then: db is the time using the connection; and idle is the time between a process checking out a connection and starting to use it?
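To see the raw numbers behind that log line, one option is a telemetry handler on the repo's query event. A minimal sketch, assuming the repo is Foo.Repo (so the default event name is [:foo, :repo, :query]) and that the :telemetry library is available, as it is in any Ecto app:

```elixir
# Measurements arrive in native time units; convert to ms for readability.
:telemetry.attach(
  "log-query-times",
  [:foo, :repo, :query],
  fn _event, measurements, _metadata, _config ->
    to_ms = fn
      nil -> nil
      native -> System.convert_time_unit(native, :native, :millisecond)
    end

    # :query_time is the "db" number from the log line; :idle_time is how
    # long the connection sat unused in the pool before this checkout.
    IO.inspect(%{
      idle_ms: to_ms.(measurements[:idle_time]),
      queue_ms: to_ms.(measurements[:queue_time]),
      query_ms: to_ms.(measurements[:query_time])
    })
  end,
  nil
)
```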

I also wanted to know what the timeout error meant. I found the log, but I’m having a hard time figuring out where that timer is started so I can continue investigating – there are two :erlang.start_timer/4 calls, but neither uses a 4-tuple message?


TLDR: The questions I’m left with:

  1. Which time(s) count(s) for the timeout error? All of them? Only db? This could explain why the second DB call doesn’t time out; and actually further experiments point in this direction: with timeout: 10 both calls show the error log.
  2. What influences the idle time? Is it even worrying? (maybe not if only db time counts for the timeout?)
  3. Why is the timeout error logged if no error is raised? Is there any scenario where the error is raised (and thus the calling process dies if it doesn’t catch)?
  4. If these timeout errors are worrying, how can I prevent them from happening (other than setting a higher :timeout)?
  5. Do people using other DBs (e.g. PostgreSQL) have similar experiences? Or is this mostly only applicable/common on SQLite?

Thanks a lot in advance!


The current exqlite implementation cannot stop queries. See Replace DirtyNIF execution model with a different mechanism · Issue #192 · elixir-sqlite/exqlite · GitHub

This is for judging how busy your pool of connections is. Small idle values are the problematic ones, not large ones.


I ran into similar issues recently. A lot depends on your SQLite configuration - could you share yours?

You should keep in mind that only one connection can write to the database at a time; multiple concurrent reads are possible, but only under certain settings.

For example, if you rely on a :memory: db, you may need to set read_uncommitted: 1 so that reads are not blocked by writes (though be aware that reads may then see uncommitted changes); otherwise every single write will block all concurrent read operations.

    database: "file:db?mode=memory&cache=shared",
    pool_size: 50,
    # read_uncommitted allows reads to proceed without waiting for
    # write locks, improving read concurrency at the cost of
    # potentially reading uncommitted data
    custom_pragmas: [read_uncommitted: 1]

If you’re using a file-based database, you should set the journal_mode to WAL (which is usually the default).

    database: "priv/data/sqlite.db",
    journal_mode: :wal,
    pool_size: 50,

Another important aspect is caching. By caching the output of a SQLite query for just 1 second, I was able to increase throughput from about 1,000 reads/sec to around 2,000,000 reads/sec :sweat_smile:
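A minimal sketch of that kind of short-TTL read-through cache, using ETS (the module and table names are made up for illustration; in a real app the expensive function would be something like fn -> Repo.all(query) end):

```elixir
defmodule QueryCache do
  @moduledoc "Minimal 1-second read-through cache sketch (ETS-based)."
  @ttl_ms 1_000

  def start do
    :ets.new(:query_cache, [:named_table, :set, :public, read_concurrency: true])
  end

  # Returns the cached value when it is younger than @ttl_ms,
  # otherwise runs `fun` and caches its result.
  def fetch(key, fun) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(:query_cache, key) do
      [{^key, value, inserted_at}] when now - inserted_at < @ttl_ms ->
        value

      _ ->
        value = fun.()
        :ets.insert(:query_cache, {key, value, now})
        value
    end
  end
end
```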


I’d seen that but hadn’t quite connected it to these questions! If queries can’t be stopped anyway, what would I lose by setting a default of timeout: :infinity? In practice, from the DB’s point of view, that’s already the reality.
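For reference, the timeout can be raised either per call or repo-wide; a sketch (the repo name and query variable are placeholders):

```elixir
# Per-call override for a single expensive query:
Foo.Repo.all(query, timeout: :infinity)

# Or as a repo-wide default, in config/config.exs:
config :foo, Foo.Repo,
  timeout: :infinity
```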

Oh, can you confirm then that is incorrect?

In that case, in the logs I showed, the idle time is so high because the connection timed out, and it was only used again by the following query (which was of course executed after the first one finished). Is this correct?

(Thanks a lot btw, your replies are usually as enlightening as they are short! : )

Thanks for the tips, I’d missed the shared cache thing, and the read_uncommitted PRAGMA.

In dev I’m using:

    database: Path.expand(filename, databases_directory),
    pool_size: 5,
    stacktrace: true,
    journal_mode: :wal,
    temp_store: :memory,
    synchronous: :normal,
    foreign_keys: :on,
    cache_size: -64000,
    auto_vacuum: :incremental,
    chunk_size: 100,
    custom_pragmas: [
      # wal_checkpoint: "TRUNCATE",
      # optimize: 0x10002
    ],
    show_sensitive_data_on_connection_error: true

(The :custom_pragmas are commented out because they result in errors and I haven’t taken the time to investigate further)

Yeah, that I’m already doing – it’s actually the operation in the logs, which fetches the last few rows per partition. :slight_smile: It would be good to get more out of this, but right now it’s acceptable because it’s run only once on startup (or process crash).

More problematic is the most costly operation, which is cleanup (scheduled every few hours). One of the steps uses ROW_NUMBER(), I think that’s the biggest reason for the few seconds it takes…

Thanks for the tips, I’d missed the shared cache thing, and the read_uncommitted PRAGMA.

I see you are using a file-based db, so you don’t need them.

More problematic is the most costly operation, which is cleanup

To address the issue with the long cleanup query, you could try optimizing it or splitting it into several smaller queries, and slightly increasing the timeout and pool values.

    pool_size: 50,
    busy_timeout: 5_000,

I’d argue that for the exqlite driver this is indeed a reasonable choice.


Is there any more information about this ‘idle time’? I’ve also been struggling to interpret this number. I only have a small-scale deployment so I haven’t encountered any issues with SQLite, but it would be nice to understand these metrics a little better. What are (problematically) small values?

Is there any downside to drastically increasing the pool_size? Currently I use 5 (which I think is the default) – any downsides to going to 50? Soon I expect some small bursts of activity for my app, so it would be nice to fine-tune in advance so I don’t need to worry about it!

It initially had a connection checked out to run the three steps; now it doesn’t. I have also tried splitting the select from the delete (SELECT id ... and DELETE WHERE id IN ... as two separate DB operations), but it makes things worse ^^’ (selecting alone takes about as much time as the whole select+delete). Since two of the steps are based on time (“delete older than …”), maybe deleting, say, everything older than 30d and then everything older than 20d would help; I have to try that.
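That batched idea could look roughly like this (a sketch only: the events table name, field name, repo name, and cutoffs are assumptions, and type/2 is used because the query is schemaless):

```elixir
import Ecto.Query

# Delete in time-sliced batches instead of one big DELETE, so each
# statement holds the write lock for a shorter period.
for days <- [30, 25, 20] do
  cutoff = DateTime.add(DateTime.utc_now(), -days * 86_400, :second)

  from(e in "events", where: e.emitted_at < type(^cutoff, :utc_datetime))
  |> Foo.Repo.delete_all(timeout: 60_000)
end
```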

The pool is used both for reads and writes, right? Is there no way to have a pool of 2 procs for writes, and N for reads?
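(One pattern I've seen suggested for exactly this is two Ecto repos pointing at the same database file: a write repo with pool_size: 1 and a read-only repo with a bigger pool. A sketch, where the module names and path are illustrative:)

```elixir
defmodule Foo.Repo do
  # All writes go through this repo; a pool of 1 serializes them.
  use Ecto.Repo, otp_app: :foo, adapter: Ecto.Adapters.SQLite3
end

defmodule Foo.ReadRepo do
  # Reads only; Ecto's :read_only flag rejects accidental writes.
  use Ecto.Repo, otp_app: :foo, adapter: Ecto.Adapters.SQLite3, read_only: true
end

# config/config.exs
config :foo, Foo.Repo,
  database: "priv/data/sqlite.db",
  pool_size: 1

config :foo, Foo.ReadRepo,
  database: "priv/data/sqlite.db",
  pool_size: 20
```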

The busy timeout is another thing I don’t yet understand. Doesn’t it just make it wait for longer? What’s the advantage?

The DBConnection pool focuses on traditional databases where additional connections have real cost, so you want to find the sweet spot of just enough: connections that are well utilized but not overloaded. Hence :idle is an interesting metric to track. For SQLite, you can read the issue I posted above for all the reasons why this is not necessarily the best approach. While it was probably reasonable to seek Ecto integration via DBConnection, it would likely be better to have a distinct integration layer between Ecto and an SQLite driver – that’s just also the more-work option that nobody seems to have taken on, or maintained to a state where it would be an alternative.

Busy timeout is just a retry at the SQLite level, and a really basic one. There’s no queue or anything like that – just a plain wait (afaik the intervals are variable) and retry, for up to the busy-timeout limit.
