Objectives
I am trying to parallelize a little bit of database workload with Elixir and Ecto. There is no data inconsistency expected, as each SQL query is expected to read the database and then write its own unique result. In other words, the queries do not depend on each other.
It looks like trying to run multiple independent connections with a properly configured connection pool might help. In fact, it might be more efficient than using PosgreSQL
’s parallel query
. Therefore, I tried to implement the following approach: it works better if you queue work
My objective is to use multiple connections to PostgreSQL so that it can run several queries simultaneously.
Background
I have implemented a very simple script that for N amount of independent queries runs an Enum.each
that makes a query against the database.
Then I upgraded it so that for N amount of independent queries it splits them in batches of size M and runs Task.async for each batch.
def parallel_split(times) do
if times > @split do
Logger.info "Splitting"
1..@split
|> Enum.map(fn _x -> Task.async(fn -> run_postgresql_query() end) end)
|> Enum.map(&Task.await/1)
parallel_split(times - @split)
else
1..times
|> Enum.map(fn _x -> Task.async(fn -> run_postgresql_query() end) end)
|> Enum.map(&Task.await/1)
end
end
The Task.async
version seemed to run faster, because apparently many threads are used to issue and queue queries (Elixir/Ecto) and many threads are used by PostgreSQL
to run the queries. Here is how it looks when running:
Let’s say it takes around 84 seconds to process 50_000 queries.
Problem
However, when I running my script with mix run --no-halt
I noticed that htop
showed 8 busy threads for some time, but then only 1 thread stayed busy (around 90% of one CPU core load) for some time.
In fact, while running, the script uses all 8 cores for about 25 seconds and then the rest of the time it uses only 1.
Let us call this “Issue #02”.
I was curious to investigate, because it appeared to me that for some reason after correctly firing async
tasks against the Ecto’s queueing mechanism and getting the results back my script got stuck apparently waiting for some response from the database.
I enabled telemetry for Ecto, but looking at query/queue time for each individual query did not help much.
Observer
Then I figured that I could see the queue table with Observer
.
I use :observer.start
in the same iex
session before starting my main script.
Before the script starts, the Elixir.DBConnection.ConnectionPool.Queue
table has 16 records, which matches the pool_size
in the Repo
config of my app.
After the script starts, the Elixir.DBConnection.ConnectionPool.Queue
gets around M records (the batch size) which eventually go away (I guess this is the actual queuing process in action).
All is as expected, but the running time for the same 50_000 queries gets down to just 24 seconds!
I double-checked and ran the script again with mix
and without Observer
- same slow 84 seconds again!
And, I double-checked running it with iex
and Observer
- that “only one CPU thread working” issue (Issue #02) does not happen when using iex
and Observer
.
Question
Why a function that runs many database queries with a separate Task.async
call for each query, runs faster when launched from iex
and observed with observer
?