Lazy connection pooling for Ecto 2.0?

I’ve been browsing Ecto documentation and some of the source code, but haven’t found any obvious way to configure a connection pool for ‘lazy’ checkouts.

What I’d like is for Ecto not to start up database connections until they’re needed, and to keep them open only for a defined period after they’re checked in again (e.g. 1-5 minutes), so that we don’t push the connection limits on our (shared) PostgreSQL instance too far.

Does anyone know if this is possible to do with either the Poolboy or Sojourn broker in Ecto 2.0?

4 Likes

After digging into the source code for db_connection, it seems as if there isn’t a straightforward way to do lazy pooling; both Poolboy and Sojourn initiate the connections more or less immediately at launch.

One might be able to modify DBConnection.Connection so that it doesn’t actually connect until checkout is called, and then have checkin set a timer that disconnects after, say, a few minutes. However, you’d also want any new checkouts from the pool to prefer connections that are already open, and that might be trickier. Also, Sojourn doesn’t actually seem to call checkout in DBConnection.Connection.
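
To make the idea concrete, here’s a toy, self-contained sketch of the lazy-connect / idle-disconnect behaviour I have in mind. It is not DBConnection code; connect/0, disconnect/1 and the 5-minute timeout are just placeholders.

defmodule LazyWorker do
  use GenServer

  # Placeholder idle period; a real pool would make this configurable.
  @idle_timeout :timer.minutes(5)

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)
  def checkout(pid), do: GenServer.call(pid, :checkout)
  def checkin(pid), do: GenServer.cast(pid, :checkin)

  def init(:ok), do: {:ok, %{conn: nil, timer: nil}}

  # First checkout: connect lazily here instead of at pool startup.
  def handle_call(:checkout, _from, %{conn: nil} = s) do
    conn = connect()
    {:reply, {:ok, conn}, %{s | conn: conn, timer: cancel(s.timer)}}
  end

  # Already connected: reuse the open connection and cancel any pending idle timer.
  def handle_call(:checkout, _from, %{conn: conn} = s) do
    {:reply, {:ok, conn}, %{s | timer: cancel(s.timer)}}
  end

  # Checkin starts the idle countdown.
  def handle_cast(:checkin, s) do
    {:noreply, %{s | timer: Process.send_after(self(), :idle_timeout, @idle_timeout)}}
  end

  # Idle for too long: close the connection and return to the "lazy" state.
  def handle_info(:idle_timeout, s) do
    if s.conn, do: disconnect(s.conn)
    {:noreply, %{s | conn: nil, timer: nil}}
  end

  # Placeholders for real database connect/disconnect.
  defp connect, do: make_ref()
  defp disconnect(_conn), do: :ok

  defp cancel(nil), do: nil

  defp cancel(ref) do
    Process.cancel_timer(ref)
    nil
  end
end

The tricky parts noted above (making the pool prefer already-open connections, and integrating with the broker) are exactly what this sketch glosses over.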

As for Poolboy, it doesn’t seem possible to grow (or shrink) the pool after it’s already started.

Sojourn appears to be more flexible, maintaining dynamic queues with timeouts for both available connections / workers and client requests - it just isn’t used in a way that allows lazy connections at the moment.

@josevalim, do you think I’m missing something or is the above summary accurate, more or less?

5 Likes

That sounds accurate but let’s copy @fishcakez to be double sure. :slight_smile:

3 Likes

Heh… I’ve been digging a bit deeper to try and understand how I could make this work with Sojourn. It turns out the full Ecto.Repo + DBConnection startup logic is a deep rabbit hole…

However, what I got out of it (for anyone curious about this) was basically that DBConnection.Sojourn.Starter will fire up the requested number of DBConnection.Connection processes. These will in turn connect to the database, and if queue = :broker (which it is for Sojourn), each connection will make an asynchronous call to itself in order to register with the Sojourn broker and wait for a matching client request.

Upon idle timeout in Sojourn, the connection is removed from the queue, and a drop message is sent to DBConnection.Connection; this is handled by pinging the database connection and, if all is OK, re-registering with the Sojourn broker.

4 Likes

One approach I could imagine would be something like the following:

  • Make DBConnection.Connection.handle_broker({:drop, _}, s) actually close the db connection instead of re-queueing
    • This probably means we’d want a longer timeout in Sojourn, maybe 5-60 minutes
  • Change DBConnection.Sojourn.checkout/2 to try spinning up a new connection if we’re still below the limit
    • It would need to know 1) the current queue size, and 2) the pool max size
    • …or maybe delegate this to DBConnection.Sojourn.Starter, keeping it alive and exposing an interface for this

I’m sure it’s naive in one or more ways, but in my defense, part of my mind is still lost in a maze of twisty little passages, all alike, from the above-mentioned deep dive :wink:

2 Likes

DBConnection.Poolboy has the :pool_overflow option (sorry if this is not documented) that gets renamed to :max_overflow before going to poolboy. This allows the pool to grow by :pool_overflow above :pool_size when client processes are queued waiting for a worker. However, poolboy will close overflow workers once the client queue is empty. This can lead to churning through connections, because a worker can get checked in, stopped because the pool is in overflow with an empty client queue, and then restarted when a client checkout request arrives.
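
For reference, a sketch of where these options go in a repo config; the app and repo names are placeholders and the values are only illustrative:

config :my_app, MyApp.Repo,
  adapter: Ecto.Adapters.Postgres,
  pool: DBConnection.Poolboy,
  pool_size: 10,      # base number of connections
  pool_overflow: 5    # extra workers allowed while clients are queued waiting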

DBConnection.Sojourn calls checkout immediately and then never checks back in; effectively the broker “owns” the socket rather than the connection process, because the queue lives in the broker. It would require additional complexity and message passing to check out every time.

Lazy connections in Ecto 1.* had a weak spot. If the database went down or there was a network issue, then poolboy’s queue would fill up because every client process that checked out a worker process would wait for the connection attempt to time out (perhaps 3 seconds). This could result in other client processes having to wait in poolboy’s queue because there aren’t enough workers to block for 3 seconds on each checkout. Very quickly the poolboy queue could grow such that every checkout request spends approximately the pool timeout (perhaps 5 seconds) in the pool queue, because either the checkout times out after 5 seconds or it gets a worker because the process in front of it timed out. This means we have a situation where everything takes 5 or 8 (5 + 3) seconds to fail.

However, in Ecto 2.0 we try to connect at intervals, independent of client checkout requests, when we cannot connect to the database. If a checkout is attempted on a disconnected process then the checkout fails immediately, so we can quickly empty the poolboy queue when requests are going to fail (because there is no connection to the database).
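
Those reconnection intervals are tunable, assuming DBConnection’s backoff options are passed through the repo config as usual; the values below are only illustrative:

config :my_app, MyApp.Repo,
  backoff_type: :rand_exp,  # randomised exponential backoff between connection attempts
  backoff_min: 1_000,       # shortest interval between attempts, in ms
  backoff_max: 30_000       # longest interval between attempts, in ms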

Unfortunately poolboy doesn’t know if a worker process is connected or not, so it cannot favour connected processes. With the Ecto 1.* lazy strategy a worker will try to connect if it gets checked out. This works best with a LIFO worker queue, so that the most recently used worker is checked out first, which limits the number of connected workers. With a FIFO worker queue the least recently used worker is checked out first, which means all workers are connected after :pool_size checkouts.
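
A toy illustration of that difference (not poolboy code): checking in puts a worker at the head of the list, LIFO hands out the head, FIFO hands out the tail.

defmodule WorkerQueueDemo do
  # Checking a worker in puts it at the head of the list.
  def checkin(workers, w), do: [w | workers]

  # LIFO: hand out the most recently checked-in worker.
  def checkout([w | rest], :lifo), do: {w, rest}

  # FIFO: hand out the least recently checked-in worker.
  def checkout(workers, :fifo), do: List.pop_at(workers, -1)
end

Starting from [:a, :b, :c] with :a the most recently used, repeated checkout/checkin under LIFO keeps handing out :a, while FIFO cycles through :c, :b and :a, so every worker ends up being used (and, with the lazy strategy, connected).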

With the Ecto 2.0 strategy, if we use a LIFO queue, a disconnected process could be at the front of the worker queue, get checked out, immediately return an error, get checked in to the front of the queue, and get checked out again only to error immediately and return to the front of the queue. It’s possible that other workers are connected but they can’t make it to the front of the worker queue because load isn’t high enough. Hopefully the disconnected worker would get connected eventually, but this could extend downtime to the max reconnection interval (perhaps 30 seconds).

With sbroker in Ecto 1.* the lazy option was only used for the first checkout per worker; after that the Ecto 2.0 pattern was used. The primary goal of sbroker is to “queue well”, which is best taken advantage of by only queueing connected workers. If a lazy connection is checked out, the client has to wait for the connection to be established. However, it is very likely that if the client had waited a little longer in the queue, an already connected worker would have become available before the lazy connection could connect and handshake. The DBConnection.Sojourn pool can also be configured to use more advanced queueing techniques that avoid the 5-second-long queue described above.

sbroker master has a job-regulating process that is designed for resizing a worker pool using sbroker, but DBConnection.Sojourn uses an older version (the most recent release). The resizing strategies try to guess when a worker is required and then allow it to connect; once the worker is connected it gets queued. This means that in the situation in the previous paragraph the client waits in the queue and may get an already connected worker, or the newly connected worker it would have got with a lazy connection – the ideal situation.

The regulator guesses when to allow another worker to connect by sampling the time a worker waits for a client in the broker’s queue. When the workers aren’t waiting very long for a client, the regulator allows workers to connect until the worker queue is slow. If the worker queue is slow, the client queue should be fast. Sampling the workers’ wait time is used because we don’t want to allow more workers to connect when no workers can connect. If workers are naively spawned whenever a client is waiting for a worker, we can end up in a situation where the pool grows to its maximum while all requests are failing, and then immediately shrinks once connections are established. Or the situation above, where a worker connects, handles a single request and is then closed on checkin.

With sbroker master the pool can shrink when a worker is dropped from the broker’s worker queue. The worker asks the regulator what it should do: either disconnect and wait, or ping and re-queue. With this combination of growing and shrinking, it turns out that the pool tends to grow slowly and shrink quickly under bursty loads, compared to lazy connections, which grow quickly and may shrink slowly (when using worker-level idle time) or quickly (when using poolboy’s no-clients-waiting approach). However, it is possible for the regulator to start growing the pool before the worker queue is empty, so that it starts growing earlier but increases more gradually.

Unfortunately, doing pool resizing well doubles the code base of sbroker and increases the complexity of both sbroker and projects using it. There is also an issue with churn when under-loaded. However, I think I know the fix; I didn’t implement 100% of the paper because it wasn’t needed previously. For DBConnection I am not sure if it’s worthwhile, because if all nodes reach the maximum size the database is still in trouble. Fortunately, Postgres has PgBouncer, which can proxy connections from many machines to a single Postgres server without ever overloading the database. We even go to some trouble in postgrex to support Ecto 2.0 cached queries when using PgBouncer.
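
For example, if the shared instance’s connection limit is the real constraint, the repo can simply point at a PgBouncer in front of Postgres. A minimal sketch, with a hypothetical hostname (6432 is just PgBouncer’s conventional listen port):

config :my_app, MyApp.Repo,
  adapter: Ecto.Adapters.Postgres,
  hostname: "pgbouncer.example.internal",  # hypothetical PgBouncer host
  port: 6432,
  pool_size: 10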

10 Likes

Thanks for the thorough explanation!

I guess what it comes down to for us is that the standard way of doing connection pooling in our Java apps is apparently to keep idle connections open only for a short while (up to an hour or so). So we monitor long-lived idle connections to check that the applications are healthy and behaving correctly. Obviously, the Phoenix web app we’ve now deployed kind of breaks that part of the monitoring for the database instance we’re using :wink:

Since the DBAs now know about it, they’ll take it into account, but it’s making it harder for them to spot problems with other apps on the same db instance.

Ideally I’d like to be able to configure things like a minimum size (to always keep connected, optionally 0), a maximum size (like the current pool_size; the pool can’t grow beyond this), and min / max idle connections (to set the boundaries for when the pool should start up new connections or close them down), or some other similar set of options to achieve behavior similar to our other apps.
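
Roughly something like the following; these option names are purely hypothetical (nothing like them exists in Ecto 2.0), this is just the shape of configuration I have in mind:

config :my_app, MyApp.Repo,
  pool_min_size: 0,           # hypothetical: connections to always keep open
  pool_max_size: 20,          # hypothetical: hard ceiling, like today's :pool_size
  pool_min_idle: 1,           # hypothetical: open new connections when idle count drops below this
  pool_max_idle: 5,           # hypothetical: close connections when idle count rises above this
  pool_idle_timeout: 300_000  # hypothetical: ms an idle connection may live before being closed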

What you describe above, with sbroker master + the regulator, sounds like it would work even more intelligently, avoiding the need to set explicit min / max idle connections (since it would sample usage and determine on its own what was needed). Just as you say, it’s not ideal if the client has to wait for a connection to be established, so either of those two options (manual configuration of min / max idle connections, or a self-regulating approach) would definitely be a lot better than a more naive approach.

As an initial measure, I minimized our pool size, setting it to 5 for now, and allowed up to 15 overflow connections. As you mention, if we actually start using the overflow connections it’ll mean a lot of churn and unnecessary delays, but so far this app rarely sees more than 1-2 users at a time, so it should be fine.

3 Likes

I will investigate a new release of sbroker today and see if we can use it in DBConnection and Ecto 2.0.

4 Likes

There is a PR for adding support for dynamic resizing of the pool: https://github.com/fishcakez/db_connection/pull/41

4 Likes

I saw that it was recently merged, awesome!

Is there any way I can try it out in my existing project?

1 Like

@jwarlander I did this very quickly for you so I am sorry if it doesn’t work: https://github.com/fishcakez/ecto/tree/jf-db_conn_1_0. You will need to make sure your deps match the lock file there.

3 Likes

Thanks! I’ve just tried it out, but I end up with an error during application startup:

Consolidated IEx.Info
Consolidated Inspect
[info] Running Wolf.Endpoint with Cowboy using http on port 4000
[error] ** :sbroker_meter :sregulator_underload_meter failed to install.
** Was installing in {DBConnection.Sojourn.Regulator, #PID<0.4147.0>}
** When arguments == {500, 5000, {:underload, Wolf.Repo.Pool}}
** Reason == {:badarg,
 [{:gen_event, :send, 2, [file: 'gen_event.erl', line: 232]},
  {:sregulator_underload_meter, :init, 2,
   [file: 'src/sregulator_underload_meter.erl', line: 68]},
  {:sbroker_handlers, :meters_init, 5,
   [file: 'src/sbroker_handlers.erl', line: 248]},
  {:sbroker_handlers, :meters_init, 4,
   [file: 'src/sbroker_handlers.erl', line: 238]},
  {:sregulator, :do_init, 6, [file: 'src/sregulator.erl', line: 639]},
  {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]}

[info] Application wolf exited: Wolf.start(:normal, []) returned an error: shutdown: failed to start child: Wolf.Repo
    ** (EXIT) shutdown: failed to start child: :sbroker
        ** (EXIT) an exception was raised:
            ** (MatchError) no match of right hand side value: {:error, {:shutdown, {:failed_to_start_child, :regulator, {:badarg, [{:gen_event, :send, 2, [file: 'gen_event.erl', line: 232]}, {:sregulator_underload_meter, :init, 2, [file: 'src/sregulator_underload_meter.erl', line: 68]}, {:sbroker_handlers, :meters_init, 5, [file: 'src/sbroker_handlers.erl', line: 248]}, {:sbroker_handlers, :meters_init, 4, [file: 'src/sbroker_handlers.erl', line: 238]}, {:sregulator, :do_init, 6, [file: 'src/sregulator.erl', line: 639]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]}}}}
                (db_connection) lib/db_connection/sojourn/broker.ex:92: DBConnection.Sojourn.Broker.start_pool/2
                (db_connection) lib/db_connection/sojourn/broker.ex:74: DBConnection.Sojourn.Broker.init/1
                src/sbroker.erl:632: :sbroker.init_it/6
                (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
[info] Application eldap exited: :stopped
[info] Application comeonin exited: :stopped
[info] Application postgrex exited: :stopped
[info] Application db_connection exited: :stopped
[info] Application connection exited: :stopped
[info] Application phoenix_ecto exited: :stopped
[info] Application ecto exited: :stopped
[info] Application poolboy exited: :stopped
[info] Application decimal exited: :stopped
[info] Application gettext exited: :stopped
[info] Application cowboy exited: :stopped
[info] Application cowlib exited: :stopped
[info] Application ranch exited: :stopped
[info] Application phoenix_html exited: :stopped

=INFO REPORT==== 11-Jun-2016::21:37:00 ===
    application: logger
    exited: stopped
    type: temporary
** (Mix) Could not start application wolf: Wolf.start(:normal, []) returned an error: shutdown: failed to start child: Wolf.Repo
    ** (EXIT) shutdown: failed to start child: :sbroker
        ** (EXIT) an exception was raised:
            ** (MatchError) no match of right hand side value: {:error, {:shutdown, {:failed_to_start_child, :regulator, {:badarg, [{:gen_event, :send, 2, [file: 'gen_event.erl', line: 232]}, {:sregulator_underload_meter, :init, 2, [file: 'src/sregulator_underload_meter.erl', line: 68]}, {:sbroker_handlers, :meters_init, 5, [file: 'src/sbroker_handlers.erl', line: 248]}, {:sbroker_handlers, :meters_init, 4, [file: 'src/sbroker_handlers.erl', line: 238]}, {:sregulator, :do_init, 6, [file: 'src/sregulator.erl', line: 639]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]}}}}
                (db_connection) lib/db_connection/sojourn/broker.ex:92: DBConnection.Sojourn.Broker.start_pool/2
                (db_connection) lib/db_connection/sojourn/broker.ex:74: DBConnection.Sojourn.Broker.init/1
                src/sbroker.erl:632: :sbroker.init_it/6
                (stdlib) proc_lib.erl:240: :proc_lib.init_p_do_apply/3

My changes were basically:

diff --git a/config/dev.exs b/config/dev.exs
index a35dc43..5081046 100644
--- a/config/dev.exs
+++ b/config/dev.exs
@@ -39,4 +39,5 @@ config :wolf, Wolf.Repo,
   password: "REDACTED",
   database: "wolf_dev",
   hostname: "localhost",
+  pool: DBConnection.Sojourn,
   pool_size: 10
diff --git a/mix.exs b/mix.exs
index 5dc5327..09efcbc 100644
--- a/mix.exs
+++ b/mix.exs
@@ -42,7 +42,7 @@ defmodule Wolf.Mixfile do
   # Type `mix help deps` for examples and options.
   defp deps do
     [{:phoenix, "~> 1.1.4"},
-     {:postgrex, ">= 0.11.1"},
+      #{:postgrex, ">= 0.11.1"},
      {:phoenix_ecto, "~> 3.0.0-rc"},
      {:phoenix_html, "~> 2.4"},
      {:phoenix_live_reload, "~> 1.0", only: :dev},
@@ -50,6 +50,12 @@ defmodule Wolf.Mixfile do
      {:cowboy, "~> 1.0"},
      {:comeonin, "~> 2.0"},
      {:exrm, "~> 1.0"},
+     # lazy pooling
+     {:ecto, github: "fishcakez/ecto", branch: "jf-db_conn_1_0", override: true},
+     {:db_connection, github: "fishcakez/db_connection", override: true},
+     {:postgrex, github: "fishcakez/postgrex", branch: "jf-db_conn-1_0", override: true},
+     {:sbroker, github: "fishcakez/sbroker", tag: "v1.0.0-beta.1", override: true},
+     # test / dev
      {:wallaby, "~> 0.4", only: :test},
      {:excoveralls, "~> 0.4", only: :test},
      {:credo, "~> 0.3", only: [:dev, :test]},

Haven’t had time to dig deeper (on vacation right now)… It’s not at all unlikely that I’m using an older version of something else that’s causing a mismatch in expectations.

1 Like

@jwarlander ecto master uses newer versions; I suggest you use that and add :sbroker to your mix.exs :applications list.
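
A minimal sketch of that mix.exs change; :sbroker is the only addition, and the other entries are just examples of what the project might already declare:

def application do
  [mod: {Wolf, []},
   applications: [:phoenix, :phoenix_ecto, :phoenix_html, :cowboy, :logger,
                  :gettext, :postgrex, :sbroker]]  # :sbroker added here
end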

1 Like

It’s been almost a year. Can someone tell me the current status of lazy pooling when using Ecto + Postgrex?

Is DBConnection.Sojourn still the way to go? In one of the previous responses it is mentioned that DBConnection.Poolboy has the :pool_overflow option while DBConnection.Sojourn does not, so I’m a bit confused.

2 Likes

It would be really useful to allow connections to stay alive for 30-60 minutes after use and then eventually drop to zero when used with Aurora Serverless, as that would allow the database to automatically pause and save money in dev/test environments.

If it’s helpful at all, c3p0 is a popular Java library for connection pooling. It has some interesting properties which seem to work quite well, and it seems like, by being very explicit with the configuration, we could achieve the flexibility we’re looking for (roughly sketched in Elixir terms below). Here are some of the notable configuration settings:

initialPoolSize - Number of Connections a pool will try to acquire upon startup.

idleConnectionTestPeriod - If this is a number greater than 0, c3p0 will test all idle, pooled but unchecked-out connections, every this number of seconds.

maxPoolSize - Maximum number of Connections a pool will maintain at any given time.

maxIdleTime - Seconds a Connection can remain pooled but unused before being discarded. Zero means idle connections never expire.

maxIdleTimeExcessConnections - Number of seconds that Connections in excess of minPoolSize should be permitted to remain idle in the pool before being culled. Intended for applications that wish to aggressively minimize the number of open Connections, shrinking the pool back towards minPoolSize if, following a spike, the load level diminishes and Connections acquired are no longer needed.

minPoolSize - Minimum number of Connections a pool will maintain at any given time.

c3p0 config docs: https://www.mchange.com/projects/c3p0/#configuration_properties
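
Translated into Elixir terms, the equivalent might look something like this; these option names are entirely hypothetical and don’t exist in Ecto or DBConnection today, they simply map the c3p0 properties above:

config :my_app, MyApp.Repo,
  initial_pool_size: 0,                   # hypothetical: like initialPoolSize
  min_pool_size: 0,                       # hypothetical: like minPoolSize
  max_pool_size: 20,                      # hypothetical: like maxPoolSize
  idle_connection_test_period: 300,       # hypothetical: seconds, like idleConnectionTestPeriod
  max_idle_time: 1_800,                   # hypothetical: seconds, like maxIdleTime
  max_idle_time_excess_connections: 60    # hypothetical: seconds, like maxIdleTimeExcessConnections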

3 Likes