ConnectionCache: a crazy Ecto connection pool

cjbottaro · June 27, 2016, 10:28pm

Hi. I’ve been asking a lot about how to implement a custom Ecto connection pool and a custom adapter to accommodate said pool. Here is the result of all that:

tl;dr: We have a multi-tenanted application that has too many databases spread across too many servers for normal connection pooling to work (even with pgBouncer). We need a “connection pool” that can create connections on demand and try to reuse ones that have been recently used, while evicting ones that haven’t been used in a while (like an LRU cache).

So it works fine…

# config.exs
config :my_app, MyRepo,
  adapter: Ecto.Adapters.PostgresWithoutCache,
  pool: Ecto.Pools.ConnectionCache,
  prepare: :unnamed,
  pool_size: 2,
  host: "db01.internal",
  database: "shard01"

# in iex or some such...

# Creates new connection to "default" database as specified by config.exs
MyRepo.get(...)

# Uses cached connection to "default" database
MyRepo.get(...)

# Creates a new connection to db02.internal/shard50
MyRepo.set_database(host: "db02.internal", database: "shard50")
MyRepo.get(...)

# Uses cached connection to "default" database
MyRepo.set_database(:default)
MyRepo.get(...)

# Creates new connection to db03.internal/shard99
# and closes/evicts the connection to db02.internal/shard50
# since our pool_size is set to 2 and it was least recently used.
MyRepo.get_database(host: "db03.internal", database: "shard99")

But… I’m new to Elixir and OTP. I have no idea if my code is any good or not…

Also, how Ecto connection pools work is very confusing to me. I figured it would be as easy as implementing the DBConnection.Pool behaviour, but it was not obvious from the documentation.

Ok, so checkout/2 is supposed to return {:ok, pool_ref, module, state}… but what does any of that mean? After many hours IE.pry and IO.inspect and reading the source for the poolboy based pool, I figured out that module and state should come from the return value of DBConnection.Connection.checkout/2.

And that’s the part that’s really throwing me off. I though my connection pool was supposed to manage connections to the database… but it looks like it’s managing other connection pools (each of size 1).

Also, these connections / connection pools are started outside a supervision tree; which that’s how the poolboy based pool works, I just copied it:

pid = DBConnection.Connection.start_link(module, connection_options)
# Then store the pid in my pool's GenServer state.

So everything generally works, but I have no idea if I’m doing things “right” or not. Nor do I really understand how Ecto expects connection pools to work:

What is a connection? It looks like a connection is it itself a pool. So a connection pool is pooling other smaller connection pools? Why?
What does Ecto expect checkout/2 to return? Both conceptually and also as in “what interface”?
Why aren’t these “connections” started in a supervision tree?

Anyway… just wanted to share my first kinda deep dive into Elixir code. I seriously doubt anyone will have a similar use case, but it would be super awesome to get any feedback.

Thanks!

fishcakez · June 30, 2016, 4:30pm

A DBConnection.Connection is a “pool” of 1 “socket”. This “socket” and its module are handed to the client process (process that calls checkout) so that it can use the socket directly instead of making calls to the connection process. This means that socket state is controlled by a single process. This prevents multiple processes getting into a different state, for example where the client believes it is inside a transaction but the connection process has closed the connection and reconnected.

When writing a DBConnection.Pool that uses DBConnection.Connection you want to write a pool of DBConnection.Connection processes and have the client checkin/checkout from DBConnection.Connection rather than the pool process. Otherwise the pool is blocked when checking out from a single connection. One example where blocking can occur is when the connection is first started. The start_link is async but then the process immediately blocks to connect and handshake. So if starting and immediately checking out this will slow down the pool. Especially if churning through different databases.

The method of setting the database leaks pids because the pid to database map is never cleaned up when a pid exits. This seems an ideal use case for the process dictionary. The database could be stored in the process dictionary of the client, and then fetched when doing the checkout and included as an option in the checkout options keyword.

The pool doesn’t track clients, so if a client exits abruptly (i.e. due to an exit signal) the DBConnection.Connection will never get checked in. You may want to copy the poolboy implementation more vigorously as there might be other cases where a client may fail to checkin, for example the sync_stop is not quite the same.

A DBConnection.Pool is expected to pass the integration test suite included with db_connection. I don’t think this pool will completely pass the suite because of the above reasons.

cjbottaro · July 3, 2016, 3:04am

Thank you so much for all that great info. I’ve given it another shot considering your advice and this is where I am now:

$ mix test.pools
==> Running tests for MIX_ENV=connection_cache mix test
Excluding tags: [:pool_overflow, :queue_tests, :queue_timeout_exit, :enqueue_disconnected, :dequeue_disconnected, :queue_timeout_raise]

..............................................................................

Finished in 1.6 seconds
87 tests, 0 failures, 9 skipped

Am I right in thinking those test should be skipped given the semantics of my “pool”? Which are:

Checkout will try to use an available (cached) connection, but if none is available, it will create a new connection, regardless of specified pool size.

.

Yup, am aware of that: Reap / prune ETS entries · Issue #1 · cjbottaro/ecto_pools_connection_cache · GitHub
Will take care of that after integration tests.

Thanks again!

fishcakez · July 3, 2016, 3:26pm

If there isn’t a queue I guess the queue tests aren’t applicable . I think the pool_overflow test might be applicable though because you allow the pool to grow above pool_size. I am not sure, you would want to provide equivalent or more tests if not.