Connection pool: simple description of how it works

casperkaandorp · February 11, 2021, 9:12pm

Hi people,

I have a couple of questions that have been bothering me for quite some time now, but I lack the expertise (and time) to find an understandable answer. I am not looking for a very detailed answer here. I just would like to understand how Ecto deals with stress in broad terms.

I have to deal with say 500 people who will use my Phoenix app concurrently in an experiment. I have written a similar application in Rails and it all went horribly wrong. I am happy to share the details if you want. I suspect that it went wrong because the state of a database table didn’t get updated fast enough. The problem could not be simulated with ‘simple’ tests, it only occurred during a high load.

So here’s what bothering me: the said 500 folks start hitting the database like they want to destroy it. I have a connection pool of 25 connections. Imagine the Phoenix app has to deal with 500 queries in a very short timeframe. The first 25 of these 500 get a connection and are being processed. A number of these 25 are update queries on several records in the users table. Now imagine one other fast query has been processed and now there is room for query 26 which is a select query on the users table. But not all update queries are finished.

My questions:

Do I understand correctly that a single queue is used in Ecto to deal with the 500 incoming queries?
Would query 26 get a ‘wrong’ state of the database (it came in as query 26, assuming all update queries were done, but they are still being processed).
Is there anything smart that I missing here in terms of procedure. Maybe query 26 is put on hold because other queries that deal with the same table are still processed…)

Let me know if you need more info, or if I didn’t described it well enough.

Thanks in advance.

Casper

al2o3cr · February 11, 2021, 9:38pm

This is something that can be accomplished with correct use of transactions - but the techniques are about generating the right SQL, so it’s not really a question of what language/framework you use.

casperkaandorp · February 11, 2021, 10:15pm

Thanks for the quick reply!

If I understand correctly transactions are just packages of query that are processed as a whole. Never worked with them. That sounds like I have to write extra code that acts as a referee, making sure I only put queries in a package that won’t cause issues. Feels complicated in an environment in which many things happen at once.

About the language/framework thing: I know you are right, but I hope you agree with me that especially Rails has put a big layer of software on top of the database. The Rails problem I have mentioned was probably caused with these callbacks you can invoke after an insert or update. I assumed they were processed first thing after the update. Anyways, my point is that it feels I am not dealing with the database directly, and that the framework has an impact on how I deal with the database.

My no1 question still stands: is there a single queue in Ecto that deals with all incoming queries.

thanks

Cspr

dimitarvp · February 11, 2021, 11:38pm

Depends what you mean by “single queue”. All requests are done in parallel so if not all updates are finished by the time a query is executed then that query will only see the updates that have finished before it. It will not see the updates that are still ongoing. Standard DB stuff.

The state is not wrong because the update queries haven’t finished yet. The query will only see the results of the finished updates. That’s not “wrong”, it just might not be what you want in this scenario. But the DB is working as intended.

It seems that you want that to be the case. If that’s true then you should reach for Ecto.Multi and use it to execute a certain batch of updates and finally run a query that “sees” the results of all of them. But this is an arbitrary high-level requirement that the databases are not going to handle for you.

Feel free to tell us more about your requirements. I am getting the vibe that you want to disallow queries until certain amount of updates are finished first, is that correct?

al2o3cr · February 11, 2021, 11:44pm

The problem was probably because you wrote software that concurrently modifies the database but don’t understand how to use transactions to maintain consistency.

For instance, let’s imagine a simple “survey” application. It has a table of questions, and counts the number of times users click on a particular response. It uses a simple schema:

defmodule DemoApp.Question do
  use Ecto.Schema
  import Ecto.Changeset

  schema "questions" do
    field(:body, :string)
    field(:votes, :integer)
  end

  def changeset(question, attrs) do
    question
    |> cast(attrs, [:body, :votes])
  end
end

This controller will fail under sufficient load:

defmodule DemoAppWeb.AnswerController do
  use DemoAppWeb, :controller

  alias DemoApp.{Repo, Question}

  def create(conn, %{"question_id" => question_id}) do
    question = Repo.get(Question, question_id)
    
    question
    |> Question.changeset(%{votes: question.votes + 1})
    |> Repo.update!()

    text(conn, "OK")
  end
end

Under high load, this will lose updates. The cause is the time-separation between reading the old value of votes and writing the incremented value; if two requests arrive at nearly the same time, they will both read the same row and both try to write the same value (losing one increment).

There are several approaches to resolve this; they all have tradeoffs that need to be considered carefully:

specific needs (like incrementing) can be handled at the database-level; for instance this article describes using UPDATE x = x + 1 ... operations in SQL, which avoids the lag between the reading and writing by combining them into a single database operation.
more generally, if colliding writes are infrequent, optimistic locking can prevent the lost update, at the cost of one operation needing to be retried. If there are lots of collisions, this gets expensive.
pessimistic locking via SQL’s SELECT ... FOR UPDATE is pretty close to “code that acts like a referee”; any other transaction that tries to also lock that row will block and wait. It’s implemented with Ecto.Query.lock, something like this:

  def create(conn, %{"question_id" => question_id}) do
    import Ecto.Query

    Repo.transaction(fn ->
      question =
        Question
        |> lock("FOR UPDATE")
        |> Repo.get(question_id)
    
      question
      |> Question.changeset(%{votes: question.votes + 1})
      |> Repo.update!()
    end)

    text(conn, "OK")
  end

This controller action will NOT fail under high load; the SQL inside Repo.get will block until other transactions that did SELECT ... FOR UPDATE have committed or rolled back before returning.

That’s also the downside of pessimistic locking - if process A locks row #1 and then wants to lock row #2, but process B holds the lock on row #2 but wants a lock on row #1: blammo, deadlock.

Ultimately, tools like Phoenix or Rails can help produce the SQL operations - but if the operations don’t have the correct behavior, they’ll just make it faster & easier to write wrong programs.

casperkaandorp · February 17, 2021, 8:34pm

Hi @dimitarvp and @al2o3cr, thank you for your responses. And my apologies for the late reply. Things were very very hectic around here. I have checked my Rails code and although I can’t associate it well with the example of @al2o3cr, a similar thing must have happened: straight after insertion of a user record I counted how many users were waiting by looking at their status. If I got 80 waiting people I could fill a network. I always assumed that only one person would trigger this situation, subject number 80, but apparently more people received a head-count of 80 at the same-ish time.

Anyhow, after the Rails debacle I went for Phoenix and did a couple of things that solved my earlier issues. The experiment isn’t database dependent anymore. I spin up a number of GenServers that solved many (possible) concurrency issues. The game is now played with data which is stored/read/written in ets-tables. I have another GenServer taking care of incoming people by using a queue. I have tried to make things nuclear proof and probably have taken too much precautions, but hey! it works well and is very fast.

As said, the primary database consists of ets-tables. But we need the data afterwards and I have chosen to use a regular database for permanent storage. Every mutation in the ets-table is also processed in the database in an async fashion: I add every mutation-query in a queue (in another GenServer) and take chunks of that queue and try to process them. If I find mutations on the same data-object in a chunk I remove the ‘later’ ones and put them back on the front of the queue. I fully realise that this is a but much, but it feels super safe. I hope you get the picture: hundreds of mutations are made in etc-tables in short timespans, and everything that was added/changed is also processed in a database, but in a much slower pace.

So what do you think? If I read your comments it feels like my paranoia isn’t that crazy. The way things work, with the connection pool, doesn’t guarantee me that the sequence in which things happened is processed in the db in that order. Using transactions -feels- like a lot of work and careful planning whereas now I can code without thinking about precise SQL and the connection Pool. Things can’t get any faster than an ets-table I think.

Listen guys, I wish I could be less vague. There is a chance that the platform that I use right now will grow into something bigger that more people will use to conduct experiments. I have taken an alternative route here (the ets tables), because I got scared of weird stuff happening in databases under a heavy load. I’d like to have a confirmation if it is all justified. If I am asked to write a very serious platform I don’t want to have a bad foundation. So what would you do? Think hundreds of people doing something in a network with each other, in real time, and you want to collect a lot of information… Good old database or stick with the ets tables?

Thanks for your patience,

Cspr

ityonemo · February 17, 2021, 8:49pm

That sounds like I have to write extra code that acts as a referee, making sure I only put queries in a package that won’t cause issues.

You don’t have to write extra code; Ecto “basically does this for you”. If you must fuse two operations and make them atomic, you can use Ecto.Multi. In this case, each checkout is is tied to the process that called it for the lifetime of the Multi, and everything will be bundled into a single transaction. If another process simultaneously attempts to checkout using Ecto.Multi, then those will be bundled into a different transaction.

If you are using Postgres, there is support for upserts, too, which will be more performant.

Finally, if what you are doing gets really crazy, you may want to consider CQRS, which separates “write” events from “read” events, and during the projection phase, you can spray your writes onto ETS table on all of your nodes as an “adaptive cache” of sorts. I wish there were a generic (ZFS-like) ARC cache for erlang/elixir for ETS (since cache invalidation is a “hard problem”)

dimitarvp · February 17, 2021, 9:07pm

Sounds like you did a really good job!

The only thing I would improve is to try and squash a big queue of writes into a smaller amount of DB updates, f.ex. if User#123 firsts adds data to their state but later on in the queue they remove it then both queued events cancel each other out and don’t need to hit the DB at all.

But that’s a job for a CQRS library (a la commanded) but that requires you to bend your code and scenario to a very particular usage pattern which, let’s be frank, is one of CQRS’s biggest weaknesses and I feel makes it much more niche than it could be if it imposed less restrictions. A discussion for another time.

Another thing I’d look at is an OLAP database like DuckDB but not sure if it’s easy to make Erlang/Elixir bindings for it.

Outside of any possible nitpicks however, you did this excellently.

al2o3cr · February 17, 2021, 11:42pm

The reordering sounds similar to what a database engine does with locking (and rollbacks) to preserve serializability when using transactions.

It’s not going to be easy to write good tests to verify that this hand-built solution is correct without understanding the theory.

Upside of GenServers: they can help tame concurrency problems, because they control access to data with a single process
Downside of GenServers: they can become a bottleneck, because they control access to data with a single process

If you have GenServers that interact in cyclic patterns (A messages B, B messages C, C messages A) it’s also easy to produce a deadlock that stops all the processes involved until somebody times out.

minhajuddin · February 18, 2021, 2:37am

I wouldn’t give up on postgres so easily. It can handle crazy loads. Also, in the first thread, I think you are confusing the ACID guarantees of a relational database with something that ecto/ActiveRecord control. Every time you run a SQL query on your database it is a transaction, you can run multiple queries in one transaction too (Ecto, allows this by using the Repo.transaction and Ecto.Multi). Activerecord has transaction support too:

ActiveRecord::Base.transaction do
  david.withdrawal(100)
  mary.deposit(100)
end

When it comes to database transactions, changes that haven’t been committed aren’t visible to other transactions and only become visible once they are committed.

---T1
BEGIN;
INSERT INTO users(id, email, login_count) (1, 'dan', 10);
-- T2
--- T3
COMMIT;

Any database transaction at T1 or T2 does not see the new record user#1 in the database till the transaction is committed which happens at T3. These are database level guarantees.

benwilson512 · February 18, 2021, 4:31am

While true, it’s critical to understand that Postgres’s default transaction level is merely READ COMMITTED, which means the following scenario is still possible, if you only use transactions:

User1                             User2

account = Repo.get(Account, 1)
account.balance #=> 10

                                  account = Repo.get(Account, 1)
                                  account.balance #=> 10
                                  account
                                  |> Account.update(account.balance + 10)
                                  |> Repo.update!

                                  COMMIT

account
|> Account.update(account.balance + 10)
|> Repo.update!
COMMIT

Both users concurrently read the value, and the one commits before the other. No uncommitted values are read. However, read / modify / write still results in a lost value. It is critical to use some form of locking, or upgrade the transaction isolation level if using a read / modify / write pattern where “last write wins” isn’t an acceptable concurrency resolution mechanism.