Ecto and CockroachDB TransactionRetryWithProtoRefreshError issue

I’m debugging a TransactionRetryWithProtoRefreshError that comes up from time to time in an app using CockroachDB. The errors are logged as something like the following (truncated):

ERROR 40001 (serialization_failure) restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key ...

One of the suggestions for dealing with this error is to have the database client retry the operation. Is there a way to make Ecto retry a query?

Another possible way to deal with this error is to modify the query so the selected data is flagged and Cockroach knows it’s going to be used as part of an update. Cockroach offers SELECT ... FOR UPDATE syntax, which translates to queries that look something like this:

SELECT * FROM foo
WHERE bar = 'x' ORDER BY updated_at
FOR UPDATE;

Previously, I had used the Ecto :hints option to pass other Cockroach-specific options like AS OF SYSTEM TIME to queries by overriding the Repo’s prepare_query/3 callback with something like this:

  # Copy any :hints passed in the Repo call options onto the query's FROM
  # clause, but only when the query doesn't already define hints of its own.
  def prepare_query(_operation, %{from: %{hints: []}} = query, opts) do
    case Keyword.get(opts, :hints, []) do
      [] -> {query, opts}
      hints -> {put_in(query.from.hints, hints), opts}
    end
  end

  # All other queries pass through untouched.
  def prepare_query(_operation, query, opts), do: {query, opts}

That worked pretty well so we could do things like the following:

Repo.all(MySchema, hints: ["AS OF SYSTEM TIME '-5m'"])

See related thread.

However, this doesn’t work when the query includes WHERE clauses. In the logs, I can see that the query isn’t being built the way Cockroach needs it. The generated query looks something like this:

SELECT c0."foo", c0."bar", ...  
FROM "my"."table" AS c0 FOR UPDATE WHERE (c0."foo" = $1)

but the FOR UPDATE bit needs to be at the end, e.g.

SELECT c0."foo", c0."bar", ...  
FROM "my"."table" AS c0 WHERE (c0."foo" = $1) FOR UPDATE

Granted, this is probably what I get for trying to leverage the :hints option for something it really wasn’t intended for. Is there any way to modify the existing queries short of writing raw SQL?

Relevant CockroachDB docs:

The existing code around Ecto.Query.lock produces SQL like that, but I don’t know if it works with Cockroach.
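
Something along these lines should put the clause at the end of the generated SQL (untested against Cockroach; MySchema and Repo are just the names from your earlier example):

import Ecto.Query

# lock/2 appends the given locking expression after the WHERE / ORDER BY
# clauses in the SQL that the adapter generates.
query =
  from s in MySchema,
    where: s.bar == "x",
    order_by: s.updated_at,
    lock: "FOR UPDATE"

Repo.all(query)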

Nice! Thanks Matt! That seems to work. I modified my Repo.prepare_query/3 to include something like this:

  # If a :lock option is passed to the Repo call, set it on the query struct
  # so the adapter emits the locking clause (e.g. FOR UPDATE) at the end.
  def prepare_query(_operation, query, opts) do
    query =
      case Keyword.get(opts, :lock) do
        nil -> query
        lock -> Map.put(query, :lock, lock)
      end

    {query, opts}
  end

and now when I pass the :lock option, I am seeing the proper queries generated and Cockroach is running them without complaint!
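
For reference, the call site now looks roughly like this (same placeholder schema as before):

import Ecto.Query

MySchema
|> where([s], s.bar == "x")
|> order_by([s], s.updated_at)
|> Repo.all(lock: "FOR UPDATE")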

The trickier thing to debug is whether or not this actually fixes this issue…

I have never used CockroachDB (though I have spent far too much time reading about it…), but at some point you are going to run into transactions that require retries. Because the database is distributed and relies on clock synchronization for single-key linearizability, if you perform reads/writes on the same key from different nodes within the clock-skew interval it has to abort, since it doesn’t know which operation came first.

In such a case you have to retry the transaction on the client, because SQL is conversational and there’s no way for the database to reissue the transaction on its own. I don’t think Ecto has any provisions for retries, but you should be able to wrap the transaction in a function and recursively call it if you get an error (perhaps with a counter to avoid looping forever), something like the sketch below.
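
A rough sketch of that idea (the RetryHelper module, function name, and attempt count are all made up here, not anything built into Ecto; the only library pieces are Repo.transaction/1 and the Postgrex.Error raised on SQLSTATE 40001):

defmodule MyApp.RetryHelper do
  # Hypothetical helper: run a transaction and retry it a few times when the
  # database reports a serialization failure (SQLSTATE 40001), which Postgrex
  # raises as a Postgrex.Error.
  def retry_transaction(repo, fun, attempts \\ 3) do
    repo.transaction(fun)
  rescue
    e in Postgrex.Error ->
      if e.postgres[:code] == :serialization_failure and attempts > 1 do
        retry_transaction(repo, fun, attempts - 1)
      else
        reraise e, __STACKTRACE__
      end
  end
end

# Usage:
#   MyApp.RetryHelper.retry_transaction(MyApp.Repo, fn -> ... end)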

It sounds like the problem you’re running into in this case, though, is simply a serialization error (i.e. the same you would get in Postgres if you used SERIALIZABLE), in which case taking a pessimistic lock on the row will just cause the transactions to block each other instead of aborting. So if you just SELECT FOR UPDATE and then UPDATE and COMMIT, you should be fine. If you’re taking multiple locks you have to be careful to take them in order or you’re going to get deadlock aborts instead, of course.
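
In Ecto terms that pattern would look roughly like this (schema and field names are placeholders, and I haven’t run this against Cockroach):

import Ecto.Query

# Lock the matching rows for the duration of the transaction, then update
# them; concurrent transactions on the same rows block on the lock rather
# than aborting with a serialization error.
Repo.transaction(fn ->
  rows =
    MySchema
    |> where([s], s.bar == "x")
    |> lock("FOR UPDATE")
    |> Repo.all()

  Enum.each(rows, fn row ->
    row
    |> Ecto.Changeset.change(%{foo: "new value"})
    |> Repo.update!()
  end)
end)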

Yeah, we are expecting to need to retry queries eventually. Our first round of cleanups, however, is to get our code better organized so we’re making fewer queries, etc.