Ecto and CockroachDB TransactionRetryWithProtoRefreshError issue

I’m debugging a TransactionRetryWithProtoRefreshError that comes up from time to time in an app using CockroachDB. The errors are logged as something like the following (truncated):

ERROR 40001 (serialization_failure) restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key ...

One of the suggestions for dealing with this error is to have the database client retry the operation. Is there a way to make Ecto retry a query?

Another possible way to deal with this error is to modify the query so the selected data is flagged and Cockroach knows it’s going to be used as part of an update. Cockroach offers SELECT ... FOR UPDATE syntax, which translates to queries that look something like this:

SELECT * FROM foo
WHERE bar = 'x' ORDER BY updated_at
FOR UPDATE;

Previously, I had used the Ecto :hints option to pass other Cockroach-specific options like AS OF SYSTEM TIME to queries by overriding the Repo’s prepare_query/3 callback with something like this:

  # Copy any :hints passed in the Repo call options onto the query's FROM
  # clause, but only when the query doesn't already define hints of its own.
  def prepare_query(_operation, %{from: %{hints: []}} = query, opts) do
    case Keyword.get(opts, :hints, []) do
      [] -> {query, opts}
      hints -> {put_in(query.from.hints, hints), opts}
    end
  end

  # All other queries pass through untouched.
  def prepare_query(_operation, query, opts), do: {query, opts}

That worked pretty well so we could do things like the following:

Repo.all(MySchema, hints: ["AS OF SYSTEM TIME '-5m'"])

See related thread.

However, this doesn’t work when the query includes WHERE clauses. In the logs, I can see that the query isn’t being built the way Cockroach needs it. The generated query looks something like this:

SELECT c0."foo", c0."bar", ...  
FROM "my"."table" AS c0 FOR UPDATE WHERE (c0."foo" = $1)

but the FOR UPDATE bit needs to be at the end, e.g.

SELECT c0."foo", c0."bar", ...  
FROM "my"."table" AS c0 WHERE (c0."foo" = $1) FOR UPDATE

Granted, this is probably what I get for trying to leverage the :hints option for something it really wasn’t intended for. Is there any way to modify the existing queries short of writing raw SQL?

Relevant CockroachDB docs:

The existing code around Ecto.Query.lock produces SQL like that, but I don’t know if it works with Cockroach.
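
Something along these lines should put the clause at the end of the generated SQL (untested against Cockroach; MySchema and Repo are just the names from your earlier example):

import Ecto.Query

# lock/2 appends the given locking expression after the WHERE / ORDER BY
# clauses in the SQL that the adapter generates.
query =
  from s in MySchema,
    where: s.bar == "x",
    order_by: s.updated_at,
    lock: "FOR UPDATE"

Repo.all(query)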

Nice! Thanks Matt! That seems to work. I modified my Repo.prepare_query/3 to include something like this:

  # If a :lock option is passed to the Repo call, set it on the query struct
  # so the adapter emits the locking clause (e.g. FOR UPDATE) at the end.
  def prepare_query(_operation, query, opts) do
    query =
      case Keyword.get(opts, :lock) do
        nil -> query
        lock -> Map.put(query, :lock, lock)
      end

    {query, opts}
  end

and now when I pass the :lock option, I am seeing the proper queries generated and Cockroach is running them without complaint!
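
For reference, the call site now looks roughly like this (same placeholder schema as before):

import Ecto.Query

MySchema
|> where([s], s.bar == "x")
|> order_by([s], s.updated_at)
|> Repo.all(lock: "FOR UPDATE")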

The trickier thing to debug is whether or not this actually fixes this issue…

I have never used CockroachDB (though I have spent far too much time reading about it…), but at some point you are going to run into transactions that require retries. Because the database is distributed and relies on clock synchronization for single-key linearizability, if you perform reads/writes on the same key from different nodes within the clock-skew interval it has to abort, since it doesn’t know which operation came first.

In such a case you have to retry the transaction on the client, because SQL is conversational and there’s no way for the database to reissue the transaction on its own. I don’t think Ecto has any provisions for retries, but you should be able to wrap the transaction in a function and recursively call it if you get an error (perhaps with a counter to avoid looping forever), something like the sketch below.
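
A rough sketch of that idea (the RetryHelper module, function name, and attempt count are all made up here, not anything built into Ecto; the only library pieces are Repo.transaction/1 and the Postgrex.Error raised on SQLSTATE 40001):

defmodule MyApp.RetryHelper do
  # Hypothetical helper: run a transaction and retry it a few times when the
  # database reports a serialization failure (SQLSTATE 40001), which Postgrex
  # raises as a Postgrex.Error.
  def retry_transaction(repo, fun, attempts \\ 3) do
    repo.transaction(fun)
  rescue
    e in Postgrex.Error ->
      if e.postgres[:code] == :serialization_failure and attempts > 1 do
        retry_transaction(repo, fun, attempts - 1)
      else
        reraise e, __STACKTRACE__
      end
  end
end

# Usage:
#   MyApp.RetryHelper.retry_transaction(MyApp.Repo, fn -> ... end)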

It sounds like the problem you’re running into in this case, though, is simply a serialization error (i.e. the same you would get in Postgres if you used SERIALIZABLE), in which case taking a pessimistic lock on the row will just cause the transactions to block each other instead of aborting. So if you just SELECT FOR UPDATE and then UPDATE and COMMIT, you should be fine. If you’re taking multiple locks you have to be careful to take them in order or you’re going to get deadlock aborts instead, of course.
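
In Ecto terms that pattern would look roughly like this (schema and field names are placeholders, and I haven’t run this against Cockroach):

import Ecto.Query

# Lock the matching rows for the duration of the transaction, then update
# them; concurrent transactions on the same rows block on the lock rather
# than aborting with a serialization error.
Repo.transaction(fn ->
  rows =
    MySchema
    |> where([s], s.bar == "x")
    |> lock("FOR UPDATE")
    |> Repo.all()

  Enum.each(rows, fn row ->
    row
    |> Ecto.Changeset.change(%{foo: "new value"})
    |> Repo.update!()
  end)
end)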

Yeah, we are expecting to need to retry queries eventually. Our first round of cleanups, however, is to get our code better organized so we’re making fewer queries, etc.