Is this usage of Repo.transact an anti-pattern?

Background

I am studying the Transaction Script pattern, namely from a book called “Learning Domain Driven Design”.

For those of you unfamiliar with this pattern, here is a small description:

The Transaction Script organizes business logic by procedures, where each procedure handles a single request from the Public Interface of the application, aka, the presentation layer.

In the book, the author explains that one of the limitations of this pattern is when you have to take an action that must be atomic across several different storage/communication mechanisms. Imagine updating the database and sending a message via a broker - for the system to remain consistent, both must be done atomically.

  @spec execute_v1(integer(), NaiveDateTime.t()) :: :ok | {:error, any()}
  def execute_v1(user_id, visited_on) do
    Repo.query!("UPDATE \"user\" SET last_visit=$1 WHERE id=$2", [visited_on, user_id])

    MessageBus.publish(%{user_id: user_id, visited_on: visited_on})
  end

Question

In most languages, ensuring this behaviour remains consistent and transactional would be impossible to do. In fact this is the main premise to introduce CQRS and the Outbox Pattern in later chapters.

However Elixir does have a workaround for this limitation, that can make sure this set of operations is atomic and consequently ensures the consistency of the system (or so I believe):

@spec execute_v2(integer(), NaiveDateTime.t()) :: {:ok, [Postgrex.Result.t()]}
  def execute_v2(user_id, visited_on) do
    Repo.transact(fn ->
      update_result =
        Repo.query!("UPDATE \"user\" SET last_visit=$1 WHERE id=$2", [visited_on, user_id])

      :ok = MessageBus.publish(%{user_id: user_id, visited_on: visited_on})

      {:ok, [update_result]}
    end)
  end

By using Repo.transact Elixir allows us to make sure this piece of code is atomic. If publishing of the message fails, the Database operation is rolled back. In this specific instance, because we only send 1 message, I believe this workaround would work just fine.

Even though I am using this workaround in this context, I would like to make it clear I have seen people use it with all sorts of things. A few come to my mind:

  • Sending HTTP messages after writing to the database
  • Doing database writes on multiple storage systems (usually relational DBs and non-relational DBs at the same time)

However, I always have the same questions.

  1. Is Repo.transact supposed to be used this way?
  2. Isn’t this considered an anti-pattern by the community?
  3. What are the alternatives to this workaround, if any?

I am also curious to know if anyone else reading the code thinks this can fail in a way I have not yet predicted. Please let me know!

That’s code example is not consistent. You could publish to the message bus but fail to commit the transaction. This becomes more obvious once you look at sql queries behind this:

BEGIN TRANSACTION;
UPDATE …;
-- publish to message bus;
-- consider a failure here
COMMIT;

This is no different in elixir than it is anywhere else. The problem stems from distributed computing, which doesn’t care at all about what language runs on individual actors.

4 Likes

This is very insightful. But I still do not understand how it could fail after publishing the message.

My code sample has no further instructions, so I assume what you mean is “Ecto can fail to commit the transaction, even if everything beforehand worked”, correct?

If so, could you give an example on how/why this can happen?

E.g. the database connection could drop. Another part of the system could make your ecto supervision tree restart, … There could be any number of reasons for the commit to not reach your database.

You’re correct that the elixir process executing your query is not doing a whole lot, but there’s many external factors you cannot control, but can very much prevent that commit.

2 Likes

I’m not sure why you arrive at this conclusion. In Django as an example, the entire request is wrapped in a transaction. This isn’t necessarily a good thing either because it can cause subtle, hard to find bugs.

I think I understand now. If I only have DB writes inside the Repo.transact then, even if the transaction fails in that specific area you mention, Ecto will still notice and fix it properly.

This however, becomes impossible in the case of sending a message, as Ecto does not know how to “fix a sent message”, nor is it possible to delete the message sent in many cases (think of RabbitMQ), only option being to “Send another message that nullifies the previous one” but this is opening a different can of worms.

Is this interpretation of your comment accurate enough?

I am not sure I follow your logic here. You are mentioning HTTP requests using a Python framework. The pattern I discuss is not related to HTTP or Web frameworks in general.

I understand you are trying to state that this same “transact” behaviour is also available in Python, correct?
As in, Python also have a transact function that allows you to do anything inside it, and that also has the same drawbacks?
If so, could you tell which function you are referring to, or how “Pythonists” usually use it? I am not a Python expert, but I would still like to see what you mean, even if its only for others to see your answer!

A request in Django is also just python code. At some point during a http request lifecycle Django starts a transaction, runs a bunch of python code and then commits the transaction or rolls it back. Django will only rollback DB changes, it won’t “unsend” an e-mail.

Ecto doesn‘t fix anything. Transaction cleanup is completely handled by the database. Hence it working for effects a transaction has on the database, but helping nothing for any sideeffects outside of the database.

So, if I were to be more precise, if a process crashes after the last instruction in Repo.transact, the transaction will fail, Ecto will notify the DB (maybe with a new replacement process or some other mechanism I am not aware of) and a rollback will happen. Right?

I don’t think so:

Under the hood postgres will realise that a connection has timed out and do the rollback.

1 Like

Lol you guys, this is not a matter of implementation details. What @LostKobrakai was trying to tell you is that solving this is literally impossible.

You can have at-most-once or at-least-once but you cannot have exactly-once in these situations.

You can make one side of the transaction idempotent and then guarantee at-least-once for that side. For example, you could commit the row to Postgres first and then repeatedly attempt to publish the message with some sort of idempotency id until it succeeds. There was actually a pretty good post about this on the Tigerbeetle blog recently.

As a side note, @dimitarvp and I had a discussion about something similar a while ago and I want to point out that this sort of thing is what I was talking about. Gluing multiple databases together can be messy and hard to understand, even if you’re knowledgeable in this area.

An alternative approach is to just have a really scalable multi-model database and use it for everything. The database can then worry about atomic commits for you. FoundationDB was a pioneer of this approach, and it’s one of the things I’m trying to do in Elixir with Hobbes.

4 Likes

This is something that Oban can help with. Extract the external side-effect code to a background job, and then enqueue the job inside the transaction.

If the transaction commits, the job is executed.

If the transaction rolls back, the job rolls back with it and is never executed.

One limitation of this approach is that it can’t handle the “roll back the transaction if the external request fails” scenario.

5 Likes

Not directly related to the problem of a «catch-all» recovery strategy, but take a look at the «Sagas» pattern, its quite nice :slight_smile:

Sage as an example :slight_smile:

I think I remember. It is indeed messy but at the time me and my team did not have the necessary expertise so we kind of glued a few things together. It worked with a few exceptions which we fixed the second-brain pattern i.e. have a complete audit log + a simulated log (meaning what should be there) and I authored a few background jobs that reconciled both.

I caught a subtle synchronisation bug that would have set the company back with 57K EUR.

I got a shoulder pat for my trouble, and a warning to not work on stuff that was not in the ticketing system. :laughing:

2 Likes

Yeah, unless someone does the enormous (ask me how I know) amount of work needed to serve as a base for another approach there is really not much you can do.

Ideally you just shove everything into Postgres as long as you can get away with it, as @al2o3cr alluded above. Postgres may have garbage consistency guarantees out of the box but at least its developers actually care about free software and won’t rugpull you at the earliest opportunity. And also at least it has a serializable flag even if nobody uses it lol.

If you need to hit an external API then you’re cooked but this is why good external APIs (e.g. Stripe) have idempotency as a first-class feature.

3 Likes