Postgres and guaranteed orderings - what is the correct way to do this?

Moderator note: This was split from Elixir-postgresql-message-queue - Pure PostgreSQL Message Queue for Elixir

What is the correct way to do this with Postgres? Lock a “counter” row and live with no concurrency?

Are there tricks you can do with LSNs or the WAL?

1 Like

You should not depend on insertion order at all, as it is unpredictable in general. Messages should have built-in sequencing: usually their own sequence number, a timestamp, or a combination of both.
I mean, it’s not a problem at all if you use some form of insertion timestamp, but if you need strict ordering you should treat it as “happened around that time”, no more.

2 Likes

Ah, their solution was exactly what I had in mind. Unfortunately your transactions would lose a lot of concurrency doing this, I think, but there is probably no other way.

If you use a timestamp from a clock you are back to square one, there is no guarantee that the clocks are synchronized. Even if you use the Postgres server’s clock I would not trust it to be monotonic across restarts (but maybe I’m wrong?). At least on the BEAM we have a monotonic clock, but again, only for one node (and again, restarts?).

(I think this is what you were trying to say but just clarifying.)

1 Like

One of the biggest downsides, in my opinion, is that if you generate your sequence on DB insert, you are at the mercy of your DB connection pooler, your DB resources, etc. Even if you use DB locks, there’s no guarantee that message 1 acquires the lock before message 2. It’s entirely possible that message 2 acquires it first and message 1 does so much later. That is not even close to being off by a few milliseconds of bad clocks.

1 Like

It is guaranteed that once the first process acquires its lock (i.e. once its INSERT/SELECT ... FOR UPDATE returns), it will be serialized, with respect to the “sequence”, before the second process to acquire that lock.

This is literally the strongest correctness guarantee a database can provide, here. I don’t understand what behavior you expect.

Any number of milliseconds is still a consistency violation. You will experience anomalous reads if you take a clock approach. For example, you might see this:

foo | 100
bar | 200
baz | 300

And then read again and see this:

foo | 100
qux | 150
bar | 200
baz | 300

This is exactly the anomaly which prompted the discussion, because it could cause you to miss messages.
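To make that failure mode concrete, here is a minimal in-memory Python sketch (hypothetical names, no real database involved) of a reader that keeps a “last seen version” cursor over timestamp-versioned rows, using the foo/bar/baz/qux example above:

```python
# Sketch: a reader that advances a cursor over timestamp versions will
# silently skip a row whose timestamp sorts *below* the cursor, even
# though the row committed after the previous read.

def read_new(messages, cursor):
    """Return messages with version above the cursor, plus the new cursor."""
    fresh = [m for m in sorted(messages, key=lambda m: m[1]) if m[1] > cursor]
    new_cursor = max([cursor] + [v for _, v in fresh])
    return fresh, new_cursor

# First read: foo, bar, baz are committed.
committed = [("foo", 100), ("bar", 200), ("baz", 300)]
seen, cursor = read_new(committed, 0)
assert [name for name, _ in seen] == ["foo", "bar", "baz"]

# qux commits *later*, but its clock-derived version is 150.
committed.append(("qux", 150))
seen, cursor = read_new(committed, cursor)
assert seen == []  # qux is lost forever: 150 sorts below the cursor (300)
```

This is only a model of the argument, but it shows why a missed message is not a “few milliseconds” problem: the reader never recovers.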

2 Likes

That is only after you connect to the database, and connections are not, in general, established in a deterministic order.

Without atomic clocks it’s hard to even claim event 2 happened after event 1 in a distributed system. What consistency are you talking about? You need a source of truth after all, and your database is a bad one, much worse than imprecise clocks.

1 Like

You are misunderstanding how consistency works in a transactional context. What is guaranteed, generally, is that if you commit one transaction and then, after that (from your point of view), start a second transaction, those transactions will be ordered to all observers in the order that you expect. This is, essentially, linearizability/strict serializability (I am waving a lot of hands here).

Postgres does not generally make these guarantees for all transactions (tragically), but in this case, with the lock approach, it would work out that way. Mostly.

Using the local clocks of the nodes you are starting the transactions from will violate those guarantees, and result in other processes observing events appearing out of order (this is just a straight up serializability violation, which is worse).

I will note that this discussion is further complicated by the fact that we’re talking about both the consistency guarantees of Postgres and the consistency guarantees of our message queue, which is a different system implemented on top of Postgres, at the same time.

But you are correct that you can’t guarantee events respect the order in which clients attempted to make connections, from their perspective. Unfortunately (and I am not a physicist), my understanding is that that order is not even well defined in our universe.

I think you will find that even with atomic clocks it is still very hard :slight_smile: You can track causality with message passing, possibly through a central server (here, Postgres).

1 Like

I understand it, I just think it’s not relevant. If you can’t even connect to the DB in a deterministic order, then automatically you can’t start transactions in order, or commit them.
For me it’s not about data consistency but about ordering: how do you decide which message goes after which. And like I said, if your messages don’t have a built-in source of truth, everything else is just a waste of time. Just use some timestamp, even if it’s imprecise, and move on. It’s not going to be worse than a DB counter, it works much more predictably, is free in terms of resources, and doesn’t require coordination.
If you want to observe the correct order at any moment, that is simply impossible - buffering is required to enforce correct order.

1 Like

?!

The order in which they commit is the order we’re talking about! It is very real!

It is very much worse than a DB counter, because you can miss messages if they are observed out of order. That is literally what we are discussing - how not to miss messages.

I can only assume that you are referring to something completely different than I am when you say “correct order”, because otherwise I haven’t the slightest idea what you’re talking about here.

1 Like

Why would it matter if commits happen in random order?
How does a random counter, affected by a thousand factors, reflect reality better than a timestamp that has a fixed error margin?

You need to have a buffer if you want to observe any kind of order (other than commit order, obviously). That means you can’t process messages immediately after storing them (in your DB scenario). For me it’s a different topic, related to ordering, but separate.
If the only thing you worry about is commit order - then I give up :slight_smile: As for me, I worry about an order that reflects reality somewhat.

Let’s say, “reflects reality in the common sense”. That means if two users in a real-time chat send a message, you treat those two messages as happening at the same time, within a small margin of error. For me, if these two messages have a margin of error of milliseconds, that is perfectly acceptable (of course, I’m talking about “the time the server received the message”). If the gap between the timestamps of two messages is significant (let’s say seconds), I would want the order in the DB (not necessarily observed at all times) to reflect that. Preferably, in a deterministic fashion.

1 Like

A big disconnect here is about whether you are ordering a set after the fact vs. maintaining a cursor. If you have a set of messages and some error margin, then sure, you can look at the set and say “this order is the order I will say it’s in”.

With a cursor though that’s totally different. You are saying “I have seen messages up to time X”. If a message is created with a time before X you’re screwed since you will never see it at all.

EDIT: and to be clear I understand the OP to be focused more on this “systemic messaging” type system not “user messaging”. He’s making comparisons to RabbitMQ and such, where the goal is to have systems work together by sending messages and handling them. It’s critical to not miss messages, and it’s also generally critical to offer some means of guaranteeing order (perhaps scoped, perhaps global). These requirements can be difficult to satisfy jointly in Postgres without sacrificing a fair bit of performance.

3 Likes

It does not matter if (concurrent) transactions commit in random order. What matters is that:

a) A transaction started after another transaction commits is ordered after it

b) Other transactions observe that order correctly

(Technically a implies b here, but I am stating them both for clarity.)

In order to preserve those invariants, the message version (timestamp) must match the commit order. Using a counter (carefully) in Postgres accomplishes that. Using the node’s clock does not.

If the timestamps do not match the commit order, another process will observe messages out of order or miss them entirely (depending on your implementation). If message 1 commits at t=100, and then message 2 commits at t=200, and then message 3 commits at t=150, an observer could observe [m1, m2] and then observe [m1, m3, m2].

That is very bad. It means that when we read a message there is no way to know if another message came before it, which means we must process messages out of order.

If we process messages out of order we could be susceptible to all sorts of issues, like for example the New Enemy problem.

Imagine we have something like an s3 bucket, implemented on top of our message queue. A user sets the bucket to private, waits for confirmation (commit), and then uploads sensitive medical data.

If a client were to observe those messages out of order, it could expose the private data publicly, which would be catastrophic.

1 Like

The problem here is that you have chosen a bad example. Messages in a chat are effectively a CRDT (an append-only set). The order in which they are processed does not actually matter.

There are myriad cases where the order matters very much, like the example I gave above.

Another simple example: you kick a user out of your chat, and then post a message disparaging them. If you process those events out of order, you would be in trouble!

This doesn’t sound right to me, but I’m not sure I understand what you mean. What other order would you be observing, in a message queue, besides the order of messages?

I don’t know where you got the idea that your clocks have a fixed error margin, but unless your name is Google they most likely do not. And even TrueTime’s guarantees are probabilistic, they just believe the probability is very, very low.

1 Like

And importantly in the case of postgres, you could have perfect clocks and it won’t stop the “I’ve observed up to time X, therefore I’ll never get a message before time X” problem, since commit times allow for a ton of ordering issues.

3 Likes

So there’s no way for the timestamp value inside a statement akin to INSERT ... inserted_at = NOW() to be computed at commit time rather than at the start of the transaction? Disappointing.

If your messages have dependencies, you want the process that publishes them to insert them in the correct order.
If you can’t do that, the buffer is your only hope.
You can’t depend on DB message insert order if you do inserts concurrently.
Period.

It’s a good example because it’s a message queue, literally like in the title :stuck_out_tongue:
The messages can also have dependencies, for example, they can be replies.

That’s why you need a buffer, like I said. Or you insert the messages one by one, confirming each time, and say goodbye to concurrency. Do I need to explain that in the latter case the inserting process must implement the same buffer?

Besides the commit order, you mean? Any built-in order, like causality. Are you sure you are reading my messages at all? You are literally describing causality and then asking me what kind of order I want besides commit order?
Let me explain it another way.
If a message B depends on message A, you don’t need to insert B after A (nor is that always possible in a distributed system). You insert them concurrently and maintain a buffer. Then the reader process ensures ordering and correctness. And yes, that means messages must carry the required information in the message body or metadata.

That’s why I said, “in the common sense”. You don’t need perfect correctness, nor is it achievable.

I’m sorry to disappoint people who expect perfect ordering :slight_smile:
But it’s simply impossible to achieve in real life. We need to make compromises.
If you have messages A, B, C, E, F, how long do you want to wait for message D?
Eternity? Sorry, but buffering and processing after the fact is the only reasonable thing you can implement in a distributed system. Eventual consistency (or no consistency at all) :slight_smile:
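The buffering idea that keeps coming up can be sketched in a few lines of Python (a hypothetical ReorderBuffer, assuming messages carry contiguous sequence numbers assigned by the publisher):

```python
# Sketch: hold messages until every earlier sequence number has arrived,
# then release them in order. Gaps (a missing message D) block delivery.

class ReorderBuffer:
    def __init__(self):
        self.next_seq = 1   # next sequence number we are allowed to deliver
        self.pending = {}   # seq -> message, held until the gap closes

    def push(self, seq, msg):
        """Buffer msg; return the (possibly empty) list now deliverable."""
        self.pending[seq] = msg
        out = []
        while self.next_seq in self.pending:
            out.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return out

buf = ReorderBuffer()
assert buf.push(2, "B") == []          # B waits: A has not arrived yet
assert buf.push(1, "A") == ["A", "B"]  # gap closed, both delivered in order
assert buf.push(3, "C") == ["C"]
```

Note this sketch waits forever for a missing message; a real system would need a timeout or gap-detection policy, which is exactly the compromise being discussed.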

I have a strong suspicion that some people here are confused about what a message queue is.
The message queue use case is usually about some event stream that may have some order, but whose events are immutable (you only need to process/react to them). Such as:

  • a real-time chat
  • an event log
  • sensor measurements
  • historical records of some sort
  • a pub/sub system
  • etc.

You would usually encode at least some ordering information in the message or its metadata. If it’s not there, that usually means it’s not very important.

Typical OLTP, such as a ledger, is not a good use case for a message queue, because there the commit order is indeed significant and the transactions themselves can change depending on it. You would want strong consistency for that.

1 Like

This is a very confusing conversation. You seem to be arguing that the guarantees provided by say Google Cloud Pubsub or by Fable (an event sourcing library I wrote in Postgres) are impossible. They aren’t, I use them daily. To be fair we also aren’t being all that rigorous about our definitions, so maybe that’s part of the issue.

To make it work though does require placing some limitations on both the event sender and the event receiver.

As a moderation note: this thread has long since diverged from discussing OPs library specifically, and I do think we should wait on him to reply regarding the capabilities that his approach does or doesn’t offer with respect to ordering. If we want to continue this thread about what’s possible in terms of message queue ordering guarantees broadly then I’ll make a new thread.

1 Like

Yes, of course you must insert them in the correct order!

If we use a counter (with an appropriate lock) in Postgres, we can guarantee that any process which reads the list of messages will see them in the order that they committed, because the version field (from the counter) will match the commit order (due to the lock).

If we use timestamps, we lose the guarantee that the timestamps follow commit order. As a result, it is possible to observe messages out of commit order. This makes everything much, much harder, and is the cause of many bugs.
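As a toy illustration of the claim, here is an in-memory Python sketch of the locked-counter idea, with a threading.Lock standing in for the Postgres row lock (SELECT ... FOR UPDATE) and an append to a log standing in for the commit. This is a model of the argument, not a Postgres implementation:

```python
# Sketch: every writer takes the same lock, increments the counter, and
# "commits" before releasing. Versions then necessarily match commit order.
import threading

lock = threading.Lock()
counter = 0
commit_log = []  # (message, version) in the order commits became visible

def publish(message):
    global counter
    with lock:                             # stands in for SELECT ... FOR UPDATE
        counter += 1
        version = counter
        commit_log.append((message, version))  # "commit" while holding the lock

threads = [threading.Thread(target=publish, args=(f"m{i}",)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

versions = [v for _, v in commit_log]
assert versions == sorted(versions)  # commit order == version order, always
```

With node-local timestamps in place of `counter`, the final assertion can fail, which is exactly the out-of-order observation described above.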

Do you dispute this claim? If so, can you explain your reasoning?

1 Like