Postgres and guaranteed orderings - what is the correct way to do this?

If we maintain strict serializability, the commit order matches the causal order, and there is no extra effort required to read messages in causal order because they are the same!

What you are proposing here is encoding causal information into each message and then handling that case-by-case in application code. This is an outdated approach (Dynamo did this, for example).

The problem is that you do not always know the causal link between messages.

I chose my examples very carefully to demonstrate this, and you have again chosen an example (chat message reply) that happens to be convenient for determining causality.

Here is a more detailed example showing why this does not work:

A doctor and a patient are collaborating on their medical records via cloud storage. The doctor sets the cloud drive to private, and then turns to the patient and says “ok, upload your medical records”. The patient connects, via their phone, to a different server and uploads their records.

If you observe these events out of order, private records will be exposed. These events have no clear causal link from the perspective of the database.

3 Likes

Hey folks, I have split this out from the post in the libraries section. I think this is a great discussion but as it is mostly focused on the abstract challenges of these sorts of systems and isn’t about the specific library posted in the other thread, I felt it important to split it out.

6 Likes

You are making trivially falsifiable claims. There are a number of distributed systems which offer strict serializability. They have been around for decades now!

Here are four such systems with distinct architectures: Spanner, FoundationDB, FaunaDB (Calvin), and Etcd.

There is no reason a message queue cannot make ordering guarantees, and it is extremely helpful for it to make those guarantees.

If your claim is “there are use cases where I don’t need strict serializability” then that’s perfectly fine! But you can’t claim systems offering those guarantees don’t exist when they clearly do!

2 Likes

Assuming that there is some final, central authority on whether or not the drive is private (e.g. a single database row), I think it’s important that the doctor’s interface show that the task is complete only once the update to that central authority has been committed. At that point, the system can guarantee that the patient’s data cannot be public, unless the doctor turns to the patient before waiting for confirmation.

But if this system leans heavily into eventual consistency, for instance, I can see this being a concern. In this case maybe the system would acknowledge receipt of the instruction to make the drive private, without indicating when exactly it will take effect. That would be quite a poor design in this scenario, I think, but this approach is not unusual for social media, for instance.

So personally, I’d say this is more of a general event-driven system semantics problem than a message queue ordering problem.

1 Like

I feel compelled to share this example of a toy message queue implemented using FoundationDB’s strict serializability. It’s not directly relevant to the thread topic, but FDB was mentioned and so the contributors in the thread might find value in it.

https://hexdocs.pm/erlfdb/kv_queue.html

2 Likes

What is the correct order for chat messages? For log records? For sensor measurements? Commit order? That simply doesn’t make sense.

It actually makes everything easier: if you look at message queue use cases, most events have a natural order and a timestamp, simply because messages are immutable pieces of information about events that already happened. The events have a natural order - and like I said, I care about maintaining that order.

You are assuming that messages always arrive in the system in the correct order, but that is usually not the case. That’s why you usually need to reorder them based on their natural order.
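Reordering by a source-assigned timestamp can be sketched in a few lines. This is a toy illustration only - the `Message` and `ReorderBuffer` names are invented here, and `event_ts` stands in for whatever “natural” timestamp the producer assigns - not code from any queue discussed in this thread:

```python
from dataclasses import dataclass, field
import heapq

# Toy reorder buffer (names invented here): hold messages briefly and release
# them sorted by the timestamp assigned at the source - the "natural order".
@dataclass(order=True)
class Message:
    event_ts: float                      # source-assigned timestamp
    payload: str = field(compare=False)  # excluded from ordering comparisons

class ReorderBuffer:
    def __init__(self) -> None:
        self._heap: list[Message] = []

    def push(self, msg: Message) -> None:
        heapq.heappush(self._heap, msg)  # min-heap keyed on event_ts

    def drain(self):
        while self._heap:
            yield heapq.heappop(self._heap)

buf = ReorderBuffer()
for ts, payload in [(3.0, "c"), (1.0, "a"), (2.0, "b")]:  # arrival order != event order
    buf.push(Message(ts, payload))

ordered = [m.payload for m in buf.drain()]
print(ordered)  # ['a', 'b', 'c']
```

Note that a real consumer would also need a policy for how long to hold messages before draining, since a late arrival after the drain is lost to the ordering.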

If you look at the use cases for message queues, most of the time you already have the correct order of events.
You keep bringing up contrived examples that have nothing to do with message queues.

Sounds like you need a database, not a message queue. You can build a message queue on top of a database, but you should not expect a message queue to have the properties of a database.

These are databases. Do you not understand the difference between a database and a message queue?

I already agreed you can easily implement any order in your system if you have a single writer with its own buffer. That situation is very common and there’s nothing wrong with it, but one could argue that in that case you are not implementing order inside the message queue, but outside of it, because the writer does all the work.

Now, excuse me, I will not post here any more, because the discussion seems to be becoming increasingly pointless. I’m also not interested in discussing how a message queue should work like a database. After the split it seems like I’m discussing Postgres (which is not very fair to me), but that’s not true at all - all my posts here were made in the context of implementing a message queue on top of a DB.

I would strongly suggest taking a step back and reading Strong consistency models. Your posts here don’t indicate a familiarity with the terminology and discussions in this area. You’re also trying to refute statements of the form “there exists” with other statements of the form “there exists”, and that just doesn’t work. For example:

No, he isn’t. He’s saying there exist messages that can have a deterministic order. This is achievable with vector clocks, various acknowledgement schemes, and so on. The fact that systems also exist with indeterminate origin order does not falsify his claim.
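For illustration, here is a minimal vector clock sketch in Python (all names invented for this example). `vc_happens_before` implements the standard partial order: a happened-before b iff every entry of a is less than or equal to the corresponding entry of b, and at least one is strictly less:

```python
# Minimal vector clock sketch (all names invented for this illustration).
def vc_increment(clock: dict, node: str) -> dict:
    """Tick this node's entry before a local event or send."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(a: dict, b: dict) -> dict:
    """On receive: take the element-wise maximum of the two clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_happens_before(a: dict, b: dict) -> bool:
    """a -> b iff every entry of a is <= b's and at least one is strictly less."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

send = vc_increment({}, "A")                   # A sends a message: {"A": 1}
recv = vc_increment(vc_merge({}, send), "B")   # B receives and replies: {"A": 1, "B": 1}
other = vc_increment({}, "C")                  # C writes independently: {"C": 1}

print(vc_happens_before(send, recv))   # True: the reply is causally after the send
print(vc_happens_before(send, other))  # False: neither order holds -> concurrent
```

The “concurrent” case in the last line is exactly the doctor/patient scenario from earlier in the thread: the two writes have no message-passing path between them, so vector clocks alone cannot order them.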

Your focus on chat is the bit that seems needlessly contrived. As I pointed out much earlier in this thread, there is a whole category of message queue use cases where they are a mechanism of inter system communication. In such cases tools exist to provide the exact set of guarantees you seem skeptical about, and those guarantees can be critical for the proper function of the system.

Again, this betrays a lack of understanding of how these systems relate to one another. How does Postgres work internally? The WAL. What is the WAL? A log. Logs and message queues are extremely similar structures.

More to the point, the consistency models I linked at the top aren’t about “databases” or “message queues”. They’re about systems, and the observable properties of those systems with respect to operations you can perform. It is 100% valid to evaluate message queues under those models, in fact a large number of the Jepsen evaluations do!

Overall this is a super cool area of modern computer science and I would again just suggest taking a step back and going “maybe there is something new I need to learn here”.

2 Likes

There is something fundamental and interesting to understand here: an event log and a “database” are equivalent. They are the same thing. This makes intuitive sense: you can build an event log from a database, and you can build a database from an event log.

In distributed systems, it is known that total order broadcast is equivalent to consensus. Think about what this means: deciding on an order of events is equivalent to agreeing upon a value.

If you agree on an order, you agree on the latest value in a log of changes. Likewise if you agree on a value, that “value” could be a log of changes!
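The log-to-database direction can be shown in a few lines of Python. This is a toy sketch (the operations and keys here are invented): folding an ordered log of changes yields the current state, which is why agreeing on the order of the log is as strong as agreeing on the value:

```python
# Toy illustration of "a database from an event log" (operations invented here):
# replaying an ordered log of changes produces the current key-value state.
log = [
    ("set", "visibility", "public"),
    ("set", "record", "bloodwork.pdf"),
    ("set", "visibility", "private"),
]

def replay(log):
    """Apply the log in order; the final state is the 'database'."""
    state = {}
    for op, key, value in log:
        if op == "set":
            state[key] = value
        elif op == "del":
            state.pop(key, None)
    return state

state = replay(log)
print(state)  # {'visibility': 'private', 'record': 'bloodwork.pdf'}
```

The reverse direction is just as direct: record every change made to the map as an entry appended to a log, and you have rebuilt the event log from the database.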

Of the four “databases” I cited:

  • Etcd is built upon a log replicated with Raft (the purpose of Raft is, in fact, to replicate logs)
  • FaunaDB/Calvin are also built on a log (Calvin was replicated with Paxos), but they are designed to scale horizontally; they use the classic batching/latency tradeoff to achieve massive concurrency
  • FoundationDB is also built on a distributed log, and is quite similar to Calvin in some respects, but performs concurrency control differently (and some other differences)
  • Spanner is the odd one out: it is composed of many Paxos-replicated logs, and events are ordered between logs using tightly-synchronized clocks and a trick to wait out uncertainty at commit time

Of course there are also products which call themselves “message queues” which provide strict FIFO ordering, like AWS’s “SQS FIFO queues”, and I’m sure there are many others!

If you were following along with my preceding replies the confirmation step was strongly implied, but I should have re-stated that explicitly. My mistake.

“The system” can only make the guarantees that it makes! You are correct that a strict serializable system would make this guarantee, but there are many systems which do not make this guarantee. The purpose of this example is to demonstrate the danger of such systems.

I’m afraid you do not need to fall all the way down to eventual consistency to run into these issues. Even just a small step down from strict serializable is enough to cause anomalies which are very difficult for the average developer to comprehend.

For example, CockroachDB, a popular distributed SQL database, provides serializable isolation with no stale reads. This is a much stronger guarantee than eventual consistency, and is actually very close to being strict serializable.

And yet, if you were to insert a “visibility policy” row into one table setting the drive to private, commit that, and in another transaction (on another node) insert a medical record, while concurrently attempting to read them both in a third transaction, it would be possible to observe the medical record without observing the policy. Not great!

CockroachDB has some tools which you can use to mitigate this, but to use them you have to understand the anomalies to begin with, and they are very, very hard to understand.

1 Like

Yeah, I think we’re reaching for the same thing here. This is what I meant by “some final, central authority on whether or not the drive is private”. If the system cannot offer strict serializability, then it cannot offer the kind of “central authority” I had in mind.

As soon as we switch from a central database to something distributed, we lose the ability to reason about the database’s state at a given point in time. As I understand it, this is a physical reality – Relativity suggests that we cannot reason about a “global” time, because separate points in space experience time separately. The best we can do is build something that is fine with the uncertainty, such as CRDTs, etc…

For my purposes, I generally think about a system either offering guarantees about writes being immediately visible to all readers, or else it is essentially “eventually consistent” in the loosest sense of the term. And there are of course distributed databases that build on that “eventually consistent” base to offer guarantees that almost reach the same point.

So all that to say that yeah – distributed databases are complicated. But my PostgreSQL message queue was written pretty much with regular serializable transactions in mind. And if someone is looking into a distributed flavour of Postgres, or even regular replication, they should be smart enough to understand the implications for ordering.

Just a small update to add that I saw this in CockroachDB’s documentation:

“No stale reads” means that, once a write transaction committed, every read transaction starting afterwards[^3] will see it.

This is certainly enough to satisfy our doctor + patient scenario. It explicitly offers a mechanism to guarantee that after commit, the write will be visible in subsequent transactions.

I’m afraid it’s not so simple. The “final central authority” you are trying to describe sounds a lot like (single-key) linearizability, which is essentially the guarantee (along with serializability) that CockroachDB is making. It means (hand-waving) that when you read a key, you see the latest version. A better explanation was linked above by @benwilson512 (the Aphyr article).

Unfortunately, once you have transactions things are not so simple. Linearizability is actually very easy to achieve in a distributed database (those which do not offer it are being exceptionally lazy). All you need to do is store a particular key in one place.

Strict serializability extends the “idea” of linearizability to the entire keyspace. It means that all transactions, over any arbitrary keys, are ordered as if they were atomic (serializability) and in an order consistent with “real-time” ordering (as if it were observed by omniscient clocks). Meaning that the order also, by definition, reflects real-world causality (the doctor talking to the patient).

CockroachDB does not offer strict serializability, which is why I used it as an example of where things can go off the rails when you compromise on consistency.

This is absolutely not “the best we can do”. It is possible to guarantee strict serializability in a distributed database. It has been done at scale in production (by Spanner and FoundationDB) for about 15 years now.

You are correct that we cannot violate the speed of light, but we can trace causality across distances through message-passing. A simple implementation (used by FoundationDB) is a so-called “timestamping server” which passes out monotonic versions from a central location.
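The timestamping-server idea can be sketched in a few lines. This is a toy model of a central version authority only (names invented here), not FoundationDB’s actual implementation:

```python
import threading

class TimestampServer:
    """Toy central version authority: hands out strictly increasing versions.
    (A sketch of the idea only, not FoundationDB's actual design.)"""
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._version = 0

    def next_version(self) -> int:
        with self._lock:  # a single location decides, so versions are monotonic
            self._version += 1
            return self._version

server = TimestampServer()
versions = [server.next_version() for _ in range(5)]
print(versions)  # [1, 2, 3, 4, 5]: every later commit gets a strictly larger version
```

Because every commit must talk to this one authority, any commit that starts after another finishes necessarily receives a larger version - which is exactly the real-time ordering property at issue, bought at the cost of a round trip to a central location.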

Spanner effectively backchannels the communication (through GPS) and then tracks the passage of (short periods of) time using extremely accurate atomic clocks and blocks transaction commits for an “uncertainty period” which their engineers determined is unlikely to be incorrect. This allows them to order events without (direct) communication, and it is also extremely complicated, confusing, expensive, and (inherently) probabilistic.
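The commit-wait step itself is simple to sketch, even though engineering the clock bound behind it is not. Here is a toy Python model; `EPSILON` is an invented stand-in for the TrueTime uncertainty bound, which in Spanner is on the order of milliseconds:

```python
import time

# Toy model of Spanner-style "commit wait" (EPSILON is an invented stand-in
# for the TrueTime uncertainty bound; Spanner's is on the order of milliseconds).
EPSILON = 0.005  # assumed upper bound on clock uncertainty, in seconds

def commit_wait(commit_ts: float) -> None:
    """Block until local time is certainly past commit_ts + EPSILON, so that no
    other node's clock could still assign an earlier timestamp to a later commit."""
    while True:
        remaining = (commit_ts + EPSILON) - time.time()
        if remaining <= 0:
            return
        time.sleep(remaining)

start = time.time()
commit_wait(start)        # wait out the uncertainty window before acknowledging
elapsed = time.time() - start
print(elapsed >= EPSILON)  # True
```

The tradeoff is visible even in the toy: every commit pays the uncertainty bound as added latency, which is why so much of Spanner’s engineering effort goes into keeping that bound small.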

I will say that to some extent we have gotten pretty far from even this thread’s original scope, which was focused on a non-distributed Postgres instance. There are still gotchas there, but when we are down to a single node there are a lot more options.

It’s also worth noting that single-key linearizability is a common “good enough” spot for message-queue-style tools and services.

2 Likes

It is most certainly not enough. I want to emphasize that this is exactly what I’m talking about when I say this is very confusing, and that it’s not your fault that you misunderstand these guarantees so much as it is their fault for providing such a confusing consistency model to developers. (Similarly, if someone pulls a handle labeled “push”, I blame the door.)

You are correct that reads within our “third transaction” are guaranteed to see the latest values. They are also, in addition to that, guaranteed serializability (which is not trivial), so they will see a view of the database consistent with some ordering of transactions, as if each transaction ran in isolation.

The problem is that because CockroachDB fails to guarantee strict serializability, it is possible for a (concurrent) third transaction to observe the medical record row without observing the visibility policy row.

From a theory perspective, this is consistent with the medical record’s write transaction being re-ordered before the visibility transaction even though it happened afterwards in real-time. This is called a “time travel anomaly” or a “causal reverse”.

If you don’t believe me, you can read the post I linked from their own blog describing this exact anomaly, though of course they went out of their way to choose an innocuous example (social media posts) so that it doesn’t sound like such a big deal :slight_smile:

Yeah, this is an off-topic thread anyway so it’s no big deal :slight_smile: My original question was answered almost instantly anyway lol

Postgres is interesting because as long as you run in serializable mode it is actually strict serializable. Of course it is not fault tolerant, but that’s still nice. Unfortunately nobody actually runs it in serializable mode…

What has really surprised me is that it seems like the existence of distributed systems with very strong consistency guarantees is not known to most developers. All of this “eventual consistency” business was essentially propaganda from Google/Amazon in the early 2000s before they figured out how to scale consistent systems. Of course once Google figured it out in the late 2000s and early 2010s they shifted their narrative, but clearly the industry still has not recovered.

Amazon, for their part, have taken until literally this year to build a proper successor to DynamoDB (which for reasons beyond mortal comprehension is named Aurora DSQL). And despite using synchronized clocks like Spanner they have decided to commit to snapshot isolation! What can you do, I guess.

1 Like

To me, this sounds a lot like a “central authority” :stuck_out_tongue: But yeah, clearly you guys have looked waay deeper into the specifics of distributed databases than I have. Everything I’ve heard so far is interesting, but seems to pretty much fit my existing mental model, which admittedly is a bit abstract.

This is really cool, but to me it still sounds like a system that “is fine with uncertainty”. Or maybe more accurate to say in this case that it’s trying to reduce that uncertainty to a really, really low margin in the expectation that it can just be ignored.

I’m really not seeing that in the article. They make the claim:

“No stale reads” means that, once a write transaction committed, every read transaction starting afterwards[^3] will see it.

And I take that to mean that so long as our “privacy” (first) transaction has committed, and we wait to begin any subsequent transaction until the commit is confirmed, we should be just fine. Am I really reading this wrong? Right now what I suspect is that you’re talking about all these things firing off concurrently, because their example of failed strict serializability involves a read during two concurrent writes. And I just don’t see that being a serious issue in practice.

In fact I’d say that for all but the most demanding systems, a regular “read committed” isolation level should do just fine for a message queue, at least for the queue data itself.

2 Likes

Yes, absolutely! The distinction here is between a “central authority per key” (linearizability) and a central authority for all keys, together (strict serializability). CockroachDB is an example of a database which provides the former, and FoundationDB is an example of the latter.

If you can order all transactions in a way that follows real-time ordering, like I described, then you have achieved strict serializability (assuming you already have serializability of course). In that case these “time travel anomalies” go away.

Indeed, Google’s engineers obviously believe the probability of their clocks being out of sync by more than the bounds they have chosen is extremely low. Similar to how we say a UUID is “globally unique” because the chance of collision is astronomically small.

Google apparently considered it a serious enough problem that they spent what I would imagine to be an enormous amount of money developing Spanner and Zanzibar. Those systems, particularly Zanzibar (which builds on Spanner), were designed explicitly to solve this exact problem.

I did not get into it (and you can read their papers if you want to know more) but they also use these systems to provide externally-consistent snapshot reads as well, which are not strict-serializable (because they are stale) but still respect real-time ordering otherwise. They are very serious about it.

I would argue it’s best to think of weaker consistency levels as akin to manual memory management. Obviously there are times when you have to do it, but we’re really better off with garbage collection most of the time. It’s safer that way.

1 Like

In the example I gave, you first commit the “visibility=private” transaction, then begin and commit the “medical record” transaction after that. The order of these transactions is very important. If the record were to be somehow re-ordered before setting the drive to private, it would be exposed.

It is the third (read) transaction which is concurrent with the previous two. You cannot control the timing of the read transaction, it could be literally anyone trying to view the contents of the drive.

Because the third (read) transaction is concurrent with the other two (even though the other two are not concurrent with each other), it would be valid under strict serializability to serialize the third transaction anywhere with respect to the other two. As long as the write transactions are in the correct order, the system works as you would expect.

So, let’s say the visibility transaction is t1, the medical record write is t2, and the transaction to read the drive contents is t3.

The following serializations are valid under strict serializability:

  • [t1, t2, t3] - the read transaction observes a private drive
  • [t1, t3, t2] - the read transaction observes a private (and empty) drive
  • [t3, t1, t2] - the read transaction observes a public, empty drive

However, CockroachDB does not guarantee strict serializability. So what can happen is this:

begin t3;
read visibility -> public; # t3

begin t1;
write visibility=private; # t1
commit t1;

begin t2;
write medical record; # t2;
commit t2;

read medical record; # t3
commit t3;

This does not violate linearizability, because we always read the latest version of each key. It also does not violate serializability, because there is a valid serialization of these transactions where each appears to modify the database in isolation: [t2, t3, t1].

But note what has gone wrong: t2 and t1 have been re-ordered! So strict serializability has been violated, and the record exposed.
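The check above can be done mechanically. The toy Python model below (keys and values invented to match the example) enumerates serial orders and confirms that the only order consistent with what t3 observed is [t2, t3, t1] - so the history is serializable, but only by reversing t1 and t2 relative to real time:

```python
from itertools import permutations

# Toy model of the three transactions: what each writes, and what t3 observed.
t1 = {"reads": {}, "writes": {"visibility": "private"}}
t2 = {"reads": {}, "writes": {"record": "medical-record"}}
t3 = {"reads": {"visibility": "public", "record": "medical-record"}, "writes": {}}

initial = {"visibility": "public"}  # the drive starts out public and empty

def consistent(order):
    """Run the transactions serially in this order; do all reads come back
    with the values that were actually observed?"""
    state = dict(initial)
    for _, t in order:
        for key, observed in t["reads"].items():
            if state.get(key) != observed:
                return False
        state.update(t["writes"])
    return True

txns = {"t1": t1, "t2": t2, "t3": t3}
valid = [[name for name, _ in p]
         for p in permutations(txns.items()) if consistent(p)]
print(valid)  # [['t2', 't3', 't1']]: serializable, but t1 and t2 are reversed
```

A strict serializable system would reject this history outright, because no serial order that respects the real-time precedence t1 before t2 can explain t3’s reads.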

I have done my best to follow the conversation here and learned a few new things (and was reminded that I knew some others but have since forgotten them), and I cannot help thinking… if the DB’s transaction order becomes such a problem, why not put a message queue in front of it? As far as I am aware, Kafka et al. don’t make mistakes with the order of messages. :thinking:

1 Like

Ah! Yes, OK I see the problem you’re pointing out now. But if I understand correctly, this issue only appears in practice with CRDB if the two write transactions are unrelated:

In this particular example, if the schema of the hackernews comments table would contain a self-referencing foreign key constraint (asking the database to ensure that child comments reference an existing parent), then the “reading” part would have been ensured by the system.

So I certainly see your point that there are subtleties here: so long as we’re careful to ensure that drive privacy and drive writes have some relational connection (e.g. a foreign key constraint) in the database, this should not be a problem. But you’re right that it’s easy to overlook.

1 Like