Sorting DB entities having UUIDs

thojanssens1 · September 16, 2020, 6:20am

I insert a list of entities in the DB, say articles, for a user. Primary keys are UUIDs.

To sort the articles just as they were initially sent through the List, I tried to sort on inserted_at. It doesn’t work because these records all have the same value for inserted_at, as they were inserted at the same time.

I might have two options:

Use microseconds? If I insert that collection of articles, will the timestamp in inserted_at be guaranteed to have different microseconds?
Add an auto-incremented field if the microsecond precision would be a bad idea?

QuinnWilton · September 16, 2020, 6:41am

I don’t know your exact use case, but ULIDs could be worth looking into as an alternative to UUIDs. They’re essentially lexicographically sortable UUIDs that embed both a Unix timestamp and random data.

You still only get millisecond precision, but you end up with a stable ordering over them because of the random segment.

thojanssens1 · September 16, 2020, 6:42am

It makes me wonder then why that’s not the accepted default unique identifier instead of UUID in the dev world. Any very strong drawbacks that UUID is more used? Is ordering on an ULID slower than on say a timestamp field or auto-incremented integer?

QuinnWilton · September 16, 2020, 6:56am

ULIDs are a relatively new idea, and library support isn’t great in all languages. You also do leak timing info through the ULID, which isn’t always desirable.

There may be other drawbacks, but I’m not an expert, sorry.

LostKobrakai · September 16, 2020, 7:02am

There’s also http://gh.peabody.io/uuidv6/ as an alternative to ULID, but again it’s only proposed but not yet an officially accepted standard.

Generally however I’d suggest not needing to sort by (only) the primary id. You’re talking about articles. Articles usually have a published at date(time), so sorting should happen by that. To achieve a stable sort you can use the primary id as secondary sorting column.

thojanssens1 · September 16, 2020, 7:02am

If I insert a set of items into the DB at once, I think it might be within the same milisecond though. Not sure. Microsecond sounds safer.

fuelen · September 16, 2020, 9:31am

You can simply use ordering by multiple columns: [:inserted_at, :id]. If inserted_at column is the same then ordering by id will be applied, so you will always have strict order.

al2o3cr · September 16, 2020, 10:59am

Something to consider: the order of these things sounds important. Is it possible to update that order after inserting the records? Consider materializing that order with something like a position column.

chulkilee · September 16, 2020, 12:13pm

If the order matters, but the order is not a primary key, then it’s totally fine to have a separate field (e.g. serial of postgresql) for sorting purpose.

tfwright · September 16, 2020, 2:24pm

Agreed with @chulkilee. And if the order is a matter of business logic and not just display, I think using a separate field has other advantages as well because it’s more explicit. I think that would be my first choice here, even if it came with some speed cost.

outlog · September 16, 2020, 7:15pm

there is also snowflake ids: https://github.com/meetnow/snowflakex

eahanson · September 17, 2020, 2:16am

I have recently started adding a column called “seq”, which is an auto incrementing sequence, in projects that use UUIDs for ids. For most of the tables I deal with, I’m not concerned about an extra few bytes per row.

I use seq as a secondary sort to keep sorting stable. This is handy in a UI, and especially in a LiveView UI where the data might change a lot. Stable sorts are also handy in tests to keep them from being flaky.

Finally, having this column means that rows can be easily sorted by insertion order even if multiple rows are inserted simultaneously which is helpful in certain kinds of unit tests.

sribe · September 17, 2020, 2:49am

Well, now() returns the same timestamp for all calls in a transaction, so no, microseconds won’t help. (There is a call that gets wall clock time and so can vary through a transaction, but still not a good idea.) If you want an explicit order, add it–a new column.

christhekeele · September 17, 2020, 7:27am

Either embedding a timestamp or a sequence number into a guid requires co-ordination: an authoritative source from which to mint them (your app or db server).

One virtuous property of pure-random guids is that you can let clients generate their own primary ids for entities before submitting them to the persistence layer for creation. It’s a handy feature in certain scenarios that you lose if you go with an authoritative id minter.

In practice both UUIDv1, UUIDv6, and ULID specs dedicate enough entropy to the random bits that you still don’t really need the centralized authority, and UUIDv6 and ULID have the K-sortable property you want to make a good primary key with default sorting virtues.

As mentioned in this thread neither are universally accepted standards (yet), so aren’t as widely supported nor adopted as one would hope for one’s choice in primary key. I would bet on that changing in the next 5 years, however, as they are incredibly useful and likely the future of guids-as-distributed-system-pks! Maybe keep an eye on the UUIDv6 draft.