Identifying users/content by 64-bit int (like Twitter) vs. 128-bit UUID (string) vs. 64-bit random string? Isn’t a 64-bit int or a random string far more efficient?

I am looking at the best way to identify my users and content in a social-type app (user profiles, media, chat messages, etc.).

UUID vs. 64-bit Int

I don’t like sequentially numbered systems.

As I understand it, GUID/UUID means a 128-bit int (16 bytes), which in practice is stored as a 32-character string = 32 bytes.

This seems terribly inefficient. Since almost nothing handles 128-bit ints natively, what is the point of using them? You end up with double the storage and data-transfer cost. Since you’re working with strings, I presume equality comparisons and matching become more costly as well.

Advantage of 64-bit Int:

I note Twitter uses 64-bit unsigned ints to identify their posts and users:

https://developer.twitter.com/en/docs/twitter-ids

Twitter is one of the biggest websites in the entire world. If they can manage this, why don’t more people do it? As long as you check for collisions against your database every time you generate a new ID, and you do this sequentially (calls, not casts), then why not?

Even if you use 128-bit UUIDs you must still check for collisions, and do it sequentially (birthday paradox), so what is the difference?

The benefit is that a 64-bit int can be handled natively in C#, C++, and numerous databases (mine included) and costs 8 bytes per instance (1/4 the cost of a UUID stored as a 32-byte string). Equality checks on an int64 are presumably much faster than on a 32-byte string (right?).

Even if you must transfer the IDs to users as strings (JSON), the range of Int64 is ±9,223,372,036,854,775,807, which is 19 digits (20 characters with a sign). As a string that is at most 20 bytes, still cheaper at maximum length than the 32-byte UUID. Many random values will be only 4–10 digits long (4–10 bytes).

64-bit Int in Elixir:

I think Elixir can work with 64-bit ints indirectly through Decimal:
https://hexdocs.pm/decimal/readme.html

Is Decimal highly inefficient for things like checking equality or converting to JSON in some way that means we should not use it like this?

I know JavaScript doesn’t support 64-bit ints either, but I don’t need JavaScript (and where one did, the IDs could be handled as strings in that layer, so this wouldn’t break anything).

64-bit “UUID” Generation:

Besides Twitter, everyone I see is either using sequential numbering or 128-bit UUIDs, and I don’t see why.

I see some people thinking the same as me when I search: https://dba.stackexchange.com/questions/16040/what-is-the-best-identifier-for-a-userid-64-bit-integers-uuid-v5-or-64-char

Yet I see few to no thoughts on how to actually do this. One idea is just to use a completely random 64-bit int. Since Elixir doesn’t natively support these, would the idea be to write something like a Rust program for it and use Rustify to generate a queue of them for Elixir to pick from?

Twitter says they do theirs to be roughly sortable by time as follows: “To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number. Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).”

https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake
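
For reference, here is a minimal sketch of that composition in Elixir, assuming Twitter’s published field widths (41-bit timestamp, 10-bit worker number, 12-bit sequence); the epoch constant and the worker/sequence bookkeeping are illustrative only:

```elixir
defmodule Snowflake do
  import Bitwise

  # Twitter's custom epoch (2010-11-04, in milliseconds).
  @epoch_ms 1_288_834_974_657

  # Compose a 64-bit ID: 41 bits of timestamp | 10 bits of worker | 12 bits
  # of per-worker sequence. IDs generated this way sort roughly by time.
  def compose(timestamp_ms, worker_id, sequence) do
    (timestamp_ms - @epoch_ms) <<< 22 |||
      (worker_id &&& 0x3FF) <<< 12 |||
      (sequence &&& 0xFFF)
  end
end

# Snowflake.compose(System.system_time(:millisecond), 1, 0)
```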

UUID vs. 64-bit Int vs. Random String

Another approach I see discussed is using a random string: https://github.com/puid/Elixir. The article I found that through also agrees a UUID (as a 32- or 36-byte string) is highly inefficient, for many of the reasons I gave: https://medium.com/@dingosky/understanding-random-ids-2768d137f02

If you are checking for collisions each time anyway, maybe this is still efficient and keeps data costs down. But perhaps it is still less efficient to work with than a 64-bit int?

Uniqueness Across Types?

The last question I have on this subject: if you are handling users, posts, and media all with 64-bit ints, is there any need to enforce uniqueness across different data and object types? I.e., can an abuse report have the same ID as a user profile or a photo?

If your database already separates these different types of data out, and you handle them distinctly, my inclination is that there is no need to enforce uniqueness across data types, as that would also require multiple queries against the different tables to ensure absolute uniqueness.

Picking a good system for user/content identification seems important. Any thoughts or ideas on best practice?

Really? I thought the whole point of using 128-bit UUIDs is that they can be generated in a distributed fashion. You can’t do this with only 64 bits because the odds of collision would be real.

1 Like

See the birthday paradox.

Unless your system has some capacity for appropriately dealing with multiple things potentially sharing IDs, you are at some risk using these without checking. I have read extensively on it and, while I’m not an expert (just going from what I’ve read), where uniqueness is necessary, “just trust me bro” is not a safe convention.

Don’t trust me, trust the math. And someone did the math:

By the way, generating a short random binary and checking for collisions is trivial in Elixir; I did it myself (only 48 bits; my application is small). It seems like you are already convinced, so why not just do it?
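
Something like this, for example; a minimal sketch where `exists?` stands in for whatever lookup your own store provides:

```elixir
defmodule RandomId do
  # Generate a random 64-bit unsigned integer from crypto-strong bytes.
  def generate do
    <<id::unsigned-integer-size(64)>> = :crypto.strong_rand_bytes(8)
    id
  end

  # Retry until the ID is unused. `exists?` is a hypothetical one-argument
  # function backed by your database, an ETS table, etc.
  def generate_unique(exists?) when is_function(exists?, 1) do
    id = generate()
    if exists?.(id), do: generate_unique(exists?), else: id
  end
end
```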

1 Like

I am asking because I am wondering if my logic is sound. I also asked numerous questions in my post above about implementation specifics, such as how to work with the Int64 type in Elixir, the best way to generate a random one, whether I am correct about the efficiency gains, etc.

I am also wondering why more people are not using 64-bit ints, given they seem to have many advantages and no significant drawbacks.

Why do people work with 128-bit UUIDs when they are highly inefficient, take far more data to store and transmit, can’t be compared or worked with as efficiently, and the risk of collisions is low in any reasonable system?

I would never design a system that can’t handle a collision either way (which I presume requires either checking the database for the ID before inserting, or having the database return an error when you try to insert something with the same primary key ID as an existing row).

For example, as discussed here: https://stackoverflow.com/questions/65484022/is-there-a-way-to-make-guid-100-collision-safe

The only way to GUARANTEE uniqueness without a sequential identity column is to look for the ID in the table first, and generate a new ID if it already exists. That’s not a free operation, of course, so you need to weigh the performance hit of that check versus the (infinitesimally small) probability of a collision. You might be better off just trying an insert, and if a collision occurs, catching the error and trying again.

I can’t speak for others; I use all 3:

  • sequential IDs for local database primary keys
  • UUIDs for distribution
  • random short IDs with collision checking for ephemeral things

They are all fine and all have their pros and cons. There is no one true way to make IDs.

The only advantage I can see to a UUID system is where you have no central database that can easily check, or return an error, when you try to insert a row whose primary key already exists in the table.

If you are storing identifiers in a main database that doesn’t already have an auto-incrementing int ID system (so you must manage your own IDs, just as Twitter said they had to with Cassandra), then I can see no good reason to use a UUID, which is bulky and provides no clear advantage over a shorter random string or, likely best of all, an Int64.

As long as the collisions are rare and they are handled (e.g. the database returns an error, you generate a new ID, and try again), you are paying no net penalty by using the shorter/faster ID type.

(I am not stating this as fact; I am just thinking out loud, as the broad convention of UUIDs makes no sense to me.)

3 Likes

I haven’t ever seen a system that stores UUIDs as VARCHAR(32) TBH; in my experience they are either old and use VARCHAR(36) (including the - characters) or new and use native UUID types (which store 16 bytes).

This is exactly why people don’t use 64-bit values this way: checking every time is expensive. Also, the birthday paradox bites massively sooner with 8-byte values than with 16-byte ones.

Here’s another nasty trap for the unwary with 64-bit integers: unless your JSON serializer is very careful about values larger than 2^53, you could end up transmitting them as integers in JSON and losing precision when the values are handled by JavaScript. Roundoff error in foreign keys is no fun for anybody…
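
For example, here is a quick sketch (assuming Jason as the serializer) of the difference between sending such an ID as a bare number and as a string:

```elixir
# 2^53 + 1 cannot be represented exactly as a JavaScript Number, so a bare
# integer in the JSON silently rounds on the client.
id = 9_007_199_254_740_993

Jason.encode!(%{id: id})
#=> "{\"id\":9007199254740993}"     # JS parses this as 9007199254740992

Jason.encode!(%{id: Integer.to_string(id)})
#=> "{\"id\":\"9007199254740993\"}" # safe: arrives as a string
```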

5 Likes

Yes, I can see a benefit to UUIDs if your database stores them natively at 128 bits (mine doesn’t). But you’re still paying 32 or 36 bytes in transit every time you send one back and forth to users as a string.

For example, Elixir can be used well for multiplayer gaming applications. Let’s say you have 100 users in a game.

If you want to update every user’s position/rotation/appearance at 10 fps (a crude system, updating everyone every time), you are paying 36 * 100 = 3.6 KB on every update just for the UUIDs, and 36 KB/s each second just sending UUIDs. This is more than double the data of streaming a 128 kbps MP3.

Or you have to make a reference table mapping users’ IDs to shorter IDs to reduce this, which is just more work.

I just mean to say there are reasons one would not want IDs in such a bulky format without a good reason for it. I would have to see how my database system handles attempted insertions on existing primary keys, but if it automatically returns an error, no pre-checking is necessary and one can presumably use shorter identifiers at no cost.

Such collisions would still be very rare in any reasonable system. I am pretty sure every DB must return an error if you try inserting a record with an existing primary key. What else would it do? Overwrite? I can’t imagine. That would be potentially dangerous.

Identifying users/content by 64-bit int (like Twitter) vs. 128-bit UUID (string)? Isn’t a 64-bit int far more efficient?

Can you elaborate on what you mean by efficient here? As @al2o3cr pointed out, there are optimizations that depend on what you’re looking for.
I’d personally start by first specifying your current constraints (DB system, network restrictions, etc.), just so the discussion can go towards what you really need.

2 Likes

This simply isn’t correct. PostgreSQL, MSSQL, and Oracle, to name a few, are not storing UUIDs as strings, nor are they processing them as strings. These databases* store and process UUIDs as 16 bytes. Databases such as PostgreSQL will present the UUID in canonical string form, but that’s just presentation. There are database products which will handle UUIDs as strings (the only one that comes to mind is MySQL, and I’m not sure it still does), but that’s not sufficient to support such a generalization.

(*OK, there are some exceptions where Oracle actually stores 19 bytes)

You’d be surprised at the subtle kinds of bugs that can crop up because of compatible, non-unique IDs across relations. Since most systems, especially older ones, use some form of sequential integer to key records, you’ll sometimes see a bug where joins between header and detail records use the wrong keys: for example, instead of a join ON pohead_id = poitem_pohead_id, you’ll see ON pohead_id = poitem_id (you can infer the tables from the column names). Because in header/detail relationships like this the detail relation tends to have many more records than the header relation, the primary keys of detail records will almost certainly match some header record IDs. So I’ll get data, but that data may be nonsensical (in the best case) because I’m joining headers and details that aren’t really associated; in the worst case the data will be close enough to the real thing to be believable. With database-wide unique identifiers this problem doesn’t exist, since I can never inadvertently join records like I just described.
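
(To make that concrete, here is the bad join in hypothetical Ecto form; `PoHead`, `PoItem`, and their fields are illustrative stand-ins:)

```elixir
import Ecto.Query

# Bug: joining detail primary keys against header primary keys. With
# sequential integers this returns plausible-looking but wrong rows;
# with database-wide unique IDs it simply matches nothing.
from h in PoHead,
  join: i in PoItem,
  on: h.id == i.id,            # should be: on: h.id == i.pohead_id
  select: {h.number, i.line}
```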

Of all the problems that keep me awake at night, UUID collisions are not one. In practice this just isn’t a concern; sure, if I assume UUIDv4 (or the soon-to-arrive v7) and a poor RNG at work, then the chances of collision could become realistic… but I’d have other, probably more serious, problems at that point. The reality is that many, many other failure modes are more likely, including concurrent cosmic-ray hits defeating my ECC memory protections. For any real-world transaction I might attempt to process, whatever I do to recover from those is generally good enough for the colliding-UUID case, even if that means telling the user something went wrong (:person_shrugging:) and to try again. In reality, I could say the same for 64-bit numbers too, though (unsurprisingly) the breathing room is naturally reduced. Finally, most of the time UUID uniqueness matters, I’m enforcing it with a unique constraint on the column (which happens automatically if it’s a primary key), which will blow up if I try to enter a duplicate… so no special checks are needed aside from dealing with transactions blowing up for any reason.
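
(As a sketch of that last point, assuming Ecto on PostgreSQL: declaring the UUID column as the primary key gives you the unique index, and therefore the duplicate-insert blow-up, for free. Table and column names are illustrative.)

```elixir
defmodule MyApp.Repo.Migrations.CreatePosts do
  use Ecto.Migration

  def change do
    # The primary key's implicit unique index is what rejects a duplicate
    # insert, whether the key is a UUID or an integer.
    create table(:posts, primary_key: false) do
      add :id, :uuid, primary_key: true
      add :body, :text
    end
  end
end
```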

So… why UUIDs?

Because there’s typically good library support and broad understanding of how they work (you might call it a standard). Many products have UUID support in the box without needing to resort to libraries, specialized knowledge, or bespoke solutions. This isn’t to say there aren’t those who need to push performance well beyond what you can achieve with UUIDs, but for most, that’s just overthinking it.

1 Like

Best solution: Short Random String?

Upon further reflection and based on the replies so far, assuming your database returns an error on insertion of a duplicate indexed unique value (which I think all should), I am actually thinking the most efficient solution is an 8-character or so random alphanumeric string, using something like this:

62^8 = 2.18340105585e+14

If your database automatically returns collisions as errors and you automatically regenerate the ID and try again (as I think any safe system should regardless), then even if these collisions happen once a week or once a day you still come out far ahead on efficiency overall.

If you go to a 12-character random case-insensitive alphanumeric string (in case the database primary key is not case-sensitive), this is only 12 bytes (1/3 of the 36-byte UUID string) and its value count (36^12 ≈ 4.7e18) is on the same order of magnitude as the positive range of a 64-bit int.
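
Generating one of these is a few lines in Elixir; a sketch (note that `Enum.random/1` is not cryptographically strong, so swap in :crypto-based sampling if the IDs must be unguessable):

```elixir
defmodule ShortId do
  @alphabet String.graphemes(
              "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
            )

  # 8 characters over a 62-symbol alphabet: 62^8 ≈ 2.18e14 possible values.
  def generate(length \\ 8) do
    1..length
    |> Enum.map(fn _ -> Enum.random(@alphabet) end)
    |> Enum.join()
  end
end
```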

Why “Efficiency”

When I speak of efficiency, I refer to data storage costs as well as data transit. For example, I would in fact be doing some multiplayer gameplay where I would want to convey users’ positions/rotations/appearance/etc. along with their IDs on a continuously updating basis.

In terms of data transit, I believe random string IDs do not compress well, as they are already random data. I gave an example where you could be sending 36 KB/s per connected user just on UUIDs (100 users * 36 bytes * 10 fps).

By contrast, using 8-character strings in the JSON cuts that transit cost by about 78% in this situation and avoids having to map users to shorter temporary IDs on the fly.

Most importantly, if the database does return collisions as errors, and you can handle this and they are rare, why store bulky IDs for no benefit?

Thanks for your thoughts. I can see the argument for the UUID system perhaps if:

  1. Your database can store them natively at 128 bits rather than as 32/36-byte strings.
  2. Your database does not return an error if you try to insert a duplicate primary key record in a table (though I would have to wonder what does happen in that situation? I have not investigated this across systems…)
  3. You don’t care about the data storage or transit costs.

In a simple situation, you can cut your data storage and transit costs on IDs by 78%, as I said, by switching to an 8-character random string (8 bytes vs. 36 bytes of string data).

And if you generated so much data that you were getting collisions often enough to be a bottleneck (almost impossible), you could just increase to 9-, 10-, or 12-character strings and still save massive amounts of data.

I can’t think of a serious RDBMS that doesn’t enforce uniqueness of primary keys. That would be a compelling reason not to use that database product, or an indication that you just don’t have a uniqueness requirement.

But the question of UUIDs and the question of uniqueness enforcement aren’t in any way related, nor a good argument for or against UUIDs. UUID generation, which should be unique, is one way a UUID can get into a table, but far from the only way. If I create a record in my application and decide to generate IDs in the application rather than the database, a database allowing non-unique primary keys would let me insert the record twice… the UUID doesn’t save me in that case. If I’m importing external data, the UUID may come from somewhere else; again, if I mistakenly re-import the same data, a lack of uniqueness constraints allows me to corrupt the data and the UUIDs provide no cover. So there’s no good argument here for UUIDs. In fact, if there exists a general argument for having globally unique identifiers, it applies to UUIDs as much as to any imaginable scheme for creating them… the only differences at that point are the relative pros/cons between schemes.

2 Likes

I personally don’t trust my users in any way whatsoever, so I have already decided I am not letting them generate anything without at least checking their work. :grinning:

And yes, there are two circumstances I have considered in which you would attempt to insert a record with an ID that already exists in the table:

  1. ID generation collision
  2. Duplicate user/server attempt (user or server submits same attempt twice)

Since you must handle #2 already anyway, why not just handle #1 at the same time, and then you can get away with much shorter IDs? (Like, for example, an 8-character alphanumeric string.)

I.e.: the user requests a data addition and sends the request to Phoenix; Phoenix generates an ID and attempts the database insertion; on failure it handles #1/#2 (or another error), generates a new ID, and tries again; then it returns the successful ID to the user with a completion message.
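
A rough sketch of that flow with Ecto; `Post`, its changeset, and `Ids.generate/0` are hypothetical stand-ins for your own schema and generator:

```elixir
defmodule MyApp.Content do
  alias MyApp.{Ids, Post, Repo}

  # Insert with a freshly generated ID; on a unique-constraint error,
  # regenerate the ID and retry a bounded number of times.
  def insert_with_retry(attrs, retries \\ 3) do
    changeset =
      %Post{}
      |> Post.changeset(Map.put(attrs, :id, Ids.generate()))
      |> Ecto.Changeset.unique_constraint(:id, name: :posts_pkey)

    case Repo.insert(changeset) do
      {:ok, post} ->
        {:ok, post}

      # For brevity this retries on any insert error; a stricter version
      # would check the changeset for the :id unique-constraint violation.
      {:error, _changeset} when retries > 0 ->
        insert_with_retry(attrs, retries - 1)

      {:error, changeset} ->
        {:error, changeset}
    end
  end
end
```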

It sounds like GUIDs/UUIDs are used more as a matter of habit or convention than for any benefit, and they are unnecessarily large and accomplish nothing special.

The downside compared to a traditional sequence ID is performance. In addition to the insertion speed you probably already considered, querying will also be slower, because auto-incremented IDs are densely populated and also provide useful locality for B+ tree traversal. UUIDs have the same problem, by the way.

5 Likes

This is the one real issue with UUIDv1-4 that somehow hasn’t come up. The extra 8 bytes of data for a key reference is a relatively small impact as a percentage of overall data and tends to factor low in the cost/benefit choices for globally unique identifiers, but once you get into a modestly large number of records where you want efficiently indexed data, the locality of values in the indexes increasingly matters. UUIDv1, while not randomly generated, doesn’t have a date/time format conducive to ordering; and UUIDv4 (which I assume most are thinking of here when saying ‘UUID’ generically) is the biggest offender in this regard. This is where UUIDs break down for database key usage. However, UUIDv6-8 are intended to address these deficiencies (to varying degrees) specifically for database key usage.

I think UUIDv7 is being worked on to land in PostgreSQL 17 (I could be very wrong about that) and I’m looking forward to it.

As for random 64-bit values, they won’t correct for the UUID indexing issue… they simply duplicate it. I would guess that efficient indexing, search, and sorting are part of the reason other implementations of 64-bit unique IDs typically work something like a date/time or node number into their format and are not just purely random.

And for anyone just reading this thread generically: unless you really have an identifiable need for a globally unique identifier and you can explain (to yourself) cogently why that’s the case… just use sequential integers of some size for keys. They are easy to autogenerate in the database, databases play very nicely with them in that they index well and are fast to process, and most anyone you might work with will likely have experience with them if they have any data-backed application experience at all.

4 Likes

I don’t see how PUID solves anything at all. Note the “probably” in “cryptographically strong probably unique identifiers”.

If anything, it’s a strictly string identifier, which forfeits any performance gains you would get from using an 8-byte vs. a 16-byte ID in the DB, as you’ll have to serialize/deserialize to another internal ID format (an assumption on my part, because using string IDs for primary/foreign keys would, I’d think, incur a performance penalty).

Furthermore, you were already told that most databases store UUIDs as 16-byte values internally, so continuing to argue as if that’s not the case is… puzzling.


And OK, I get you not liking the 36 KB/s thing, fair is fair, but you yourself pointed at a solution that many game engines use: a mapping table. Every self-respecting game pushes protocol compression as far as realistically possible, to the point of player IDs using 4 bits in an older game whose name now escapes me.

Whatever ID you use, you’ll need a mapping table if you have constant traffic. It becomes an irrelevant implementation detail.
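
A minimal sketch of the idea, as pure data (names are illustrative): hand each full ID a small per-session integer and send only that on the wire:

```elixir
defmodule WireIds do
  # Session-local mapping from full IDs (UUIDs, int64s, whatever) to
  # small integers suitable for a compact wire protocol.
  def new, do: %{next: 0, by_id: %{}}

  # Return the existing wire ID for this full ID, or assign the next one.
  def assign(%{next: n, by_id: by_id} = state, full_id) do
    case by_id do
      %{^full_id => wire} -> {wire, state}
      _ -> {n, %{state | next: n + 1, by_id: Map.put(by_id, full_id, n)}}
    end
  end
end

# {wire, state} = WireIds.assign(WireIds.new(), "some-full-uuid")
```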

Lastly, most DB engines nowadays support UUIDs quite well. Choosing not to use them is going down the road less traveled, and that can introduce new and exciting performance problems.

Frankly, IMO you are overthinking this. Your one concretely stated problem has a trivial solution. I’d be more interested in a follow-up post, after you have tried a mapping table, sharing your findings.


EDIT: Warframe, my guilty-pleasure game, never goes above ~65 KB/s even in a full party of 4 people with constant action (shooting, melee-ing, kills, interacting with objectives, moving forward in a map packed with enemies, etc.). Apparently they compress presence and other traffic as well.

1 Like

Yet I see few to no thoughts on how to actually do this. One idea is just to use a completely random 64-bit int. Since Elixir doesn’t natively support these, would the idea be to write something like a Rust program for it and use Rustify to generate a queue of them for Elixir to pick from?

Twitter says they do theirs to be roughly sortable by time as follows: “To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number. Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).”

FYI there are lots of libraries for Twitter’s snowflake IDs, including in Elixir:

Twitter, Instagram, Discord, and Mastodon all use snowflake IDs.

4 Likes

This is not why they are used. It is for distributed and disconnected generation.

128 bits is sufficiently large that the chance of collision is infinitesimally small: by the birthday bound you would need on the order of 2^(128/2) = 2^64 random IDs, about 18 quintillion (an 18 followed by 18 zeros), before a collision becomes likely.
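
To put a number on that with the standard birthday approximation (the trillion-ID workload below is an arbitrary example):

```elixir
# p ≈ 1 - e^(-n²/2N): probability of any collision after n random draws
# from a space of N values.
n = 1.0e12                  # say, a trillion IDs generated over a lifetime
space = :math.pow(2, 128)   # size of the 128-bit ID space

p = 1 - :math.exp(-(n * n) / (2 * space))
#=> ~1.5e-15, vanishingly unlikely even at that scale
```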

If you still want to argue that you need to check for collisions out of some sense of paranoia, then let’s indulge that paranoia for a bit, shall we…

It is more likely that your RAM and storage are corrupted by gamma rays, or that the Earth is annihilated by an asteroid, or that the universe is actually in a false vacuum and spontaneously collapses to a lower energy state, extinguishing the cosmos and the fabric of reality.

Considering all the ways the Earth could be destroyed in any given year by those events and others, the probability of Earth’s destruction is estimated at no more than 1 in a billion (9 zeros); yet that is practically a guarantee compared to an improbable collision rate of 1 in 18 quintillion.

4 Likes