I am looking at the best way to identify my users and content on a social type app (user profiles, media, chat messages, etc).
UUID vs. 64 bit Int
I don’t like sequentially numbered systems.
As I understand it, a GUID/UUID is a 128 bit int (16 bytes) which in practice is usually stored as a 32 character hex string = 32 bytes (36 with the hyphens).
This seems terribly inefficient. Since very little can handle 128 bit ints natively, what is the point in using them? You end up with double the storage and data transfer cost for them. Since you're working in strings, I presume equality comparisons or matching become more costly as well.
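To make the size comparison concrete, here is a minimal Python sketch (Python is only used for illustration; the variable names are my own). It shows the three common encodings of the same UUID. Whether you pay 16, 32 or 36 bytes depends on how the value is stored, and some databases (PostgreSQL's `uuid` type, for example) keep the 16 byte form.

```python
import uuid

# A random (version 4) UUID; the value itself is a 128-bit integer.
u = uuid.uuid4()

raw = u.bytes        # 16 raw bytes, what a binary/native UUID column stores
hex_text = u.hex     # 32 hex characters, no hyphens
canonical = str(u)   # 36 characters, hyphenated text form

print(len(raw), len(hex_text), len(canonical))  # 16 32 36

# The text forms are only encodings; the raw bytes round-trip losslessly.
round_tripped = uuid.UUID(bytes=raw)
print(round_tripped == u)  # True
```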
Advantage of 64 bit Int:
I note Twitter uses 64 bit unsigned ints for their post and user identification:
https://developer.twitter.com/en/docs/twitter-ids
Twitter is one of the biggest websites in the entire world. If they can manage this, why don’t more people do this? As long as you are checking for collisions against your database every time you generate a new ID for something, and you are doing this sequentially (calls not casts), then why not?
Even if you use 128 bit UUIDs you must still check for collisions and do it sequentially (birthday paradox), so what is the difference?
The benefit is that a 64 bit int can be handled natively in C#, C++ and numerous databases (mine does) and costs 8 bytes per instance (1/4 the cost of the 32 byte UUID string). Comparing two int64 values is presumably going to be much faster than comparing two 32 byte strings (right?).
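For a rough sense of how the birthday paradox scales with ID width, here is a small sketch using the standard approximation p ≈ 1 - exp(-n(n-1) / 2N) for n random IDs in a space of N values. The function name and the ID counts are my own picks for illustration, and it assumes the IDs are uniformly random. A version 4 UUID carries 122 random bits, so that is the width used for comparison.

```python
import math

def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate chance of at least one collision among n_ids
    uniformly random IDs drawn from a space of 2**bits values."""
    space = 2.0 ** bits
    exponent = -n_ids * (n_ids - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

for n in (10**6, 10**9):
    p64 = collision_probability(n, 64)
    p122 = collision_probability(n, 122)  # random bits in a v4 UUID
    print(f"{n:>13,} ids: 64 bit {p64:.3e}   122 bit {p122:.3e}")
```

With a billion random 64 bit IDs the estimate comes out at a few percent, so a uniqueness check (or a unique constraint) matters much more there than with UUIDs.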
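Even if you must transfer the IDs to users as strings (JSON), the range of Int64 is -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, which is at most 19 digits (20 characters with the minus sign; the unsigned 64 bit maximum is 20 digits). As a string this is max 20 bytes, which is still cheaper at maximum length than the 32 byte UUID. Small values are shorter still (4-10 digits = 4-10 bytes), although uniformly random 64 bit values will nearly always be 18-20 digits long.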
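The string lengths are easy to check. This is a minimal sketch (constant names are mine) that prints the decimal length of the 64 bit extremes and estimates how often a uniformly random unsigned 64 bit value would have 10 or fewer digits.

```python
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1
UINT64_MAX = 2 ** 64 - 1

lengths = {}
for name, value in [("int64 min", INT64_MIN),
                    ("int64 max", INT64_MAX),
                    ("uint64 max", UINT64_MAX)]:
    text = str(value)
    lengths[name] = len(text)
    print(f"{name:>10}: {text} ({len(text)} characters)")

# Share of all unsigned 64-bit values that fit in 10 decimal digits or fewer.
short_share = 10 ** 10 / 2 ** 64
print(f"share with <= 10 digits: {short_share:.1e}")
```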
64 bit Int in Elixir:
Elixir, I think, can work with 64 bit ints indirectly through Decimal:
https://hexdocs.pm/decimal/readme.html
Is Decimal so inefficient for things like equality checks or converting to JSON that we should not want to use it this way?
I know JavaScript doesn't support 64 bit ints natively either, but I don't need JavaScript (and in cases where one did, you could deal with them as strings in that area so this wouldn't break anything).
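The "send it as a string" workaround can be sketched like this. JavaScript numbers are IEEE 754 doubles, the same representation as a Python float, so a float stands in here for what a JavaScript client would see. The payload shape and the `id` field name are only assumptions for the example.

```python
import json

big_id = 2 ** 63 - 1  # largest signed 64-bit value

# A double has a 53-bit significand, so this ID does not survive the trip.
as_double = float(big_id)
survives_as_number = int(as_double) == big_id
print(survives_as_number)  # False

# Sending the ID as a JSON string keeps every digit intact.
payload = json.dumps({"id": str(big_id)})
restored = int(json.loads(payload)["id"])
print(restored == big_id)  # True
```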
64 bit “UUID” Generation:
Besides Twitter, everyone I see is either using sequential numbering or 128 bit UUID, and I don’t see why.
I see some people thinking the same as me when I search: https://dba.stackexchange.com/questions/16040/what-is-the-best-identifier-for-a-userid-64-bit-integers-uuid-v5-or-64-char
Yet I see few to no thoughts on how to do this if so. One idea is just to use a completely random 64 bit int. Since Elixir doesn’t natively support these, would the idea be to create something like a Rust script for it and use Rustify to generate a queue of them for Elixir to pick from?
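As a point of comparison, a completely random 64 bit ID needs nothing more than the operating system's random source. This is a minimal sketch with hypothetical function names. The signed variant is there because many databases only offer a signed 64 bit integer column.

```python
import secrets

def random_id64() -> int:
    """Uniformly random unsigned 64-bit integer from the OS CSPRNG."""
    return secrets.randbits(64)

def random_signed_id64() -> int:
    """The same 64 random bits, shifted into the signed int64 range
    so the value fits a signed BIGINT-style column."""
    return secrets.randbits(64) - 2 ** 63

print(random_id64())
print(random_signed_id64())
```

Generation alone does not guarantee uniqueness, so this would still be paired with a uniqueness check or constraint on insert.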
Twitter says they do theirs to be roughly sortable by time as follows: “To generate the roughly-sorted 64 bit ids in an uncoordinated manner, we settled on a composition of: timestamp, worker number and sequence number. Sequence numbers are per-thread and worker numbers are chosen at startup via zookeeper (though that’s overridable via a config file).”
https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake
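The composition Twitter describes (timestamp, worker number, sequence number) could be sketched roughly as below. This is not Twitter's code. The 41/10/12 bit split follows the commonly described Snowflake layout and is an assumption here, the epoch is an arbitrary one I picked, and the class name is hypothetical. The worker number is passed in by hand rather than assigned by something like zookeeper.

```python
import threading
import time

EPOCH_MS = 1_577_836_800_000  # assumption: custom epoch, 2020-01-01 UTC
WORKER_BITS = 10              # assumption: up to 1024 workers
SEQUENCE_BITS = 12            # assumption: up to 4096 ids per ms per worker
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1

class SnowflakeLikeGenerator:
    """Roughly time-sortable 64-bit ids: timestamp | worker | sequence."""

    def __init__(self, worker_id: int):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self._worker_id = worker_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms() -> int:
        return time.time_ns() // 1_000_000

    def next_id(self) -> int:
        with self._lock:
            # Never let the timestamp move backwards if the clock is adjusted.
            now = max(self._now_ms(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # Sequence used up for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._sequence = 0
            self._last_ms = now
            return (((now - EPOCH_MS) << (WORKER_BITS + SEQUENCE_BITS))
                    | (self._worker_id << SEQUENCE_BITS)
                    | self._sequence)

gen = SnowflakeLikeGenerator(worker_id=7)
print(gen.next_id(), gen.next_id())
```

Because each worker has its own number and sequence, no database lookup is needed to avoid collisions, as long as no two workers share a worker number.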
UUID vs. 64 bit Int vs. Random String
Another approach I see discussed is using a random string: https://github.com/puid/Elixir. The article where I found it also argues that a UUID (as a 32 or 36 byte string) is highly inefficient, for many of the same reasons I gave: https://medium.com/@dingosky/understanding-random-ids-2768d137f02
If you are checking for collisions each time anyway, maybe this is still efficient and keeps data costs down. But perhaps it is still less efficient to work with than a 64 bit int?
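To compare the two on size, here is a small sketch that encodes the same 64 random bits as a URL-safe base64 string and as a decimal number. It does not use the puid library's API, only the Python standard library, and the variable names are my own.

```python
import base64
import secrets

raw = secrets.token_bytes(8)  # 64 random bits

# URL-safe base64 without padding: 11 characters for 8 bytes.
text_id = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

# The same bits as an unsigned integer, up to 20 decimal digits as text.
int_id = int.from_bytes(raw, "big")

print(text_id, len(text_id))
print(int_id, len(str(int_id)))

# The string decodes back to the same 8 bytes once the padding is restored.
decoded = base64.urlsafe_b64decode(text_id + "=")
print(decoded == raw)  # True
```

So a random string is more compact as text, while the integer is more compact (8 bytes) when stored or compared natively.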
Uniqueness Across Types?
The last question I have on this subject: if you are handling things like users, posts and media all with 64 bit ints, is there any need to enforce uniqueness across different data and object types? I.e., can an abuse report have the same ID as a user profile or photo?
If your database already separates these different types of data and you handle them distinctly, then my inclination is that there is no need to enforce uniqueness across types, since doing so would require additional queries against the other tables to guarantee absolute uniqueness.
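Per-table uniqueness can be sketched with an in-memory SQLite database. The table names, column names and helper function are hypothetical. The sketch lets the primary key constraint detect a collision and retries with a new random ID, rather than querying first, and it shows the same ID living in two tables without conflict.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, reason TEXT)")

def insert_with_random_id(table: str, column: str, value: str,
                          max_attempts: int = 5) -> int:
    """Insert a row under a random id, retrying if the id is already taken.
    table and column are trusted constants here, never user input."""
    for _ in range(max_attempts):
        new_id = secrets.randbits(63)  # SQLite integers are signed 64-bit
        try:
            conn.execute(
                f"INSERT INTO {table} (id, {column}) VALUES (?, ?)",
                (new_id, value),
            )
            return new_id
        except sqlite3.IntegrityError:
            continue  # collision within this table: draw another id
    raise RuntimeError("could not find a free id")

user_id = insert_with_random_id("users", "name", "alice")

# The same id in a different table is fine: uniqueness is per table.
conn.execute("INSERT INTO reports (id, reason) VALUES (?, ?)",
             (user_id, "spam"))

# Reusing it within the same table is rejected by the primary key.
try:
    conn.execute("INSERT INTO users (id, name) VALUES (?, ?)",
                 (user_id, "bob"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```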
Thoughts?
Picking a good system for user/content identification seems important.
Any thoughts or ideas on best practice in this subject?