UUID -- wisdom and advice

I am thinking of using UUIDs – binary keys – for a new project. A pro is that it would allow users to easily integrate locally created documents in the web app, to merge data, etc. A con is that I’ve never used such keys, so I don’t know what the pitfalls are.

1 Like

Off the top of my head:

  • UUIDs will use more space.
  • You lose the creation order associated with sequential PKs.

UUIDs have, as far as I know, two major features that traditional incrementing integer keys do not have:

  1. The ability to create them across multiple database pool connections, or even in a distributed setup, because the chance of a collision is astronomically small. This makes for a more scalable database setup, which some applications require.
  2. The prevention of item enumeration: With conventional IDs, someone can see that you are looking at ‘myapp.example.com/user/100’ which means that there are at least 100 users in the system but also that ‘myapp.example.com/user/55’ will contain info about user 55, and this procedure can be repeated to count and crawl collections where this is unwanted.
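
To illustrate the first point, here's a minimal Python sketch (Python chosen just for brevity — the same holds in any language): two uncoordinated "nodes" generate version-4 UUIDs completely independently, with no realistic chance of overlap.

```python
import uuid

# Two "nodes" generating IDs independently, with no coordination at all:
ids_node_a = {uuid.uuid4() for _ in range(10_000)}
ids_node_b = {uuid.uuid4() for _ in range(10_000)}

# With 122 random bits per ID, a collision between (or within) the two
# sets is astronomically unlikely (birthday bound: roughly n^2 / 2^123).
assert len(ids_node_a) == 10_000 and len(ids_node_b) == 10_000
assert ids_node_a.isdisjoint(ids_node_b)
```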

I do not know of any disadvantages myself, other than that they are hard(er) for a human to remember; but when you need a human-memorable token, something like a slug is more appropriate. They are also still somewhat less well supported by some systems than numerical keys are.

I would love to hear more from other people though, because I have been wondering about possible drawbacks for some time now as well.

5 Likes

About integrating locally created data: if the user can freely pick the UUIDs that are used, then you do need to handle the ‘what happens on a collision’ case.

Could you comment a bit on slug generation?

Slugs are generally shorter than UUIDs (often an approximation of the number of elements to handle is made, and later adapting to one more base-36 digit is possible), but they otherwise function similarly. These are the kind of identifiers you see in URL shorteners or as YouTube video IDs, for instance.
The other kind of slug is the one where the title of an article (or some other near-unique property of the data structures under consideration) is adapted and shortened to a URL-safe form. This is nice for SEO, but care needs to be taken to handle conflicts, and to not break old URLs when e.g. the title of an article later changes.
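
Both kinds of slug can be sketched in a few lines. This is an illustrative Python version — the function names are my own, not from any particular library, and conflict handling (e.g. appending a counter) is deliberately left out:

```python
import re
import secrets
import string

BASE36 = string.digits + string.ascii_lowercase

def random_slug(length=8):
    """Short URL-safe code, like a URL shortener or video ID would use."""
    return "".join(secrets.choice(BASE36) for _ in range(length))

def title_slug(title):
    """Derive a URL-safe, SEO-friendly slug from an article title."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

print(title_slug("UUIDs -- Wisdom and Advice"))  # uuids-wisdom-and-advice
```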

friendly_id is a Ruby library that does these things. Its README.md might give you some more info.

2 Likes

Thanks!

Depends on your storage layer and load. If, say, you are using PG and have an update-heavy workload, UUIDs can cause significant write amplification.

Steal this code: https://github.com/cdegroot/palapa/blob/master/apps/simpler/lib/simpler/unique_id.ex - shorter than uuids and you can just copy/paste the bits you need into your own code (I feel that this module is too small to warrant its own library).

1 Like

Thanks! What is your take on UUIDs slowing things down?

UUID pitfalls, hmm…

  • They aren’t easily sortable by time; you would have to extract the timestamp from them first.
  • You can only extract a timestamp if you are using UUIDs of version 1 or 2.
  • 128 bits is long, which means your indexes are going to be larger.
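
To make the timestamp point concrete: Python's standard `uuid` module, for example, exposes the 60-bit timestamp of a version-1 UUID (100-ns intervals since 1582-10-15), from which a datetime can be recovered. A version-4 UUID simply has no timestamp to extract. The helper name here is my own:

```python
import uuid
from datetime import datetime, timedelta, timezone

def uuid1_datetime(u):
    """Recover the timestamp embedded in a version-1 UUID."""
    if u.version != 1:
        raise ValueError("only version-1 UUIDs carry an extractable timestamp here")
    # u.time counts 100-ns intervals since the Gregorian calendar epoch.
    gregorian_epoch = datetime(1582, 10, 15, tzinfo=timezone.utc)
    return gregorian_epoch + timedelta(microseconds=u.time // 10)

print(uuid1_datetime(uuid.uuid1()))  # roughly the current time, in UTC
# uuid.uuid4() would raise ValueError here: it is pure randomness.
```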

If you were looking at using an eventually consistent datastore at scale, all the cool kids are using Twitter Snowflakes these days.

Snowflakes are pretty neat because:

  • They include a timestamp.
  • Are only 64-bits.
  • Guaranteed to be unique (within the scope of your application only).

The main con is that a worker has to generate them.

But they are designed to be decentralized by having the key include a timestamp, a worker number, and a sequence number. So the cost of Snowflakes is that you have to provide an identity to the workers generating them.
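
The timestamp/worker/sequence layout is easy to sketch. Here's an illustrative Python version of a Snowflake-style generator — the field widths follow Twitter's 41/10/12-bit split, but the class name and epoch are my own choices, not from any library:

```python
import threading
import time

class SnowflakeGenerator:
    """Sketch of a 64-bit Snowflake-style ID: 41 bits of milliseconds
    since a fixed epoch, 10 bits of worker ID, 12 bits of sequence."""

    EPOCH_MS = 1_288_834_974_657  # Twitter's epoch; any fixed epoch works

    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit wrap
                if self.sequence == 0:  # exhausted this millisecond; wait
                    while now_ms <= self.last_ms:
                        now_ms = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now_ms
            return ((now_ms - self.EPOCH_MS) << 22) \
                 | (self.worker_id << 12) | self.sequence

gen = SnowflakeGenerator(worker_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # IDs are time-ordered by construction
```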

For the use case you provided it sounds like Snowflakes might not be the exact answer, but they are simple enough that you can get some inspiration from them. :wink:

2 Likes

I think that the code I pasted takes 1-2 µs to generate an id. If that’s a bottleneck, I guess you have larger issues :wink:

(I like these 128 bit random ids because of the previously indicated reasons; yes, there are some potential database issues, see e.g. this excellent article, but I’m not sure I ever want billions of rows in a MySQL table anyway. YMMV)

1 Like

The bigger issue is in PG: after a checkpoint, PG will write the full page for any changed page. Depending on the workload, this can cause significant write amplification with UUID PKs, as more pages are likely to be touched.

Workload is just me now :-) but I hope it will eventually be large :yum: The main reason to have UUIDs is to be able to merge databases and to merge info created “offline”. Complicating the matter is that documents can refer to other documents. Is there a better solution than UUIDs? I am using PG (Postgres).

I think I can handle 1µs :wink:

I read the article – very informative, and I like the horizontal green part of the graph. Found this on Postgres: PostgreSQL: Documentation: 9.4: uuid-ossp – it can use the date-stamped UUID version 1 standard. However, it looks like Ecto uses UUID version 4, which is entirely random: Ecto.UUID — Ecto v3.11.1 . Do you know if there is a way around this? (I guess that I could rig a homebrew solution by prepending a date-stamp-derived string to the UUID 4 string.)
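
That homebrew idea can be sketched directly. This is an illustrative Python version (not Ecto), and note it widens the key beyond 128 bits — COMB GUIDs (and the later UUIDv7) keep the timestamp inside the UUID instead:

```python
import time
import uuid

def timestamped_uuid4():
    """Prefix a random UUID4 with a zero-padded millisecond timestamp,
    so plain string sorting follows creation order."""
    ms = int(time.time() * 1000)
    return f"{ms:013d}-{uuid.uuid4()}"

a = timestamped_uuid4()
b = timestamped_uuid4()
assert a[:13] <= b[:13]  # the timestamp prefix is non-decreasing
```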

I think you will be fine using UUIDs if that’s what you want to use. I don’t know enough about what you’re trying to accomplish, but no matter what you choose, it’s only as complicated as you make it.

Honestly, I always found it easier to apply “mock” IDs to offline stuff. Then, when syncing, I could assign a real ID. That way I know what is synced and what isn’t.

UUIDs are better in eventually consistent applications, or when applications need cross-referenced IDs. I think you may be better served by taking a step back and rethinking where you actually NEED a PK to be assigned.

Thanks re mock IDs – I believe that would work for me. I’ll cogitate on this some more before going down the rabbit hole. BTW, what is a “pk”?

Primary Key :smiley:

I would not worry too much about it at this point. We live in a world where storage hardware is extremely fast and very reasonably priced (or at least available, even in the “cloud”, unless you tie yourself to RDS, Azure SQL, etc.).

UUIDs can be assigned on the client, so you can create relations there as well. Also, check out the COMB GUID: https://www.codeproject.com/Articles/388157/GUIDs-as-fast-primary-keys-under-multiple-database
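
The COMB idea — overwriting part of a random GUID with a timestamp so the 128-bit value still fits a UUID column but sorts roughly by creation time — can be sketched like this in Python. Byte placement varies per database, and this sketch doesn't set the RFC 4122 version bits:

```python
import os
import time
import uuid

def comb_uuid():
    """COMB-style ID: the first 6 bytes are a big-endian millisecond
    timestamp, the remaining 10 bytes are random."""
    ms = int(time.time() * 1000) & 0xFFFFFFFFFFFF  # 48 bits of millis
    return uuid.UUID(bytes=ms.to_bytes(6, "big") + os.urandom(10))

a, b = comb_uuid(), comb_uuid()
assert a.bytes[:6] <= b.bytes[:6]  # timestamp prefix is non-decreasing
```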

1 Like