Uuid vs nanoid: ways to express IDs compactly

fireproofsocks · June 22, 2022, 1:22am

I wanted to get a reality check from the wise folks of the forum on the differences and pros/cons in uuid vs. nanoid. UUIDv4 has been my go-to for primary keys for a while, but the nanoid package is garnering some attention (there is an Elixir port). NanoID promises to save on real-estate: a human may be able to type one more easily because a NanoID can be shorter than a UUID (at least on the screen).

Just to make sure I’m getting my facts straight, my understanding is that a UUIDv4 ID is stored as binary data 128 bits long. It is often represented as an alpha-numeric string, e.g. fcfe5f21-8a08-4c9a-9f97-29d2fd6a27b9, but this is just a human-readable view of the underlying binary data. So if I’m doing my math correctly, the usual representation is as 32 characters (discounting the hyphens), each represented by a 4-bit hexadecimal number (0, 1, 2, … a, b, c, d, e, f); 32 x 4 = 128 bits.

NanoIDs, on the other hand, seem to always be strings. So if we store a UUID representation (minus the hyphens, e.g. fcfe5f218a084c9a9f9729d2fd6a27b9) as a literal string, it requires 256 bits because it is represented on disk as thirty-two 8-bit numbers, instead of thirty-two 4-bit hexadecimal numbers – i.e. storing the value as a string requires at least twice the space. I’ve seen this mistake made many times when a database schema uses a TEXT or CHAR column to store UUIDs instead of the native binary format… this mistake can really slow down indexing and queries.
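
To make the size difference concrete, here is a minimal sketch that compares the byte sizes of the three representations. It assumes 16 random bytes are a fair stand-in for a raw UUID value (a real UUIDv4 also fixes a few bits), and it uses only Erlang's :crypto and Elixir's Base module:

```elixir
# 16 random bytes stand in for the raw 128-bit UUID value (an assumption for
# illustration; a real UUIDv4 also fixes its version and variant bits).
raw = :crypto.strong_rand_bytes(16)

# Hex text without hyphens: two characters per byte.
hex = Base.encode16(raw, case: :lower)

# Canonical text with hyphens, split in the usual 8-4-4-4-12 layout.
<<a::binary-size(8), b::binary-size(4), c::binary-size(4), d::binary-size(4),
  e::binary-size(12)>> = hex

canonical = Enum.join([a, b, c, d, e], "-")

IO.inspect(byte_size(raw), label: "binary bytes")
IO.inspect(byte_size(hex), label: "hex string bytes")
IO.inspect(byte_size(canonical), label: "canonical string bytes")
```

The binary form takes 16 bytes, the bare hex string 32, and the hyphenated string 36, before any per-column overhead the database adds.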

So the question is: couldn’t we just offer a different VIEW on top of the existing UUIDv4? In other words, couldn’t we just represent those 128 bits differently to save screen real-estate? For example, if we choose an alphabet of a-z and digits 0 - 5, we would have 32 characters at our disposal and we could represent a 128 bit UUID using only 4 screen characters, e.g. pf3c. Or if we wanted to expand our alphabet, e.g. to a-z, A-Z, 0-9, plus 2 more characters – that would bring us to 64 characters in our arsenal, and we could represent the 128 bit UUID using only 2 screen characters, e.g. Q3. (This is just another way of saying “base-64 encoding”). Wouldn’t that make for nicer REST URLs? E.g. http://localhost/posts/Q3 instead of http://localhost/posts/fcfe5f21-8a08-4c9a-9f97-29d2fd6a27b9

Am I reasoning about this correctly? It feels like I’m missing something. Am I correct that UUIDv4 requires only 128 bits? So would it be useful to have a package that offered a custom and compact “view” of the UUID data? That way the database and everything else could stick to the tried-and-true UUID generation and support under the hood, but if humans were involved, a shorthand could be used to provide an easy-to-type short-hand of the UUID (e.g. using some base-64 or base-32 scheme). This is more or less the idea behind URL shorteners, it’s just a lot simpler when you only have to represent 128 bits of data.

Am I looking at this the right way? Thanks for any thoughts.

benwilson512 · June 22, 2022, 1:38am

This bit is right, and this is why things like ecto use binary_id with Postgres so that it can be stored “natively” in the DB.

Textual representations are a tricky subject. Some big considerations are “how easy to confuse are adjacent characters”. You generally want to avoid things like 0O00Oo or IlIlIlIl (that’s capital I and lower case l). So usually, particularly when you’re just dealing with transient representation, clarity is more important than succinctness.

As far as whether you can get down to 2 characters, I think that math is off. 2 characters drawn from 64 possibilities give only 64^2 = 4096 total combinations, while v4 UUIDs have 2^122 possible combinations (see “How unique is UUID?” on Stack Overflow).

mayel · June 22, 2022, 1:58am

I would also suggest ULID. There’s an ecto type for it here.

kip · June 22, 2022, 2:00am

There are some new UUID formats, UUID 6, 7 and 8 that are quite interesting and solve some of the challenges of UUIDs when it comes to database serialisation (index locality being quite a big one, distributed support too). But they are all still 128 bits, which I think for a primary key is a pretty good choice.

trisolaran · June 22, 2022, 5:11am

Math is definitely off here. A sequence of 4 characters chosen from an alphabet with size 32 is 32**4 = 1048576 ~1e+6 combinations. A 128-bit UUID has 2**128 ~ 3.4e+38 combinations. Ridiculously far apart.

antoine-duchenet · June 22, 2022, 8:16am

AFAIK, since 64 = 2**6, 2**128 = 64**(128/6).
So you would still need 128/6 ≈ 21.3, rounded up to 22 characters, to represent a 128-bit ID with a 64-character alphabet.
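
The same arithmetic works for any alphabet size. A small sketch, using a hypothetical IdLength module (not part of any library) and integer math only, so there is no floating-point rounding to worry about:

```elixir
defmodule IdLength do
  # Smallest number of characters, drawn from an alphabet of `alphabet_size`
  # symbols, whose combinations cover every possible value of `bits` bits.
  def chars_needed(bits, alphabet_size) when alphabet_size > 1 do
    target = Integer.pow(2, bits)

    # Grow the string one character at a time until alphabet_size^n >= 2^bits.
    {0, 1}
    |> Stream.iterate(fn {n, capacity} -> {n + 1, capacity * alphabet_size} end)
    |> Enum.find(fn {_n, capacity} -> capacity >= target end)
    |> elem(0)
  end
end

IO.inspect(IdLength.chars_needed(128, 16), label: "base 16")
IO.inspect(IdLength.chars_needed(128, 32), label: "base 32")
IO.inspect(IdLength.chars_needed(128, 64), label: "base 64")
```

For 128 bits this gives 32, 26 and 22 characters respectively. Integer.pow/2 needs Elixir 1.12 or later.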

jc00ke · June 22, 2022, 7:32pm

I switched to and then away from ULID if only because you cannot paste a ULID into a SQL query. There’s no built-in conversion from the textual representation of a ULID to a UUID and it was really annoying to have to manually convert. I like the ideas behind ULID, but until it’s supported in Postgres I’d stick with UUID.

I also wanted to point out another potential solution: UXID. It supports “t-shirt” sizes (small, medium, large, etc) which would get you your smaller IDs but I think they’re stored as strings so the space-concerns are still present.

stefanchrobot · June 23, 2022, 5:50am

There’s also puid:

Puid

Define modules for the efficient generation of cryptographically strong probably unique identifiers (puids, aka random strings) of specified entropy from various character sets

Examples

The simplest usage of Puid requires no options. The library adds a generate/0 function for generating puids:

iex> defmodule(Id, do: use(Puid))
iex> Id.generate()
"p3CYi24M8tJNmroTLogO3b"

By default, Puid modules generate puids with at least 128 bits of entropy, making the puids suitable replacements for uuids.

Nicd · June 23, 2022, 6:16am

Note that a UUIDv4 only has 122 bits of randomness out of the 128 bits; the version digit (first digit of the third part in the typical string representation) is always 4 and the first digit of the fourth part is [89ab].
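
A minimal sketch of where those fixed bits sit, building a version 4 UUID by hand from random bytes. This is an illustration of the layout, not necessarily how any particular library generates its UUIDs:

```elixir
# Draw 128 random bits, then overwrite the 4 version bits with 4 and the
# 2 variant bits with binary 10. Only 48 + 12 + 62 = 122 bits stay random.
<<a::48, _version::4, b::12, _variant::2, c::62>> = :crypto.strong_rand_bytes(16)
uuid = <<a::48, 4::4, b::12, 2::2, c::62>>

hex = Base.encode16(uuid, case: :lower)

# Index 12 of the hex string is the first digit of the third part (version);
# index 16 is the first digit of the fourth part (variant plus 2 random bits).
IO.inspect(String.at(hex, 12), label: "version digit")
IO.inspect(String.at(hex, 16), label: "variant digit")
```

The version digit is always "4", and the variant digit is always one of 8, 9, a or b.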

bartblast · June 23, 2022, 9:17am

Another option is snowflake id.

trisolaran · June 23, 2022, 1:50pm

I’m not a fan of UUIDs, I would use them only if strictly necessary, namely when you need distributed generation of unique IDs. Otherwise, I really don’t see the appeal. However, once the names start to morph into HUID (hopefully unique), MUID (maybe unique) or LPTUID (let’s pray they’re unique) I’ll keep away from them for good

fireproofsocks · June 24, 2022, 10:16pm

haha, I see where I made my mistake. I imagined that I could break up a binary number 128 bits in length into 2 pieces of 64 bits each. But then I made the mistake of thinking I could represent each piece with a single character from a 64-character alphabet instead of a character (or characters) that can represent 2^64 values (whoops… very wrong). 64 = 2^6, so I would need at least 22 characters to represent 128 bits (because 128 / 6 ≈ 21.33).

Thank you for checking me on this, and I appreciate the links to the alternate packages! Good to know there are so many options.

sbuttgereit · June 25, 2022, 3:05am

Yep. This is the RFC I’d like to see get adopted (an aside, your link to the RFC 404’s for me, I find it at: New UUID Formats). I’ve been watching it for a fair while now and my biggest concern is that it feels like it is moving through the standards process on geologic time scales (this is probably because of my ignorance of the standards process and my own impatience generally). Otherwise, it seems to have the right trade-offs, and speaks directly to more recent trends in using UUIDs with relational databases.

I have heard that ULID suffers some from its monotonic option definition, though this article seems to dispute that (How probable are collisions with ULID’s monotonic option? | by Gary Grossman | Zendesk Engineering)… I certainly wouldn’t expect it to be an issue for the workloads I’m dealing with. Working with PostgreSQL, the libraries I’ve seen for ULID don’t really implement handling the monotonic option at all (it’s been quite a while since I’ve looked though.)

I think if I were to do anything it would be UUIDv6/7/8: if it lands as a standard, I’d expect it to have better options for official support in the database.

[EDIT] I forgot to mention that for those in PostgreSQL-land… this is also an option:

discussed at: Sequential UUID Generators - 2ndQuadrant | PostgreSQL
and Sequential UUID Generators on SSD - 2ndQuadrant | PostgreSQL

I’ve entertained using it, but haven’t gone too far. I like that it’s a C extension and that 2ndQuadrant are somewhat behind it. I’m not overly eager to depend on a 3rd party extension without a strongly compelling use case, which I haven’t got now: the built-in stuff is sufficient.

kip · June 25, 2022, 3:12am

@sbuttgereit thanks for the heads up on the broken link. Fixed in the original post. Seems like it may have moved recently.

sbuttgereit · June 25, 2022, 3:14am

Looks like it’s been updated: draft-peabody-dispatch-new-uuid-format-04 - New UUID Formats

adamu · June 26, 2022, 4:47pm

You are basically describing Base 32:

iex(1)> Mix.install [:ecto]; alias Ecto.UUID
Ecto.UUID
iex(2)> uuid_traditional = UUID.generate
"3290925b-0956-48a4-9690-cb027c333a96"
iex(3)> uuid_binary = UUID.dump!(uuid_traditional)
<<50, 144, 146, 91, 9, 86, 72, 164, 150, 144, 203, 2, 124, 51, 58, 150>>
iex(4)> uuid_base_32 = Base.encode32(uuid_binary, padding: false, case: :lower)
"gkijewyjkzekjfuqzmbhymz2sy"
iex(5)> [traditional_string: String.length(uuid_traditional), base_32: String.length(uuid_base_32)]
[traditional_string: 36, base_32: 26]

So using Base 32 instead of Base 16 with hyphens saves you 10 characters.

You can save another 4 characters by using Base64, at the cost of being less copy/pastable because of + and / (or - and _ for the URL-safe version):

iex(6)> uuid_base_64 = Base.encode64(uuid_binary, padding: false)
"MpCSWwlWSKSWkMsCfDM6lg"
iex(7)> String.length(uuid_base_64)
22

Note that lengths 26 and 22 are guaranteed, because:

Base 32: 1 character represents 32, or 2^5 values = 5 bits. 128-bit UUID requires ceil(128/5) = 26 characters.
Base 64: 1 character represents 64, or 2^6 values = 6 bits. 128-bit UUID requires ceil(128/6) = 22 characters.
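
For the short form to work as a "view", it also has to decode back to the same 16 bytes. A minimal sketch of the round trip, reusing the binary from the session above and only functions from Elixir's Base module (the variable names are just for illustration):

```elixir
# The 16-byte UUID binary from the iex session above.
uuid_binary = <<50, 144, 146, 91, 9, 86, 72, 164, 150, 144, 203, 2, 124, 51, 58, 150>>

# URL-safe Base64 swaps + and / for - and _, so the result can go in a path.
short = Base.url_encode64(uuid_binary, padding: false)
restored = Base.url_decode64!(short, padding: false)

# Base 32 round trip, lower case and without padding.
base32 = Base.encode32(uuid_binary, padding: false, case: :lower)
restored32 = Base.decode32!(base32, padding: false, case: :lower)

IO.inspect(short, label: "url-safe base64")
IO.inspect(restored == uuid_binary, label: "base64 round trip ok")
IO.inspect(restored32 == uuid_binary, label: "base32 round trip ok")
```

The decoded binary can then be passed to the database as a normal UUID, so only the URL or screen representation changes.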