Phoenix Slug URL/Primary Key Best Practices

jswny · March 22, 2018, 3:53pm

Hi all, I have some questions on the best practices regarding primary keys for my schemas in Phoenix (Ecto – but this specifically relates to Phoenix).

Basically, I often come across a situation in which I have a schema like Post, which contains some kind of text post with a title. Often, for web/SEO purposes, it is good to have the URL use a slug of the title instead of the id. From my experience, there are basically three options for doing so:

Keep the id field and use it internally wherever possible and just use the slug field for the URLs.
Use the slug field for everything (including internal use not related to URLs) and remove the id field altogether, changing the primary key to slug.
Keep both fields and have the URL build off of /<id>/<slug> so that you can still use the id for everything, but you get the SEO benefits of having pretty URLs also.

I’m struggling to figure out which of these options to pick for my applications. The first one requires minimal work in implementation, but then I feel like I shouldn’t have slug which is basically a pseudo-primary key without removing the other primary key and making slug the full primary key anyway.

The above leads me to the second option, which also feels a bit dirty because I have to overhaul the primary key structure that is setup automatically by Ecto and Phoenix for params and seems to be the convention. This issue on GitHub by Jose Valim also leads me to think this way, especially the following:

Setting another primary key is not automated on purpose: we don’t think you should be changing the primary key that frequently. For example, a lot of Phoenix/Ecto works assuming there is an ID field. For example, when you generate a URL? We default to the ID field. When you set up an association? It defaults to the ID field.

So, I am now lead to the third option, which I haven’t actually tried to implement before but I have certainly seen it done on some websites.

However, I’d like to get some input from the community on how they prefer to handle these situations. Any feedback or insight would be appreciated

outlog · March 22, 2018, 4:20pm

1 or 3, never 2 imho - also or rather especially remember the slug might change - but you don’t want to 404 stuff because somebody changed the title…

for 1 you’ll need to keep track of old slugs and have unique constraints - for 3 there is no need…

also for seo remember to have an original canonical uri that never changes… eg /id or something.

LostKobrakai · March 22, 2018, 4:59pm

Imho slug is just another column in the db like any other. It might need to be unique (if it’s the only identifier in a url) but that alone doesn’t make it a primary key. A primary key is not only a identification for the app, but also internally used by the db for foreign keys and such.

jswny · March 22, 2018, 5:55pm

Thanks for your input, I’ve been thinking of trying to go for option 3 lately. Now that you mention it option 2 definitely seems out of the question since I don’t want my URLs to change every time the title gets updated.

How would I go about creating a separate canonical URL that never changes if I was to do the /<id>/<slug> structure though? I assume that I would make the <slug> part of the URL optional and just try to link with it for nice URL purposes?

I was just checking out what happens when I change the title of this thread. Seems like the site redirects the old URL to the new one, which I wouldn’t want to do. Seems like just appending the slug as an optional nicety on the end of the id based URL would be the only way to do what I want without worrying about storing old URLs or redirection.

Yeah I’m not really talking about the database specifics though, I’m mostly asking about the best practices in regards to using slugs in URLs and how to handle that internally, whether it means removing the usual id primary key or not as stated above.

I updated the title because I think I have a better idea of what I’m really asking now

outlog · March 23, 2018, 11:17pm

I’m talking of rel=canonical for something like /id - see SEO Best Practices for Canonical URLs + the Rel=Canonical Tag - Moz

you can also do 301s to the latest uri if that suits you better…

augnustin · May 20, 2019, 11:20am

Here’s my take a year later.

Though I find #1 more elegant, indeed, since a title (or whatever field) might change, slug can change, and you would have to store the history of the slugs to set up redirects which is a pain (and would cause uniqueness issues).

So I guess, best practice is #3.

Take how Stackoverflow URLs work:

[Rails - Send all emails with delayed_job asynchronously - Stack Overflow] is the google URL
[Rails - Send all emails with delayed_job asynchronously - Stack Overflow] redirects to the above as well as [Rails - Send all emails with delayed_job asynchronously - Stack Overflow]

I find this pretty nice.

cnck1387 · May 20, 2019, 11:44am

I used to be gung ho about using slugs and looking up records by slug everywhere (but ID was still my primary key) and I still do think it has its place for some apps but I think the third option is also a great fit in a lot of cases.

Basically number 3 is perfect for when you want the benefit of SEO but your title may change on a regular or semi-regular basis, but using slugs without an id in the URL still has its place for apps where your titles are likely not going to change.

For example I’m building a course platform right now and I plan to have /courses/some-course-title to view a course’s description page because when it comes to courses, it would be very unlikely that you change the title of the course after you release it. I also don’t want to impose IDs in the URL.

But when it comes to the individual sections and lessons, I’m for sure going to use option number 3 where DB lookups happen based off the ID and the slug in the URL is just for the sake of SEO (technically going to a slug name that doesn’t exist would still work since everything after the ID is purely SEO related). That was a common pattern to do in Rails too with to_param and setting "#{id}-#{slug}".

So I guess long story short, it depends.

Just know what you’re getting into.

If you have no ID in the URL and your slug changes then 3 things will happen:

All of your prior links will breaks (unacceptable IMO).
You would have to manually configure a redirect at the web server level (not too bad if it only happens once in a while, minimal complexity too).
Track all of your slugs and auto-redirect to the latest one (a hell of a lot more complex than the other options but likely the best solution since it’s fully automated).

Cochonours · May 20, 2019, 12:12pm

That is option 4 as I doubt SO would expose their DB IDs like that; They probably have their DB ID plus a numeric (could be anything) key that act as a slug, plus the title looking slug which is completely optional.
IMHO option 4 >> option 1 > option 3 > option 2.
I prefer when keys are typeable, so either short numeric keys or better, short texts that won’t change.

augnustin · May 20, 2019, 1:00pm

I never understood why this would be so bad to expose ids.

After all, this must be a unique identifier, and this is the one used through URLs, which is the only place where user has access after all. So why bother creating another DB column, or even a hash function?

al2o3cr · May 20, 2019, 6:59pm

Depends on the specific application; an early-stage startup might not want to expose sequential IDs to avoid disclosing the rate that database table is growing.

peerreynders · May 20, 2019, 8:37pm

CheatSheetSeries/cheatsheets/Insecure_Direct_Object_Reference_Prevention_Cheat_Sheet.md at master · OWASP/CheatSheetSeries · GitHub

http://www.agiledata.org/essays/keys.html

The fundamental issue is that keys are a significant source of coupling within a relational schema, and as a result they prove difficult to refactor. The implication is that you want to avoid keys with business meaning because business meaning changes. However, at the same time you need to remember that some data is commonly accessed by unique identifiers, for example customer via their customer number and American employees via their Social Security Number (SSN). In these cases you may want to use the natural key instead of a surrogate key such as a UUID or POID.
…
5. Don’t naturalize surrogate keys . As soon as you display the value of a surrogate key to your end users, or worse yet allow them to work with the value (perhaps to search), you have effectively given the key business meaning. This in effect naturalizes the key and thereby negates some of the advantages of surrogate keys.

Cochonours · May 20, 2019, 10:15pm

The DB PKs are supposed to be an implementation detail, not something exposed to the end users. Concerning slugs, do you really want ALL your URLs to become invalid because someday you realise you needed UUID instead of integers? It can also quickly become a nightmare when you have to handle several SQL servers because your app is not so small anymore, as your PKs won’t be the same on each server.

augnustin · May 21, 2019, 8:41am

Interesting readings, but still not 100% convinced:

@Cochonours if DB needs sharding, then one needs a unique identifier for all fields anyway, so why not move the former PK ID to this field so URLs remain valid ?

@peerreynders from your links the main argument I read is guessability is insecure. Though this can be true, it is not always, especially when it is about public URLs, which are discoverable by nature (take SO), and anyway, using a column that is not the PK would have the same effect if auto-incremented.

@al2o3cr yes, that’s correct, there are workarounds like obfuscating IDs, but indeed this is an extra layer that is close from not using the ID.

Anyway, I read that there are several solutions that match to several cases, so there is not a single way to solve this, it depends on the need.

Cochonours · May 21, 2019, 9:35am

If you can bet your life on all shards remaining identical then sure. Your shards will have PK conflicts as soon as people/admins start creating new articles, unless you move away from numeric PK and start using UUIDs but in that case your slugs will be long and horrendous. Also, chances are that the shards will diverge as some posts will be specific to one shard depending on whatever business or logical reason, and at that point you will have to redesign everything and lose all your links because you thought one column for the slug was too much work. I often work on a server project where different clients have different needs and the migrations never explicitly mentions PKs as I know I just cannot count on them being similar between instances.

As for guessability, it always brings insecurity. It’s not enough by itself of course, but it’s one information more that an attacker can use. Do you really want people to be able to guess how many articles you have created? How many people subscribed to your website? That’s the kind of info you give away when you expose PKs that are autoincremental.

peerreynders · May 21, 2019, 10:26am

I think here the idea of URL durability is creating an excuse for coupling to the implementation. However in some cases URL durability is trumped by URL discoverability via SEO.

In a foreign key relationship it is essential that the key values across relationships are equal but the actual value of the key that expresses that relationship is inconsequential. The actual value acquires meaning the moment it is exposed - becoming accidentally and permanently coupled to the actual business data.

For the longest time Amazon chose to expose their products uniformly via the ASIN (e.g. B0140DMHB2). ASINs are blocks of 10 letters and/or numbers that uniquely identify a product on a global scale. The ASIN appeared in any product URL.

Meanwhile it’s unlikely that the ASIN was the key value used to establish foreign key relationships inside the RDBMS - it’s likely that a more efficient key type was used instead. By keeping the key values opaque, data replication strategies don’t have to replicate the technology specific key values - only the actual data has to be replicated. So records relating to the same ASIN on different database servers over the globe can use entirely different key values to express the same relationships.

Then Amazon decided to adopt the ISBN as the ASIN for books. This is an example where a business key changes.

An ASIN can now be 978-0321293534 - i.e. an ISBN-13.
However the URL uses the ISBN-10: 0321293533

Amazon likely sacrificed URL durability to enhance URL discoverability via the ISBN-10.

Now had all the relationships been based on the business key (the old style ASIN, e.g. B0140DMHB2) a lot of database records would have to be migrated. The relationships would be unaffected however if they were expressed via opaque surrogate key values - only the ASIN field(s) themselves would have to be changed.

That is why it is important to never naturalize surrogate keys by exposing them.

By extension 21556050 is the identifier for that stack overflow question (it’s exposed and therefore is a business key) but the record where that data resides will likely have a separate surrogate key value, a value that could vary on every physical RDBMS it is replicated on (depending on the replication strategy).

If that is true question identifiers could be handled via a centralized service that allocates entire blocks of identifiers to other services that consume them.