When to use associated schemas vs embedded schemas?

Hi,

This question is not strictly Elixir related, but as there are a lot of very knowledgeable people here, I hope someone will be able to clear some things up for me.

This is just a hypothetical question… let’s say we have an app with users, and those users can create links (something like linktr.ee), or bookmarks, or something similar. The links are not related to each other in any way, and each user has their own set of links.

One approach is to create a links table where each link is associated with a user, add an index, and then, on every page load, query those links and show them on the page. That’s the common approach, but what if we have, let’s say, a million users or more and each user has on average 20 links? I assume that even with an index, it would be a lot of work for the database.
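To make it concrete, this is roughly what I have in mind for the first approach (module and field names are just made up for the example):

```elixir
# Sketch of the "classic" association approach: links as their own table,
# each row pointing at a user, with an index on user_id.
defmodule MyApp.Links.Link do
  use Ecto.Schema

  schema "links" do
    field :url, :string
    field :title, :string
    belongs_to :user, MyApp.Accounts.User

    timestamps()
  end
end

defmodule MyApp.Repo.Migrations.CreateLinks do
  use Ecto.Migration

  def change do
    create table(:links) do
      add :url, :string, null: false
      add :title, :string
      add :user_id, references(:users, on_delete: :delete_all), null: false

      timestamps()
    end

    # The index is what keeps "give me this user's links" cheap.
    create index(:links, [:user_id])
  end
end
```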

The other approach is to create an embedded schema and hold all of the user’s links there. I assume this would be much easier on the database, as all the links are in one place. I know that writing and editing data is a little bit more complicated, but if most of the time the data is just read, is this a better approach?
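And the second approach would be something like this, with the links embedded directly on the user record (again, just a sketch):

```elixir
# Sketch of the embedded approach: all of a user's links live in a single
# column on the users table (commonly added as {:array, :map} in a migration).
defmodule MyApp.Accounts.Link do
  use Ecto.Schema

  embedded_schema do
    field :url, :string
    field :title, :string
  end
end

defmodule MyApp.Accounts.User do
  use Ecto.Schema

  schema "users" do
    field :email, :string

    embeds_many :links, MyApp.Accounts.Link, on_replace: :delete

    timestamps()
  end
end
```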

Unfortunately, I don’t have experience with databases that contain a lot of data, so I know I’m making a lot of assumptions and my question will probably cause some facepalms. I apologize for that, but it’s something that’s been bothering me for some time now and I just want to understand it better.

Thanks guys

1 Like

Do you expect a link owned by user A to be able to be transferred to user B?

If that feature is desired, then classic linked tables are maybe the slightly better approach, though even that is debatable; nothing stops you from having a context function, e.g. MyApp.Links.move(link, new_user).
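A rough sketch of what I mean, assuming the link is a regular row with a user_id foreign key:

```elixir
defmodule MyApp.Links do
  import Ecto.Changeset

  alias MyApp.Repo
  alias MyApp.Links.Link

  # Transfer a link from its current owner to another user by
  # re-pointing the foreign key.
  def move(%Link{} = link, new_user) do
    link
    |> change(user_id: new_user.id)
    |> Repo.update()
  end
end
```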

I do agree with your assessment that having a lot of data spread across several tables might stress the DB past a certain scale, though I can’t estimate what that scale would be; it might turn out to be a non-issue, as very often happens with proper DBs like Postgres.

I’d likely do embedded schemas. It could be a slightly more future-proof design.

4 Likes

No, each user has its own links or some kind of data; I picked links just as an example.

Yes, this is what I would like to know: at what scale would using regular tables become a problem, and will it even become a problem? In the past I’ve always used regular tables with associations in similar situations, but I’ve been wondering if embedding the data would be better suited for them.

If you need to search the links, I would not use embedded schemas.

2 Likes

If you’re really interested in these kinds of things, I highly recommend reading Designing Data-Intensive Applications. This book has helped grow my understanding of databases in general. It’s very well written and goes into detail about how a database actually works and what the tradeoffs are for different design decisions.

Btw, millions of rows in a database is peanuts :slight_smile: You have to add a couple of zeros before you get a database to sweat at these kinds of problems.

6 Likes

You can index and search jsonb values too, no biggie. But it’s less trivial, I must admit.
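A rough sketch of what that looks like, assuming the links sit in a single jsonb column named links on the users table:

```elixir
# Migration: a GIN index so containment queries (@>) on the jsonb column
# can use the index instead of scanning every row.
defmodule MyApp.Repo.Migrations.IndexUserLinks do
  use Ecto.Migration

  def change do
    create index(:users, [:links], using: :gin)
  end
end

# Then a containment search, e.g. "which users have a link to this URL?":
import Ecto.Query

target = Jason.encode!([%{"url" => "https://example.com"}])

MyApp.Repo.all(
  from u in MyApp.Accounts.User,
    where: fragment("? @> ?::jsonb", u.links, ^target)
)
```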

2 Likes

Embedded schemas usually end up being stored as a JSON type in the database. One thing to consider is the cost of serialization/deserialization. I have personally seen cases where the cost of this outweighs the cost of a properly indexed join.

6 Likes

Oh, that’s an excellent point.

Joe Armstrong used to talk about this as a general factor in slow computing, but of course we must comply :wink: I wonder if there is anywhere else our industry abuses constant serialization and deserialization :thinking:

The main metric I use for embedded vs table is simplicity and convenience. Scale is too big a can of worms to worry about up front, although it’s absolutely good to think about. My rule of thumb with databases is “just don’t do anything obviously stupid” :slight_smile: You never know how requirements will change or what your usage will be like. For example, I worked on a very large database, but it was a walled-off ERP system with only a few thousand people across under 200 accounts who used it every day. We had some slow queries that weren’t worth optimizing since they only ran once per day per account. As far as requirements go, maybe you will need relationships between links in the future, you never know! And if you’re making enough money for that to be a problem and you went with embedded, it’s actually not the biggest chore to switch to tables.

I could probably find 50 examples if I really tried, but consider: is HTTP a good way to encode forms? It’s not. JSON has already been mentioned – it’s a waste.

There is stuff like FlatBuffers or Rust’s RON – or a subset of ASN.1’s BER encoding – that has much lower overhead. But there is still some.

To me, though, the IT industry overdid the whole “everything must be text” thing. PowerShell, with its structured data + piping, has the much better idea.

And if you always need a human to be able to review the data, why not finally author tools that can translate the binary form to text and vice versa, and just be done with it, now and forever?

Too mainstream. Everyone grumbles and goes back to work. :frowning:

1 Like

I was joking :upside_down_face: I was talking about JSON over HTTP :sweat_smile: There are lots of younger people and older but new-to-web-development folks who actually aren’t aware that you can send HTML over HTTP.

1 Like

My expectation is that the multiple-tables approach would outperform the embedded-schema approach and put less load on the database. RDBMSes, especially mature ones like PostgreSQL, are highly optimized for dealing with row-oriented data. That is, after all, their purpose in life. As a rule of thumb, absent any reason of substance, don’t try to second-guess the normal mode of working with the tool: use that mode until there is a demonstrable reason to go to some alternative approach.

But let’s think through some of the trade-offs.

Note that my knowledge of the following can be described as “hand-wavy” as I’m just a frequent user of databases and not a builder of them… expect some errors in fact or understanding in what follows.

First, let’s consider how PostgreSQL deals with rows at a lower level.

Under ordinary circumstances, PostgreSQL stores rows in 8K chunks or “pages” in the file that makes up a table; in addition, PostgreSQL aims to fit at least 4 rows in each page. This means that the data (and overhead) of each row in the table is effectively limited to about 2K in size. Of course, PostgreSQL allows rows that are far bigger than 8K, let alone 2K, and the way it does that is via the TOAST (The Oversized-Attribute Storage Technique) mechanism: larger values are compressed and/or stored out of line in a separate file on disk (PostgreSQL: Documentation: 16: 73.2. TOAST).

If your embedded schema JSON object pushes past those in-table storage limits, you’re going to be adding values to a second TOAST file anyway; in some ways this acts like a de facto second table. To be fair, TOAST works really well in practice and I rarely sweat whether I’m going to be TOASTing attributes or not, but if we’re splitting hairs at this level, we’re in the range where these questions matter. TOASTed values do have a property that can make updating them pretty expensive: the entire value is duplicated on update.
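If you’re curious whether your jsonb values are actually big enough to be TOASTed, a quick-and-dirty check (assuming the embedded links live in a links column on users) is something like:

```elixir
# pg_column_size/1 reports the stored (possibly compressed) size in bytes;
# values pushing much past ~2K are candidates for out-of-line TOAST storage.
Ecto.Adapters.SQL.query!(
  MyApp.Repo,
  "SELECT id, pg_column_size(links) AS bytes FROM users ORDER BY bytes DESC LIMIT 10"
)
```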

So, let’s say that your embedded schema entries are stored as an array of JSON objects on each user record. If a user has a lot of entries, adding/deleting/updating a single entry will result in the entire collection being re-written. On the other hand, I’ll bet that the data you’re talking about, stored as regular rows in a table, will pretty easily let each entry sit as an individual row within the 8K pages of the primary table file. Inserts are just new entries in the table, updates duplicate only the individual row being updated, and deletes simply mark the deleted record as deleted. No re-writing of the big JSON value for each change.
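To make that contrast concrete, here’s a sketch with hypothetical schema names: the embedded version has to put the whole list back, while the table version touches only the one row.

```elixir
defmodule MyApp.LinkUpdates do
  alias Ecto.Changeset
  alias MyApp.Repo

  # Embedded version: changing one entry means putting the whole list back,
  # so the entire jsonb value is rewritten (and possibly re-TOASTed).
  # Assumes the embed was declared with on_replace: :delete.
  def rename_embedded_link(user, url, new_title) do
    links =
      Enum.map(user.links, fn
        %{url: ^url} = link -> %{link | title: new_title}
        link -> link
      end)

    user
    |> Changeset.change()
    |> Changeset.put_embed(:links, links)
    |> Repo.update()
  end

  # Table version: only the single affected row is updated.
  def rename_link(%MyApp.Links.Link{} = link, new_title) do
    link
    |> Changeset.change(title: new_title)
    |> Repo.update()
  end
end
```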

If table data is actually giving you a performance problem, there are typically better ways to address that in an RDBMS than denormalizing the data into blob-like structures. Alternative indexing strategies (BRIN, GiST, etc.) or techniques such as partitioning are often better than trying to devise one-off methodologies.
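For example, a couple of those strategies expressed as an Ecto migration (table and column names are hypothetical):

```elixir
defmodule MyApp.Repo.Migrations.TuneLinksIndexes do
  use Ecto.Migration

  def change do
    # A composite index covering the hot lookup path: a user's links, newest first.
    create index(:links, [:user_id, :inserted_at])

    # BRIN: a very small index that works well on large, append-mostly tables
    # where the indexed column correlates with physical row order.
    create index(:links, [:inserted_at], using: :brin)
  end
end
```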

Really, the only time I would suggest using JSON/JSONB (and similar complex data representations) is when the data itself is genuinely unstructured; otherwise, use tables. For many applications, genuinely unstructured data can be rarer than expected.

Here’s some further reading that might help you understand some of the dynamics better:

7 Likes

I was hoping you’d respond to this thread :smiley: Always love reading what you have to say about databases.

1 Like