Dynamic database creation for SaaS product platform : Need help

mlg-hub · October 19, 2022, 12:10pm

I am facing an elixir project design challenge… my goal is to create an online store where users can create and manage their stores(stock, users, products ) independently on a share platform.
How should i structure the database ? can every store has its separated database to avoid data corruption or they should all be on a single database and be identified by a store_id column on every table of the database…
can i dynamically spawn a database for each store ? i would like to keep using postgres and not use ets for large storage of data. Thank you

derpycoder · October 19, 2022, 12:26pm

The feature you are looking is called multi tenacy and it can be achieved in two ways:

TL;DR: In query prefix multi tenacy, each tenant have their own tables and data, which provides the security guarantees we need. On the other hand, such approach for multi tenancy may be too expensive, as each schema needs to be created, migrated, and versioned separately.

Therefore, some applications may prefer a cheaper mechanism for multi tenancy, by relying on foreign keys. The idea here is that most - if not all - resources in the system belong to a tenant. The tenant is typically an organisation or a user and all resources have an org_id (or user_id) foreign key pointing directly to it.

Multi tenancy with query prefixes

If you maintain separate db for each customer, you are going to have to do migration for each one separately, which gets unwieldy.

See: Ecto Hex Docs on Multi Tenacy with Query Prefixes

Multi tenancy with foreign keys

It’s better to go with the foreign key approach.

See: Ecto Hex Docs on Multi Tenacy with Foreign Keys

Just implement good authorisation rules, so the boundary is clear between tenants.

P.S. Checkout Plausible Analytics Git Repo, they have lots of customers, and foreign key multi tenacy.

mlg-hub · October 19, 2022, 1:04pm

Thank you @derpycoder for your clear and detailed answer. it is exactly what i was looking for.

riebeekn · October 19, 2022, 1:09pm

If you want to use schemas instead of foreign keys, there is also the Triplex library: https://github.com/ateliware/triplex. I’ve worked on a few projects that used it, works well and is another option.

fram · October 19, 2022, 4:27pm

You should take a look at the LiveSaaSKit project it provides a good starting point boilerplate for multi-tenancy using the Foreign Key approach.
Some notable features are:

mix tasks for creating tenant resources.
user impersonation
admin interface to browse tenant account’s resources.
some commonly used library already setup for GraphQL(Absinthe), Jobs(Oban), Tailwind/AlpineJS

I have been using it as a starting point for my projects and I had a good experience with it.

dimitarvp · October 19, 2022, 4:32pm

That’s a great link, thank you.

mlg-hub · October 20, 2022, 8:57am

Thank you

mlg-hub · October 20, 2022, 8:57am

I will … thank you too

fmn · October 20, 2022, 10:28am

successful SaaS means providing services at scale. if you provide services at scale, you need to have healthy dose of automation around ops of your platform.
having separate databases is an advantage, actually.

atomic, separate resource - a database - gives you great flexibility and control.

it’s easier to maintain backups.
it’s easier to purge customer data from the platform.
it’s easier to divide and conquer upgrades.
you are less likely to hit issues related to database size (all things related to vacuum, data partitioning, data volume).
database downtime has less impact.
it’s easier to move database in case of need…
… and so on and so forth.

everything can be postponed, but addressing those issues today can buy you time and capacity to work on features, and save you and your customers lots of issues tomorrow

derpycoder · October 20, 2022, 11:17am

All your points are correct.

My knowledge on this is however entirely theoretical.
I just read lots of stories on hackernews about all the pain points of going, multi db route.

I simply learnt from others mistake, so I pointed out both of the approaches for anyone to weigh them before proceeding.

Here’s a table for comparison:

Sourced from: Comparison of approaches to multitenancy in Rails apps | Arkency Blog

	row-level	schema-level	db-level
Tenant setup time	Just create a record	Slower (need to create schema, create tables)	Even slower + possible operational overhead
Leaking data between tenants	If you forget a WHERE clause	Get a couple things right and you’ll be fine	You’d need to try hard to get one
Invasiveness	tenant_idcolumns and filters all over the code	Fine	Fine
Need shared tables or merging data across tenants	No-brainer	Can still be done in SQL	In-app only, cannot do in SQL
Running DB migrations	O(1)	O(n)	O(n)
Conventionality	Standard Rails	Occasionally at odds with Rails assumptions
Additional costs	Not really	Not really	What if pricing depends on the # of DBs?
Operational overhead	No	Occasionally. You have an ever growing number of db tables.	You now have a lot of databases
Complexity	tenant_idkeys everywhere	an exotic PG feature & stateful search_path
Where possible	Pretty much anywhere	Are you on a managed DB? Double check if all features and ops possible	Got rights to create databases? Can be done on the fly?
Cost of switching	Just set a variable	Set the search_pathfor the current db connection	You need to establish a separate db connection
Extract a single tenant’s data	Cumbersome	Easy	No-brainer
Per-tenant DB structure customizations	Problematic	Possible	Even more possible

Quick reasons to pick one or another

Source: Comparison of approaches to multitenancy in Rails apps | Arkency Blog

Condition	Recommendation
A lot of tenants?	consider row-level
A lot of low-value tenants? (like abandoned accounts or free tiers)	consider row-level
Less tenants and they’re high-value?	schema-level more viable
Anxious about data isolation? (ensuring no data leaks between tenants)	consider schema-level
Customers might require more data isolation for legal reasons?	consider schema-level or even db-level
On a managed or cloud hosted database?	if you wanna go for schema-level make sure it all works for you
Multitenantizing an existing single-tenant code base?	schema-level might be easier to introduce
Greenfield project?	row-level more viable
Need to combine a lot of data across tenants	schema-level possible, but row-level is a safer bet
Some customers may have exceptional performance/capacity requirements	Consider enabling db-level

Links to discussion on this topic:

Everywhere you see, people recant how terrible it was to separate everything:

Excerpts

Of the pain points.

You can’t really write reports or BI dashboards in ORMs, unless you want terrible performance. So for simple queries, fine, but anything else this won’t work.

Interestingly, I’ve only seen the opposite, going back to the 90s. Eventually managing N schemas ends up being difficult, and you’re also fighting against the database.

We ran a multi-tenant SaaS product for years w/ a schema-per-tenant approach. For the most part it all worked pretty great.

We ran into issues here and there but always found a way to work around them:

Incremental backups were a pain because of needing to lock so many objects (# of schemas X # of tables per schema).

The extra code to deal w/ migrations was kinda messy (as you mentioned).

Globally unique IDs become the combination of the row ID + the tenant ID, etc…

For us though the real deal-breaker turned out to be that we wanted to have real foreign keys pointing to individual rows in tenant schemas from outside of the tenant schema and we couldn’t. No way to fix that one since with multi-schema the “tenant” relies on DB metadata (the schema name).

We ended up migrating the whole app to RLS (which itself was a pretty interesting journey). We were afraid of performance issues since the multi-schema approach kinda gives you partitioning for free, but with the index usage on the RLS constraints we’ve had great performance (at least for our use case!).

After quite a bit of time working with both multi-schema & RLS I probably wouldn’t go back to multi-schema unless I had a real compelling reason to do so due to the added complexity. I really liked the multi-schema approach, and I think most of the critiques of it I found were relatively easy to work around, but RLS has been a lot simpler for us.

Multi-tenant databases feel like the result of a decision where a team either doesn’t know how to architect data models well or doesn’t want to put the effort into doing so. Referential integrity was solved 50 years ago. To demand that one’s data not be commingled with someone else’s in a database is as arbitrary as demanding it not be transmitted by the same pool of network connections used by the server. Our data must not reside within the same physical disk storage or memory as that used by other customers!

I worked for a SaaS that did that. We had 1000’s of clients. Migrations would take all afternoon. We had custom connection pools, custom migration runners, and other weird stuff (this was about 10 years ago.) It was way too complicated, especially since most of the tenants had very little data.

And on and on it goes!!

It’s like Microservices vs Monolithic discussion all over again.

I am team Monolith, so my answers are biased. I feel microservices are overkill same as multi db or query prefix approach.

Also don’t be afraid to use the right tool for the right job.

Like Meilisearch for faster text search.
Or running MinIO if you want object storage.
Or CockroachDB if you want distributed Postgres like DB.

I’m using all 3 and still going to add Rqlite for storing data close to application, in memory. For config management that doesn’t belong on main db.

Add caching for performance, use object storage for backups, use modern distributed db for scale and less downtime, …

fmn · October 20, 2022, 11:51am

i like that matrix a lot! i don’t agree on all things, however.

Tenant setup time: i don’t think it’s true it’s slow. DB and schema creation should not take much time at all. you are more likely to have it slower in case of single, big DB!

Running DB migrations: are fastest for db-level column.
you can:

you can parallelize migrations at will.
less issues with locking.
potentially lower data volume per DB.

Complexity: for common cases, db-level column has lowest schema complexity among all three options.

Additional costs: i don’t feel like it’s fair to add such “what if” exclusively in this line, this column

Operational overhead: in my opinion you mistook operational overhead with number of databases. operational overhead is more related to complexity of the task and amount of work it causes. having n homogenous databases which can be handled with exactly the same tool, in parallel, is not causing any extra overhead. it’s opposite.

it’s much more difficult to migrate “big”, “high trafic” DB used by hundreds of clients, while conforming to SLAs without causing any operational problems, than to run simpler tool n times against batches of small DBs in parallel.

Need shared tables or merging data across tenants: this should be not advertised as feature you want to have by default. i view it as questionable. i prefer DB and schema to be implementation detail by default, not an interface.

fmn · October 20, 2022, 12:05pm

by the way: let’s not mix together separate DB server instance with separate DB. i am focusing on separate DB.

sbuttgereit · October 21, 2022, 8:17pm

As I sit here in the jury duty room awaiting… whatever… I thought I’d put together some comments on this.

These resources are pretty good and get to the gist of things. I think you could do much worse than to use this list for guidance, though I don’t agree with some things said; this disagreement is largely disagreement in emphasis or degree rather than specific fact. So you know where I’m coming from, most of my technical work has been in “Application DBA” kinds of capacities rather than more standard development roles. Also, the application I’m building is using the database per tenant model; this is not for everyone and there specific usage and requirements reasons which both suggest this model and allow for it (I believe).

My comments will assume PostgreSQL is the target. Other databases can have different features and constraints and advice for PostgreSQL will not necessarily be valid for other databases.

TL; DR… Before getting into remarks on the post I’ll give my advice to someone starting out on a new multi-tenant project. I think, based on what I know now about multi-tenancy using the Elixir ecosystem, the default choice for starting out is clearly using schemas (“prefixes”) for tenant separation. The trade offs between ease of implementation and assumptions made in the ecosystem of libraries and frameworks seem to clearly favor this approach. I think you’d only really use a different tenancy model if you had good answers to “Why not just use schemas?” Also note that all of these discussions/advice/forum discussions assume a smaller scale service (as they should); as scale increases the trade-offs and viable/necessary techniques for multi-tenant data retention changes.

Tenant setup time: I don’t completely disagree with this point, but the difference in time between schema-level and db-level suggested I think is exaggerated and probably isn’t significant in the vast majority of cases… and there are ways where you might be able to get the db-level faster than the schema-level, too. For example, creating a template database that you keep fully migrated and then using that template to create the new tenant databases can bypass the logical migration process; that’s basically just a file copy at create time which should be faster than logical recreation of the database via migrations or database dumps. I can’t think of a way to do this kind of copy for the schema based model off the top of my head.
Need shared tables or merging data across tenants: And specifically “In-app only, cannot do in SQL”. This overstates the reality. PostgreSQL has the postgres_fdw foreign data wrapper (PostgreSQL: Documentation: 16: F.38. postgres_fdw — access data stored in external PostgreSQL servers) which allows you to connect two databases together and treat the “remote” database as though it’s part of the “local” database (a simplification) so you don’t necessarily have to do things in-app. It is an extension, but one bundled with the PostgreSQL distribution and a quick search shows that it’s also available in AWS RDS and Azure Database for PostgreSQL.
Conventionality: I might be wrong about this, but the friction to implementing the db-level vs the schema-level feels about equal. With schema-level multi-tenancy I have to be sure that the right prefix is used, but that’s not much different to being sure the right dynamic repo is used for multi-database based tenancy. The row based model requires much less of Ecto, but much more of your own coding to get right.
Operational overhead: This is not a well considered point. For row level, you’ll start to hit scalability constraints on hot tables sooner than either the schema or db level approaches… that’s not to say you’ll necessarily get there, but if you’re going to it will be sooner. Things like table/index bloat and vacuuming problems will become apparent sooner, maybe even the need to look at partitioning strategies if your service starts to gain traction (and before things like sharding). Both the schema and db level approaches will have those same problems, but you’ll get to those problems later since the data is already spread across more files than in the row based approach… so I don’t think operational overhead is “no” for row based tenancy. As for the difference of schema level vs. database level? So what if I have a lot of databases? In the schema model I have a lot of schemas… (in the row model I’ll have a lot of mixed tenant records). At a logical level, the distinction between “database” and “schema” is pretty small. The one difference between them that really matters is about database connections which will be mentioned later. Aside from connections the operational overhead is on par, and the db-level may provide greater flexibility when dealing with early scaling issues.
Complexity: I’m not sure that schemas are really that exotic, and not just a PG feature though it’s rather more frictionless than some of the other DB vendors that support the concept. The greatest complexity here would be related to the ecosystem de facto expectation that you’ll do multi-tenancy with schemas. Personally, I didn’t really consider the row based model precisely because of the complexity of getting it right and that the burden was on me to get it right.
Cost of switching: This is dead on correct and is perhaps a larger issue than represented on the list. PostgreSQL connection cost is high and forces you into third party pooling solutions earlier than you’d hit with other database vendors. Much very good work has been done here recently to improve the story, but it’s still not great. If you have a high tenant count, the number of database connections you’ll need can quickly swamp the resource constraints of the database server. This by itself disqualifies the db-level tenancy model for many types of services, particularly B2C type services; you could connect/disconnect as needed from each database to alleviate some of this resource contention, but you’d introduce some unwanted latency into your application where row-level and schema level approaches don’t suffer this problem.

Anyway, I do agree with the general idea that for multi-database you need to expect low tenant counts where each tenant has higher value data. Schema level gets you a lot of the mutli-database benefits without the early connection scaling issues and is better supported in the ecosystem. I personally wouldn’t consider row level based tenancy until I knew that I would be needing to slice up data stores differently than just by tenant (regional sharding, etc.) the extra development complexity of row level access seems like it better be worth it.