In a Phoenix app, we will be serving multiple schools, and each school has its own DB.
Using the same routes, we have to pick a specific DB depending on, for example, the school_id in the request.
The question is: what is the recommended way to accomplish this with Ecto, given that we are using Repo.insert_all?
Also, when does an Ecto repo get connected? Is it when the Phoenix server starts, and does it stay connected after that, or does it connect on each request?
There are two ways of doing what you want. The first is to open N connections for N schools. They will be persisted, each one with its own connection pool. You'll have to refer to each by its own Repo module, for example School1Repo, School2Repo etc. This blog post explains in detail how to do it:
This may work if you have a fixed number of schools.
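Roughly, and only as a sketch (the module names and the repo-picking helper are made up, and each repo still needs its own database entry in config), the multi-repo approach could look like this:

```elixir
# Ecto 3-style repos, one per school database (hypothetical names).
defmodule MyApp.School1Repo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres
end

defmodule MyApp.School2Repo do
  use Ecto.Repo, otp_app: :my_app, adapter: Ecto.Adapters.Postgres
end

defmodule MyApp.Tenancy do
  # Maps the school_id from the request to the repo owning that school's DB.
  def repo_for(1), do: MyApp.School1Repo
  def repo_for(2), do: MyApp.School2Repo
end

# In a controller or context:
#   repo = MyApp.Tenancy.repo_for(school_id)
#   repo.insert_all("students", rows)
```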
Generally, if you use PostgreSQL, there is a better solution for multitenancy. PostgreSQL supports schemas, and Ecto has built-in schema prefix support. This can be combined with a good helper library: https://github.com/Dania02525/apartmentex
With the second approach, you use a single database, but it is logically partitioned using PostgreSQL schemas.
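For reference, here is a minimal sketch of the prefix approach, assuming Ecto 3 where repo operations accept a `:prefix` option (the schema name "school_123" is just an illustration); Apartmentex builds on the same mechanism and adds helpers for creating and migrating tenant schemas:

```elixir
import Ecto.Query

# Query a table inside the tenant's PostgreSQL schema via :prefix.
students =
  from(s in "students", select: %{id: s.id, name: s.name})
  |> MyApp.Repo.all(prefix: "school_123")

# insert_all also honours :prefix, so writes land in the same schema.
MyApp.Repo.insert_all("students", [%{name: "Ada"}], prefix: "school_123")
```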
However, Ecto has no capability whatsoever for joining across schemas (and actively prevents it), so you cannot, for example, have one schema holding all account-related information and then join it to other parts.
You mean tasks as in Mix tasks? In that case, yes. Each one would start its own Erlang VM instance, load the application and handle its own pool, if started from the shell. I believe you can work around this by starting Mix tasks from within an already started application, but you'd need Mix in either case (and it's often not available).
If you need to run some tasks on production, periodically, from the same VM instance, sharing the same connection pool as the web app, I'd recommend you have a look at the quantum library:
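A rough idea of what a quantum setup looks like (the job and module names are invented, the exact `use` line differs between quantum versions, and the scheduler also has to be added to your supervision tree):

```elixir
# config/config.exs - a hypothetical nightly job
config :my_app, MyApp.Scheduler,
  jobs: [
    # run MyApp.Reports.generate/0 every day at 02:00
    {"0 2 * * *", {MyApp.Reports, :generate, []}}
  ]

# lib/my_app/scheduler.ex
defmodule MyApp.Scheduler do
  use Quantum, otp_app: :my_app
end
```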
Sorry, I meant tasks as in Task.async, not Mix tasks. Think of multiple async tasks running, each one needing to store a record in the DB. Would there be a connection for each write to the DB? And then, how would those connections get closed?
There's a pool of connections. When a process needs to access the database, it checks out a connection, does its business and checks the connection back into the pool once finished.
Connections are not closed, unless an error happens - they are kept alive for as long as possible since establishing a connection is a costly process.
The way you describe is how the connection pool works in Rails/ActiveRecord. But Ecto is different: it does not keep a connection reserved for a particular process throughout the whole life of the processes that use it.
Instead, Ecto checks a connection out of the pool only when it wants to perform an operation, and immediately afterwards checks it back into the pool.
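To illustrate (just a sketch, not anything from your code): each task below borrows a connection from MyApp.Repo's pool only for the duration of its insert, so concurrency is effectively bounded by the repo's :pool_size rather than by opening new connections per task:

```elixir
# Every Task.async process checks a connection out of the repo's pool
# only while insert_all runs, then the connection goes straight back.
rows
|> Enum.map(fn row ->
  Task.async(fn -> MyApp.Repo.insert_all("events", [row]) end)
end)
|> Enum.map(&Task.await/1)
```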
While schemas in PostgreSQL are appealing for multitenancy, it's not recommended at all, because it can cause backups to take a long time if you have a large number of schemas, tables and data.
The most common use case for using multiple schemas in a database is building a software-as-a-service application wherein each customer has their own schema. While this technique seems compelling, we strongly recommend against it as it has caused numerous cases of operational problems. For instance, even a moderate number of schemas (> 50) can severely impact the performance of Heroku's database snapshots tool, PG Backups.
While this specifically says that's the case for Heroku's tool, they use pg_dump, which also won't back up large schemas in a timely fashion. So that rules out schemas.
Judging by what's been discussed in this thread, it looks like the multi-database approach is out too.
So what I recommend is that you look at how you could accomplish this using standard foreign key scoping. If you're rigorous enough in your testing that the scoping is in place where it needs to be, I don't see any problem with that approach.
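A sketch of what that scoping can look like (schema and field names are made up): route every query through one helper that adds the tenant's foreign key, so it cannot be forgotten:

```elixir
defmodule MyApp.Scoped do
  import Ecto.Query

  # Constrain any queryable to a single school via its foreign key.
  def for_school(queryable, school_id) do
    from(q in queryable, where: q.school_id == ^school_id)
  end
end

# Usage:
#   MyApp.Student |> MyApp.Scoped.for_school(school_id) |> MyApp.Repo.all()
```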
Schemas at my past few jobs have been used to separate "concerns", so there is a limited number of them; however, being able to join across them is expected, which Ecto is utterly broken on…
Nah, for actual scaling I use PostgreSQL with a proxy that routes connections out to the different databases behind it.
Ah, I never saw that because our backups are configured to back up each schema individually, which seems to solve the problem you describe at the same time.
I don't think that a slow backup process should be the definitive reason to declare this approach "not recommended". I would go with this by default and solve any problems I may have later.
I've got a couple of dozen accounts, with some schemas as big as 3 GB. The luck factor in my case may be that those are business users from the same time zone, and we do backups at midnight, when there is very little traffic. Or you simply have way more users/data to back up :).
We're also doing it simply because clients want to have these daily backups accessible at any time, and to have an off-site backup. We do have another layer of backup for our own purposes at a different level: archiving WAL entries continuously with wal-e https://github.com/wal-e/wal-e.
This is brilliant, since it allows point-in-time recovery of the whole database. We had a security incident a few years ago where this helped a lot in tracking down what happened.
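In case it helps anyone, the wal-e side is essentially the archive_command hook from its README; the env dir path below is just the usual example, not our exact setup:

```
# postgresql.conf - continuous WAL archiving via wal-e
wal_level = archive            # or hot_standby
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
archive_timeout = 60
```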