Database failure bringing down application

So we’re using MongoDB with our app. If I shut down the database, Elixir will crash. It tries to reconnect multiple times and crashes each time, which eventually brings down our app.

What I’d really like is for it to retry connecting every 30 seconds or so, and then bring the application back up once Mongo is available again.

What’s the recommended way to deal with this? I imagine this must be a very common problem, but I haven’t seen anything about it. I don’t think I really want my app to run when Mongo is offline, but we could potentially work from the cache for a while.

Maybe run the database driver, etc. in its own supervision tree, with options that suit your needs?
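
As a rough sketch of what that could look like: a dedicated supervisor for the database processes with a more generous restart limit than the default (3 restarts in 5 seconds). The module name `DatabaseSupervisor` and the limit of 10 restarts per 60 seconds are assumptions for illustration, not something from the driver docs. Note that this only raises the threshold; children are still restarted immediately, so a long outage can still exhaust it.

```elixir
defmodule DatabaseSupervisor do
  # Hypothetical supervisor that isolates database-related children
  # from the rest of the application's supervision tree.
  use Supervisor

  def start_link(children) do
    Supervisor.start_link(__MODULE__, children, name: __MODULE__)
  end

  @impl true
  def init(children) do
    # Assumed limits: tolerate up to 10 child restarts within 60 seconds
    # before this supervisor itself gives up and exits.
    Supervisor.init(children,
      strategy: :one_for_one,
      max_restarts: 10,
      max_seconds: 60
    )
  end
end
```

You would then list `{DatabaseSupervisor, db_children}` in your application's top-level children, where `db_children` are whatever child specs your driver needs.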

“Circuit Breakers” work well for this problem. I tried a couple different ones and settled on

It is basically a new form of supervisor that allows its children to die repeatedly without bringing down the system.

A lot of drivers have been written to use the Erlang supervision tree to
get restarted if they crash. However, this assumes the database will
always be up and running, which isn’t always the case (quite the
opposite, in fact).

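Once you need to deal with errors in the database, having a connection directly
under a supervisor doesn’t make sense: repeated connection failures exceed the
supervisor’s restart limit and take the supervisor down too.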

The solution to this depends a little bit on how the underlying
connection and pooling is implemented, but generally you’d hold the
connection in a gen_server that traps exits from the connection. Your
gen_server then keeps track of the reconnection logic and similar.

Finally, the gen_server you write to keep track of this runs under a
supervisor so that it gets restarted in case it crashes.


supervisor -> gen_server with re-connect logic -> connection process
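
A minimal sketch of the middle box in that chain, to make it concrete. Everything named here is an assumption for illustration: the module name `ConnectionKeeper`, the `:connect` option (a zero-arity function that starts a linked connection process and returns `{:ok, pid}` or `{:error, reason}`), and the `:retry_ms` option, which defaults to the 30 seconds mentioned in the question. It does not call any real driver API; you would wrap your driver's start function in the `:connect` fun.

```elixir
defmodule ConnectionKeeper do
  # Hypothetical GenServer that owns the connection process and
  # retries on a fixed interval instead of crashing.
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: Keyword.get(opts, :name, __MODULE__))
  end

  # Returns {:ok, conn_pid} or {:error, :disconnected}.
  def checkout(server \\ __MODULE__), do: GenServer.call(server, :checkout)

  @impl true
  def init(opts) do
    # Turn exits of linked processes into {:EXIT, pid, reason} messages.
    Process.flag(:trap_exit, true)

    state = %{
      connect: Keyword.fetch!(opts, :connect),
      retry_ms: Keyword.get(opts, :retry_ms, 30_000),
      conn: nil
    }

    # Connect asynchronously so a dead database never blocks startup.
    send(self(), :connect)
    {:ok, state}
  end

  @impl true
  def handle_call(:checkout, _from, %{conn: nil} = state) do
    {:reply, {:error, :disconnected}, state}
  end

  def handle_call(:checkout, _from, state) do
    {:reply, {:ok, state.conn}, state}
  end

  @impl true
  def handle_info(:connect, %{conn: nil} = state) do
    case state.connect.() do
      {:ok, pid} ->
        {:noreply, %{state | conn: pid}}

      {:error, _reason} ->
        # Database unavailable: schedule another attempt.
        Process.send_after(self(), :connect, state.retry_ms)
        {:noreply, state}
    end
  end

  # Already connected; ignore stale retry messages.
  def handle_info(:connect, state), do: {:noreply, state}

  # The current connection died: mark as disconnected and retry later.
  def handle_info({:EXIT, pid, _reason}, %{conn: pid} = state) do
    Process.send_after(self(), :connect, state.retry_ms)
    {:noreply, %{state | conn: nil}}
  end

  # Exit from some other linked process (e.g. a failed connect attempt).
  def handle_info({:EXIT, _pid, _reason}, state), do: {:noreply, state}
end
```

Callers use `ConnectionKeeper.checkout/1` and decide for themselves what to do on `{:error, :disconnected}`, for example serving from the cache for a while.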

If you have a process pool you can either have each process in the pool
take care of re-connect logic or have the process pool simply remove
processes from the pool if they cannot connect.

Does your app work without the database? Is it useful in that state? If not, I’d say crashing the app on lack of DB is totally expected/correct.

If you expect occasional DB outages, then maybe it makes sense to build fault tolerance into the DB layer.

There’s a fault in the replica set support of the driver that brings it down when the database is not present. DBConnection itself (the low-level library that powers the postgrex, mariaex and mongodb drivers) in general prefers the backoff-and-reconnect approach instead of crashing. It’s not an exceptional situation for a database connection to get disconnected, and this in itself shouldn’t bring the system down. How the rest of the system is architected around the fact that the database might not be available is a completely different story.
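
To illustrate the general idea of backoff-and-reconnect (this is a generic sketch, not DBConnection's actual implementation): instead of retrying at a fixed interval, the delay doubles after each failed attempt up to a cap. The module name `Backoff` and the 1 second / 30 second bounds are assumptions.

```elixir
defmodule Backoff do
  # Hypothetical helper: capped exponential backoff.
  # attempt is the number of failed attempts so far (0 for the first retry).
  def delay(attempt, min_ms \\ 1_000, max_ms \\ 30_000)
      when is_integer(attempt) and attempt >= 0 do
    # Double the delay per attempt, never exceeding max_ms.
    min(max_ms, min_ms * Integer.pow(2, attempt))
  end
end
```

A reconnecting process would keep an attempt counter in its state, wait `Backoff.delay(attempt)` milliseconds before the next try, and reset the counter to 0 after a successful connect.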