Fault Tolerance: Failover/Takeover on Webservers

OvermindDL1 · July 5, 2017, 4:21pm

I agree there too, which is why being the front end is all that nginx does; it never has issues. ^.^

cmkarlsson · July 5, 2017, 9:11pm

Correct. You can’t rely on distributed Erlang over WAN, so another solution is needed for these cases. Geographical redundancy is in fact super hard. It is OK if you have a stateless web application which can be hosted anywhere in the world, but as soon as you’ve got state that needs to be consistent it is really hard.

There is no RDBMS that offers a painless solution to this. Even Oracle with its GoldenGate technology probably introduces more problems than it solves.

Not really. You have two hardware load-balancers to get redundancy. These have IP failover using a protocol such as VRRP or CARP. They are then of course connected to different physical internal networks so you don’t have a Single Point of Failure there either.

However, High Availability comes at a cost, and there is a simple formula: if the cost of addressing a problem is higher than the risk times the estimated cost of the outage, you generally don’t bother. Protection is only worth it when:
risk * cost-of-outage > cost-of-protection
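
To make the formula concrete, here is a minimal Python sketch. The function name and all the numbers are made up for illustration; "risk" is assumed to be the probability of the outage over whatever period the protection cost covers.

```python
def protection_worthwhile(risk, cost_of_outage, cost_of_protection):
    """Return True when risk * cost-of-outage > cost-of-protection.

    risk: assumed probability (0.0 to 1.0) of the outage in the period
    cost_of_outage: estimated cost if the outage happens
    cost_of_protection: cost of guarding against it over the same period
    """
    return risk * cost_of_outage > cost_of_protection


# Hypothetical numbers: a rare outage does not justify expensive protection...
rare = protection_worthwhile(risk=0.05, cost_of_outage=200_000,
                             cost_of_protection=25_000)

# ...but the same protection pays off if the outage is likely.
likely = protection_worthwhile(risk=0.5, cost_of_outage=200_000,
                               cost_of_protection=25_000)
```

With these made-up figures the expected loss in the first case is well below the cost of protection, so you would not bother; in the second it is well above, so you would.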

And, in my experience, high availability introduces more things that can go wrong. From what we’ve seen at a number of our customers, the most common cause of outages is mismanaged database High Availability.

I am looking at you: Oracle Data Guard, Oracle RAC, MySQL Galera Cluster, HADB.

For a database I’d recommend a master database with async log shipping to a standby and a manual failover in case something goes wrong. The important thing is that you have a well-tested procedure for when that happens.

And in terms of Erlang, I think the most common thing still is to have a couple of redundant entry nodes distributing to a larger internal cluster.
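
As a rough illustration of the entry-node idea, here is a minimal sketch. It is in Python rather than Erlang, and the class, method, and node names are all hypothetical; a real setup would add health checks and actual request forwarding. Each entry node keeps its own view of the internal cluster and hands work to the next healthy node, round-robin.

```python
class EntryNode:
    """Hypothetical entry node that distributes work to internal nodes."""

    def __init__(self, name, backends):
        self.name = name
        self.backends = list(backends)      # internal cluster nodes
        self.healthy = set(self.backends)   # nodes currently believed up
        self._next = 0                      # round-robin position

    def mark_down(self, backend):
        # Called when a (hypothetical) health check fails.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Called when a known node passes its health check again.
        if backend in self.backends:
            self.healthy.add(backend)

    def route(self):
        # Try each backend at most once, skipping unhealthy ones.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy internal node available")


# Two redundant entry nodes in front of the same internal cluster.
cluster = ["node_a", "node_b", "node_c"]
entry1 = EntryNode("entry1", cluster)
entry2 = EntryNode("entry2", cluster)

entry1.mark_down("node_b")
routes = [entry1.route() for _ in range(4)]
```

Because either entry node can route on its own, losing one of them does not take the cluster offline, as long as something in front (for example the IP failover mentioned above) moves traffic to the survivor.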

DianaOlympos · July 5, 2017, 9:58pm

Cassandra has some of this stuff for multi-zone geographical distribution, but it is a pain to use.

In general, none of the “classic DBs” do it well. Multi-DC failover is still a dark art, and nearly all vendors’ stuff is magic powder that does not work.