Fault Tolerance: Failover/Takeover on Webservers

OTP applications offer great possibilities: they can either run distributed on multiple servers side by side, or wait for the ‘main’ node to fail, in which case the next node takes over (and when the original node comes back online, it becomes the master again).
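
For reference, the mechanism I mean is OTP’s distributed application support, configured through the kernel application. A minimal sketch, assuming a release build where this ends up in sys.config; the app and node names here are made up:

```elixir
import Config

config :kernel,
  # Run :my_web_app on a@host1; if that node is unreachable for 5 seconds,
  # b@host2 or c@host3 starts it (failover), and a@host1 takes it back
  # (takeover) when it comes online again.
  distributed: [{:my_web_app, 5_000, [:"a@host1", {:"b@host2", :"c@host3"}]}],
  sync_nodes_optional: [:"b@host2", :"c@host3"],
  sync_nodes_timeout: 30_000
```

The application callback is then started with :normal, {:takeover, node} or {:failover, node}, so the app can tell which of these cases it is in.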

However, I am not entirely sure yet how this would look if you want to use it for your webserver.

Most importantly, there is DNS, which points a static IP at your webserver. Is it somehow possible to list multiple IPs for alternative hosts here, in case the main one fails?

Other languages’ ‘solutions’ I have seen involve using a load balancer that distributes traffic to your application, making the load balancer itself a strong single point of failure.

How would this work in practice?

3 Likes

I’m posting from mobile so I’m not going to get into a lot of detail but the answer to your question is pretty much nope, because that’s how the internet works.

Multiple DNS records aren’t failovers, so you can’t depend on that. So the IPs that your DNS records point to need to be rock solid.

So when you say other languages depend on load balancers, what I’d say is that the internet as a whole depends on rock-solid reverse proxies for uptime.

There are ways to mitigate the impact of a single point of failure, but the internet runs with that single point of failure built-in.

If anything, Erlang makes it so that at the application layer a single point of failure gets mitigated quickly and stays isolated. But that doesn’t account for system failure or network failure.

A mix of hardware and software solutions is used to handle those things, which is why we pay money to companies that will handle that for us. Services like AWS and Azure run under some crazy mix of reverse proxy solutions that handle all of that for us.

1 Like

I’m also interested in this. I think a combination of round-robin DNS and load balancers is used. Looking forward to hearing from those in the know. (Wrote this before someone in the know responded. :slight_smile: )

Are there any distributed databases that work well with Ecto? I wonder how easy it would be as an Elixir beginner to use something like rqlite without Ecto if necessary.

1 Like

In theory you could build a great custom reverse proxy in Erlang that handles fault-tolerance in a better way than most of the reverse proxies, but you’re still looking at a single point of failure.

Other languages’ ‘solutions’ I have seen involve using a load balancer that distributes traffic to your application, making the load balancer itself a strong single point of failure.

This is the way it has to be done here as well. The initial routing has to be done somehow. Generally there are two redundant load balancers handling this, or you can have two (or more) internal servers acting as “routers”, like nginx/haproxy or something.

Another alternative is to use DNS load balancing, which does round-robin selection of IP addresses behind the same name. This is not ideal because DNS doesn’t check for errors, and thanks to the DNS TTL it takes a long time to update the IP address list.
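
From Elixir you can see what such a round-robin name hands back by using Erlang’s built-in resolver (the hostname below is just a placeholder):

```elixir
# Several A records behind one name; the resolver returns all of them and
# the client just picks one. Nothing here checks whether a host is up.
:inet_res.lookup(~c"www.example.com", :in, :a)
# Returns a list of IPv4 tuples such as {192, 0, 2, 10}, and cached answers
# stick around until the TTL expires.
```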

Cloud providers, like Amazon, can of course help you here.

Erlang distributed failovers generally work best on the internal network, so you need traffic routed in there first.

1 Like

The context you described is part of why I became motivated to build an experimental DNS server with pluggable resolvers and a kind of edge-network/CDN library, to see if there are better ways to distribute workload and push it closer to consumers (closer in Internet terms). This is with the intent to:

  1. distribute compute workload to distributed servers that are close to consumers and that can scale dynamically

  2. aid in service registration and discovery from both the internet side and the erlang cluster side (aided by (1) above)

I feel like many apps follow the web server -> generic large server (or server cluster) -> database model because that’s what most tooling and infrastructure demand.

Working in Elixir/Erlang has been thought-provoking and suggests it’s worth experimenting with other architectures that distribute workloads and content better, where ‘better’ means closer to the user, with fewer users affected by potential outages, and with better perceived performance. Also where that distribution is managed more or less automatically and transparently to application code.

I realise this doesn’t provide any specific insight - but it’s an overall interesting topic and I hope to learn from the discussion.

3 Likes

The way this is done is using shared or “virtual” IPs. Two servers on the same VLAN are configured with the same IP address, but the standby keeps that interface disabled until it detects that the primary has failed (using heartbeat or some other monitoring software). Cloud providers might call this a floating or elastic IP, and you have to use their API to configure which instance has the IP at any time (DigitalOcean has an article on setting this up, for example).

If you are hosting yourself, you generally set up a load balancer this way, and it can distribute work to N pool members without being a single point of failure. Cloud providers generally do this with their load balancer offerings (or can quickly and automatically start a new load balancer instance with the IP you need).

1 Like

Thank you for all the replies so far!

Do I understand correctly, then, that if there is some kind of hardware failure that affects a whole server farm at once (internet outage, power outage, earthquake, etc.), even a distributed OTP system is not a solution?

Maybe it’s too much to ask for, of course :slight_smile:. It will still protect you against local hardware failures (malfunctions in a single server) and this might be enough for many applications.

If I understand correctly, in a load balancer -> distributed OTP system or load balancer -> "router" -> distributed OTP system kind of setup, the earlier steps are still single points of failure. Of course, I understand that the idea is that a load balancer or a routing internal server is doing a specialized task that should be simple enough not to encounter logical software errors in production, but if the server these run on encounters a hardware problem, it does not matter how distributed the system behind that first part is.

If I understand correctly what cloud providers do, they manage the load balancer -> router part for you, but of course you are now dependent on their services and you basically have the same guarantees as above (the difference being that you pay them to have a specialized team managing this part around the clock).

I am actually quite surprised that the DNS protocol is not built with fault-tolerance in mind.

Yes… and this is of course a really bad reason to keep doing it that way, because it’s not being done because it is good, but because it is the only way that is allowed.

Honestly, I don’t think that service registration is a good fit for DNS just because you can’t guarantee DNS cache expiration works like it’s supposed to and you can’t handle failure in the same way you could with a reverse proxy.

If it were me, I would be working on regional reverse proxies to handle service discovery and then getting a Start of Authority (SOA) name server that has GeoIP/regional record capabilities. Then I would outsource the name server if I could.

The takeaway I have here is that the internet itself is the problem, not our servers and/or applications specifically. They are just another cog in the running machine; sure, they can fail, but we work around that. What we can’t do is fix every cog in the machine that is the internet.

That’s not to say that your plan isn’t interesting or feasible. But there are restrictions to what DNS can do in terms of load-balancing, and even more problems pop up when you consider service registration under WAN resolved DNS (meaning that LAN DNS where you query the SOA every time is different).

Can someone explain (or point to resources about) how DNS results are (allowed to be, and are in practice) cached, and what exactly a reverse proxy does in this context?

(Is a reverse proxy more or less the same as a load balancer in this context?)

EDIT: What exactly is the role of the DNS SRV field in this picture?

Not quite. The difference is that they use a series of hardware and software solutions to accomplish this. The key here being hardware. More than likely they are using hardware instead of software to route and load balance your application at the same time.

If the hardware that is handling the load balancing fails, then the backup hardware kicks in automatically (or, more likely, the clustered hardware gets more traffic) and at worst TCP connections are dropped, though there’s a chance even that doesn’t happen.

Moreover, the types of failures you are talking about involve stateful TCP connections. Even in Erlang/OTP those connections are bound to a process, and if that process fails then you are probably returning a 500 response.

DNS is just an addressing scheme. The real details behind DNS are the authority rules. There are top-level root name servers that basically run the internet. They keep track of the TLD authority servers (.com, .org, .io all have their own authorities). Then these authority name servers keep a list of the name servers that host the A records and such.

Basically, if you were to start from scratch you go through 3 different name servers to get an A record. That’s the beauty of DNS. This system is actually fault-tolerant in the way it uses Anycast, but Anycast still assumes the endpoint is more or less functioning. The retry mechanism after that is basically trying the next server in the list. (I think that’s an implementation detail though.)
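
Out of curiosity you can poke at part of that chain from Elixir with Erlang’s built-in resolver (the domain is a placeholder):

```elixir
# The name servers that are authoritative for the zone...
:inet_res.lookup(~c"example.com", :in, :ns)
# ...and the A record(s) those servers ultimately answer with.
:inet_res.lookup(~c"example.com", :in, :a)
```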

1 Like

Every DNS record has a Time-To-Live (TTL), where the authority for that record says, “It won’t change for this amount of time.”

It used to be a super widespread problem that DNS servers would cache this by their own rules and ignore the provided TTL or hosted name servers would set a super long TTL to punish you for leaving. This doesn’t happen as much anymore but some companies/schools still use their own TTLs for caching purposes.
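
To make the caching behaviour concrete, a well-behaved resolver does something like this minimal sketch (the module and function names are made up for illustration):

```elixir
defmodule TtlCache do
  # cache is a map of name => {ips, expires_at}, using monotonic milliseconds
  def lookup(cache, name, resolve_fun, ttl_ms) do
    now = System.monotonic_time(:millisecond)

    case Map.get(cache, name) do
      {ips, expires_at} when expires_at > now ->
        # Still inside the TTL: serve the cached answer without asking anyone.
        {ips, cache}

      _ ->
        # Expired or never seen: ask upstream again and re-cache.
        ips = resolve_fun.(name)
        {ips, Map.put(cache, name, {ips, now + ttl_ms})}
    end
  end
end
```

Resolvers that “use their own TTLs” are effectively substituting a different ttl_ms for the one the authority supplied.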

In this context a reverse proxy basically sits at the supplied IP, and if the application IP changes then the reverse proxy can handle graceful shutdown of existing connections to the old application while directing new connections to the new application IP.

Yes and no. Reverse proxies are software proxies in a lot of cases. They usually accept an incoming connection, then connect to a server and pass information back and forth. That means you can do things like authenticate through a proxy or apply session handling in a proxy.
A load balancer, especially a hardware one, works at a much lower level: it will probably keep a table of some sort with origin IP and port and then route traffic based on those fields. That makes the pass-through much faster, going directly to a server.

A load balancer is basically just a table of existing connections, available endpoints, and a scheduler. A reverse proxy is an intercept between two points.
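
As a toy illustration of the “endpoints plus a scheduler” part, a round-robin pick over a list of backends can be as small as this (the addresses are hypothetical):

```elixir
defmodule RoundRobin do
  # Return the chosen backend and the rotated list to use for the next pick.
  def next([backend | rest]), do: {backend, rest ++ [backend]}
end

RoundRobin.next([{"10.0.0.5", 4000}, {"10.0.0.6", 4000}])
# => {{"10.0.0.5", 4000}, [{"10.0.0.6", 4000}, {"10.0.0.5", 4000}]}
```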

The thing is, I don’t run my own server farm and I don’t have a hardware load balancer. So instead I’ll just use a reverse proxy because it’s software and good enough. But if the reverse proxy goes down then all your connections drop; with a load balancer they probably won’t.

See haproxy and Multi-Layer Switch.

An SRV record is a special record for applications or protocols that want to use DNS for things other than regular lookups. Basically, if you create an application that requires a special server, then you could use an SRV record to give its location and port info.
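
For example, from Elixir you can inspect SRV records with Erlang’s built-in resolver (the service name below is just a placeholder):

```elixir
:inet_res.lookup(~c"_sip._tcp.example.com", :in, :srv)
# Each entry is a {priority, weight, port, target} tuple, for example
# {10, 60, 5060, ~c"sipserver.example.com"}
```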

EDIT: I’ve never actually seen an SRV record used in real life. Even when I looked I didn’t see it. This is possibly because SRV records are really nice internally, but given the dynamic nature of most of the things the internet is used for these days, DNS lookups aren’t quite good enough for the purpose. In other words, why not make an HTTP request to get information that I can guarantee is hot and fresh, when a DNS record could be as stale as the TTL?

1 Like

Thank you for the detailed explanation, @Azolo :heart_eyes:!

I dove a little deeper into what SRV records are exactly. Turns out they have been used and supported for a long time in VoIP, XMPP, Kerberos and other common network protocols and schemes.
Instead of the ‘round-robin’ scheme with A (or AAAA) records, an SRV record allows you to specify the order (priority) in which the specified hosts should be tried if one does not work. This is exactly what we’d want for fault-tolerant webservers.
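
A sketch of how a client could use that, with try_connect/2 standing in for whatever connect or health check your client performs; lower priority values are tried first, with weight as a simple tie-breaker:

```elixir
defmodule SrvFailover do
  # service is a charlist such as ~c"_http._tcp.example.com"
  def connect(service, try_connect) do
    service
    |> :inet_res.lookup(:in, :srv)
    |> Enum.sort_by(fn {priority, weight, _port, _target} -> {priority, -weight} end)
    |> Enum.find_value(fn {_priority, _weight, port, target} ->
      case try_connect.(List.to_string(target), port) do
        {:ok, conn} -> conn
        _error -> nil   # this host did not answer; fall through to the next one
      end
    end)
  end
end
```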

However, even though people have been requesting browser support for more than ten years, browsers still do not support them, for poorly explained reasons.

This is so weird…

1 Like

The DNS system provides quite a lot of goodness that is hidden behind poor abstractions from the API perspective, and its power goes largely unused from a standard application perspective.

For example, consider an HTTP client looking for an HTTP server. Because “the web” came before standardisation of the SRV record type, the semantics are built into the URL. But an SRV record is intended to provide a means of service discovery.

If we were starting today, an HTTP client (a web browser, for example) would ask the DNS for the name or address of something that provides the http service over the tcp protocol for the given domain:

_http._tcp.example.com 

says “give me the server that provides the http service over the tcp protocol for the domain example.com”.

Whilst DNS does have some basic way to define server weighting and priority, there can be other ways that a DNS server could decide what address(es) a client should consider. These considerations might be round robin, network distance, CPU load, …

These considerations aren’t built into the DNS spec, but are possible.

Similarly, for an Elixir/Erlang cluster, the API for such service registration and discovery could leverage the same repository but with slightly different contracts. The TTL, which is an important part of DNS caching and distribution, could be ignored in a BEAM-side API. That’s part of my experiment: given that DNS records and distribution (i.e. slave name servers) are well known and supported, how could that be leveraged in a more BEAM-specific way? One addition is to add mDNS on the side - that would appear to fit more easily into the Erlang cluster model, since in that model most networks would be encouraged to be on the same subnet. But again, mDNS still builds on the DNS design.
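
As a rough sketch of the BEAM-side idea (the service name, node basename and addressing are all hypothetical): resolve a service name to addresses and try to connect to the nodes found there, ignoring the TTL entirely:

```elixir
defmodule DnsClusterSketch do
  # Assumes the nodes were started with names like myapp@10.0.0.5.
  def connect_all(service \\ ~c"myapp.internal", node_base \\ "myapp") do
    service
    |> :inet_res.lookup(:in, :a)
    |> Enum.map(fn {a, b, c, d} -> :"#{node_base}@#{a}.#{b}.#{c}.#{d}" end)
    |> Enum.map(&Node.connect/1)
  end
end
```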

Anyway, I’m not trying to sell anyone on DNS here, I’m just experimenting. Service registration, discovery and distribution are interesting topics - and they link to Qqwy’s additional question on content distribution, which is itself a subset of computation and content distribution…

Yes, DNS is a name service - it doesn’t provide any guarantees as to availability. Which server to use is more of a client responsibility. The name service just provides some suggestions.

But there is room, within the standard, to provide suggestions that build on top of the standard.

For example, if I know that the client request comes from a particular location (either in geographic terms or in network distance), the name server could make a determination as to the weighting and priority of the servers it provides back to the client.

Anycast networks have the “network distance” part built in. But to make that work you need an ASN and some knowledge of BGP - it’s a level of complexity above the normal TCP routing layer.

Just an aside (and not wishing to dominate the thread): Dynamic DNS Update Leases provide a standardised mechanism whereby the name server informs clients when there is a change to certain records. This is defined as part of the wide-area mDNS spec and it largely overcomes the TTL/caching issue. But for some reason wide-area mDNS also hasn’t been widely adopted.

DNS is horrible, like utterly horrible. It was not designed for how it is being used now. Currently a single lone DNS server in, say, Egypt can cause a bad DNS propagation that takes down, say, YouTube for 20 minutes for a large portion of the world until a new update fixes it (yes, this happened). There is no security, no checks, no failover, no priority handling, etc… The original devs of DNS say that it SHOULD NOT BE USED, and they have in fact created a protocol that could replace it and fixes all those issues, but it is not and likely will not ever be used because DNS is so entrenched now…

But yes, I’d say you need a front-end reverse proxy. I’ve used Nginx for that for years and years with no issue at all, it is awesome. :slight_smile:

1 Like

So you put your faith in this server running nginx not falling over?

That is its sole purpose; it runs nothing else and does nothing else. If I really cared I’d have a failover with the same IP that was disabled until needed, as described earlier in the thread, but that is all it does, via fiber links to the actual servers. It’s never gone down yet (though now that I say that it may be the first time). The hardware has been updated and replaced via hotswap most of the time, with one motherboard change, but the new one was brought online before the old one was removed, so no downtime anyway. I’m lazy when I can be; I prefer things to just always work.

1 Like

I agree with pretty much everything in @OvermindDL1’s post. Except that if I needed availability I would rather have hardware be my point of failure than software.

Also that hardware would be managed by someone that wasn’t me or my coworkers.