Reliance on failure detection vs Timeouts

Crowdhailer · December 18, 2016, 11:56am

I watched the gossiping unikernels talk by Andreas Garnæs at Code Mesh, in which he discussed the SWIM protocol. My takeaway was that failure detection is mostly done by timeouts, i.e. a node that does not respond to a ping is marked as suspect, and if it fails to respond to several pings it is considered dead. This information is then propagated to the other nodes.

In a distributed system failure detection can’t ever be 100% reliable. Therefore it is always good practice to have timeouts in addition to monitoring a process/node when sending it a message to which a reply is expected.

It seems to me that if a timeout is present then monitoring is unnecessary. It can certainly be helpful, i.e. if a failure is detected the caller can be notified before the timeout expires, but it is an optimization.
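To make the trade-off concrete, here is a minimal sketch of a request/reply call that uses both a monitor and a timeout. The module name `CallSketch` and the `{:call, from, ref, request}` / `{:reply, ref, reply}` message shapes are made up for illustration and are not OTP's actual wire format. Removing the `:DOWN` clause would still leave a working call; the caller would just wait out the full timeout when the target is already dead.

```elixir
defmodule CallSketch do
  # Hypothetical helper: send a request and wait for a reply,
  # a :DOWN notification, or the timeout, whichever comes first.
  def call(pid, request, timeout \\ 5_000) do
    ref = Process.monitor(pid)
    send(pid, {:call, self(), ref, request})

    receive do
      {:reply, ^ref, reply} ->
        # Reply arrived: drop the monitor and any pending :DOWN message.
        Process.demonitor(ref, [:flush])
        {:ok, reply}

      {:DOWN, ^ref, :process, ^pid, reason} ->
        # The runtime reports the exit before the timeout fires.
        {:error, {:down, reason}}
    after
      timeout ->
        # Fallback: no reply and no :DOWN, so the cause is unknown.
        Process.demonitor(ref, [:flush])
        {:error, :timeout}
    end
  end
end
```

Calling it against a process that has already exited returns `{:error, {:down, :noproc}}` straight away, while a process that is alive but never replies only produces `{:error, :timeout}` once the timeout elapses.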

My question is, is there anything in OTP that could not be constructed without links/monitors?

sashaafm · December 18, 2016, 12:21pm

Well there’s the after syntax in the receive do construct:

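receive do
  # Any message that arrives before the timeout is printed.
  msg -> IO.puts "#{inspect msg}"
after
  # Runs if no message arrives within 5000 ms.
  5000 -> IO.puts "it's dead, Jim"
end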

DianaOlympos · December 18, 2016, 1:12pm

So to go a bit deeper: monitoring uses timeouts too.

That said, there is a big difference between timeouts and monitoring. The difference, and probably your confusion, comes from a fundamental problem of asynchronous systems that is particularly visible in distributed systems:

You cannot distinguish a node that is taking a long time to answer from a node that has died.

So timeouts and heartbeats are your “last resort” for detecting a node that has stopped answering.

But monitoring gives you something extra. It lets you know right away when something dies, along with how it died, because the system tells you. A timeout, by contrast, tells you nothing about the system you are talking to.

It can be stuck in a “stop the world” GC, it can be on the wrong side of a one-way network partition, you may have used up all your available sockets, the network may have died… or the node may just be dead.

Keep in mind that monitoring is not exactly fault detection. It is the runtime informing you that the process has died, whether through a graceful shutdown or a crash. You are not actively checking how it is working; you have just registered as wanting its data when it dies.
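As a small illustration of that difference, the sketch below spawns a monitored process and reports either the exit reason delivered by the runtime or a bare timeout. The module name `DownReasonSketch` and its return tuples are invented for this example; `spawn_monitor/1` and the `{:DOWN, ref, :process, pid, reason}` message are standard.

```elixir
defmodule DownReasonSketch do
  # Hypothetical helper: run `fun` in a monitored process and report
  # how it ended, or :timeout if nothing is known after `timeout` ms.
  def watch(fun, timeout \\ 1_000) do
    # spawn_monitor/1 sets up the monitor atomically with the spawn,
    # so the real exit reason is not lost to a race.
    {pid, ref} = spawn_monitor(fun)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} ->
        # The runtime says that it died and with which reason.
        {:down, reason}
    after
      timeout ->
        # Silence only: slow, stuck, and unreachable all look the same.
        Process.demonitor(ref, [:flush])
        :timeout
    end
  end
end
```

A crash via `exit(:boom)` comes back as `{:down, :boom}` and a function that simply returns comes back as `{:down, :normal}`, whereas a process that just sits there yields `:timeout` with no further information.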

OvermindDL1 · December 18, 2016, 9:18pm

Specifically, monitors and links watch a remote node by pinging it (a single process does this for the entire node, not one per monitor/link). When the node dies, messages are sent locally saying that any linked or monitored processes on it have died. If a given remote process dies but its node does not, the node knows it died and sends a message immediately. Timeouts are everywhere and are used as fallbacks in all cases.
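For the node-level side of this, a rough sketch of subscribing to node status messages is shown below. It assumes the Erlang `:net_kernel.monitor_nodes/2` call with the `:nodedown_reason` option, which delivers three-element `{:nodedown, node, info}` messages carrying a reason; the module name `NodeWatchSketch` and the tagged tuples it returns are made up for illustration.

```elixir
defmodule NodeWatchSketch do
  # Subscribe the calling process to node up/down messages,
  # asking for the reason to be included on :nodedown.
  def subscribe do
    :net_kernel.monitor_nodes(true, [:nodedown_reason])
  end

  # Hypothetical helper: normalise a node status message.
  def interpret({:nodedown, node, info}) do
    {:down, node, Keyword.get(info, :nodedown_reason, :unknown)}
  end

  def interpret({:nodeup, node, _info}), do: {:up, node}

  # Two-element form, as sent when no options list was given.
  def interpret({:nodedown, node}), do: {:down, node, :unknown}
  def interpret({:nodeup, node}), do: {:up, node}
end
```

After `NodeWatchSketch.subscribe()`, the messages arrive in the subscriber's mailbox and can be passed to `interpret/1` from a `receive` block or a `handle_info/2` callback.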

Crowdhailer · December 18, 2016, 11:12pm

@OvermindDL1

“Timeouts are everywhere and are used as fallbacks in all cases.”

That’s what I suspected, thanks for the answer. Something I want to play with is making my own GenServer/StateMachine implementations (using macros for messages received instead of callbacks). Creating a V1 where I only use timeouts will probably make everything easier.

@DianaOlympos How do you extract knowledge of how a node dies? Doesn’t monitoring just send you the message {:nodedown, node}?

@sashaafm Not exactly what I was looking for, but cheers for the reply.