Distributed Health Check

joaoevangelista · April 7, 2018, 8:23pm

Hi guys.

So I’ve got an idea, mostly as a proof of concept so I can learn more about distributed applications. It’s basically a master/slave setup for health checks across deployment regions.

Each region has a reporter app that receives heartbeats from the apps running in the same region and reports back to a master in another region. The reporters don’t know about each other; each one only knows its master. I’m aggregating health checks.
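
A minimal sketch of what the reporter's bookkeeping could look like, assuming heartbeats are stored as a map of app name to last-seen timestamp. The `Reporter` module, its function names, and the timeout value are all hypothetical and only illustrate the idea:

```elixir
defmodule Reporter do
  # Hypothetical sketch: pure functions that a reporter process could wrap.

  # Record a heartbeat from `app` received at `now_ms` (milliseconds).
  def heartbeat(state, app, now_ms), do: Map.put(state, app, now_ms)

  # Build the aggregate a reporter would send to its master: an app is :up
  # if its last heartbeat is at most `timeout_ms` old, otherwise :down.
  def summary(state, now_ms, timeout_ms) do
    Map.new(state, fn {app, last_seen} ->
      status = if now_ms - last_seen <= timeout_ms, do: :up, else: :down
      {app, status}
    end)
  end
end

state =
  %{}
  |> Reporter.heartbeat("billing", 1_000)
  |> Reporter.heartbeat("search", 9_500)

# "billing" was last seen 9 s ago and "search" 0.5 s ago, with a 5 s timeout.
IO.inspect(Reporter.summary(state, 10_000, 5_000))
```

Keeping this part pure makes it easy to test on its own; the HTTP endpoints and the call to the master would sit around it.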

It’s all based on HTTP APIs, with a schema based on this spec.

Is this a valid idea, or is it total BS to implement such a system? I’m open to opinions.

jordiee · April 7, 2018, 8:54pm

If it is something you could see yourself using, then it is absolutely valid. I believe most people use a separate service (SaaS) to monitor the health of their applications. The only limitation I see is that if you use distributed Erlang features like :rpc.multicall or something similar, it will only really be useful for Elixir/Erlang applications. I am actually in the process of making a health-check type of thing for appdoctor.io (nothing there now, still about 4 months away from production testing), and I find that multi-region testing in Elixir is really easy and fun with the rpc functions. Basically, my approach is to use :rpc.multicall to make the request from multiple nodes in different regions. I can then report the results all in one “call”.
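
A rough sketch of that approach, assuming a probe module that is loaded on every node. `HealthProbe`, `MultiRegionCheck`, and the URL are hypothetical names for illustration; `:rpc.multicall/5` is the standard Erlang function and returns `{results, bad_nodes}`:

```elixir
defmodule HealthProbe do
  # Hypothetical probe. A real one would make an HTTP request to `target`
  # from the node it runs on; this stub just reports which node ran it.
  def check(target), do: {node(), target, :ok}
end

defmodule MultiRegionCheck do
  # Run the probe on every node in `nodes` with a single multicall.
  def run(nodes, target, timeout_ms \\ 5_000) do
    {replies, bad_nodes} =
      :rpc.multicall(nodes, HealthProbe, :check, [target], timeout_ms)

    # `replies` holds one entry per node that answered (a call that crashed
    # shows up as {:badrpc, reason}); `bad_nodes` lists nodes that could
    # not be reached.
    %{replies: replies, unreachable: bad_nodes}
  end
end

# With a real cluster you would pass Node.list(); calling only the local
# node keeps the example runnable without distribution.
IO.inspect(MultiRegionCheck.run([node()], "https://example.com/health"))
```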

Circling back: if it’s something you will use, then I am sure others may use it as well!

joaoevangelista · April 7, 2018, 9:53pm

The reporter and master would be written in Elixir and will probably use the built-in rpc. The apps will report their health in the heartbeat, or the reporter will POST to the app, so I can keep the apps implementation-agnostic. I don’t think I will actually use it, I’m just doing it for fun. Thanks!

jordiee · April 7, 2018, 10:00pm

Also, if you are just concerned with an app’s health, you can subscribe to nodedown events, because if you are going to use rpc and a node is down, the health check will fail anyway.

http://erlang.org/doc/man/net_kernel.html#monitor_nodes-1

If a node goes down, you could fall back to a POST to check whether it’s really down or you just have a split-brain issue.
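
Something along these lines could work as a starting point. It is only a sketch: `NodeWatcher` is a made-up name, and the HTTP fallback is left as a comment because it depends on how the apps expose their health endpoint. `:net_kernel.monitor_nodes/1` delivers `{:nodeup, node}` and `{:nodedown, node}` messages to the calling process:

```elixir
defmodule NodeWatcher do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Nodes currently believed to be down.
  def down_nodes(pid), do: GenServer.call(pid, :down_nodes)

  @impl true
  def init(:ok) do
    # Subscribe this process to node status messages.
    :net_kernel.monitor_nodes(true)
    {:ok, MapSet.new()}
  end

  @impl true
  def handle_call(:down_nodes, _from, down) do
    {:reply, MapSet.to_list(down), down}
  end

  @impl true
  def handle_info({:nodedown, node}, down) do
    # A fallback HTTP check could go here to tell a dead node apart from a
    # network partition before marking it as down.
    {:noreply, MapSet.put(down, node)}
  end

  def handle_info({:nodeup, node}, down), do: {:noreply, MapSet.delete(down, node)}
  def handle_info(_other, down), do: {:noreply, down}
end

{:ok, pid} = NodeWatcher.start_link()

# Simulate the message the runtime would send when a node disconnects.
send(pid, {:nodedown, :"app@eu-west"})
IO.inspect(NodeWatcher.down_nodes(pid))
```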

joaoevangelista · April 7, 2018, 10:20pm

Nice, I didn’t know about nodedown. I’ll check it out and add it to the reporters.