I am designing an HA system with X nodes behind a load balancer. I need to make sure that if a process dies in the middle of a task, the information that process was handling is picked up by another one. To abstract this a bit: say there are Y processes per physical server. I need to make sure that all (or most) of the information being handled by those processes is available to equivalent processes on a separate physical server.
With my limited experience in Elixir (close to none), I was thinking of something along these lines: handle all runtime information in memory, and keep a second level of information in Mnesia, which, I believe, should be able to replicate that information across different physical servers. Finally, a third level of storage could be a database (relational or not, I have not yet decided) used for more permanent storage, historical reports, and perhaps basic configuration information. This DB could take the form of a DB cluster (e.g. Oracle RAC).
How does something like that look? Am I just shooting in the dark here?
First of all, you are not describing an HA system here, but a fully consistent one. You want to never lose a request in flight. That is not what an HA system is. Now the real question is: are you OK with the client retrying the request after a timeout? Or with giving wrong data back?
That. You usually start by going to the business, having “the talk” (no, you cannot have availability and consistency), and then start asking them to make calculations for the economic tradeoffs: what’s the cost of not processing something? What’s the cost of processing something twice? And then some engineering data: how often will it happen? What’s the cost of reducing the “how often will it happen”? Once that is all figured out, you can start talking about solving it in Elixir.
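The calculation described above can be sketched in a few lines of Python. All figures in the usage example are made up for illustration, and `FailureMode` and `worth_mitigating` are hypothetical names, not part of any library. The model is deliberately crude: expected annual cost is simply frequency times cost per event.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    events_per_year: float  # how often it happens (an estimate from engineering)
    cost_per_event: float   # what each occurrence costs (an estimate from the business)

    def expected_annual_cost(self) -> float:
        return self.events_per_year * self.cost_per_event


def worth_mitigating(mode, reduced_events_per_year, mitigation_cost_per_year):
    """True if the money saved by failing less often exceeds the cost of the mitigation."""
    saved = (mode.events_per_year - reduced_events_per_year) * mode.cost_per_event
    return saved > mitigation_cost_per_year


if __name__ == "__main__":
    # Invented numbers: 12 lost requests a year, 500 currency units each.
    lost = FailureMode("request lost", events_per_year=12, cost_per_event=500.0)
    print(lost.expected_annual_cost())
    print(worth_mitigating(lost, reduced_events_per_year=1, mitigation_cost_per_year=4000))
    print(worth_mitigating(lost, reduced_events_per_year=1, mitigation_cost_per_year=10000))
```

The same comparison can be run for "processed twice" versus "not processed", which is usually the number that decides between at-least-once and at-most-once handling.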
For all practical purposes you can have C & A by creating the underlying infrastructure to support as many nines as the problem requires. We need to stop pretending that software is the only variable in the equation. If I run an AP system with all nodes on the same switch and that switch dies, my AP system is not gonna help me much.
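The switch example can be put into rough numbers. The sketch below uses the standard series/parallel availability formulas and assumes component failures are independent, which real hardware only approximates. The availability figures are invented for illustration, and all function names are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def availability_from_nines(nines: int) -> float:
    # 3 nines -> 0.999
    return 1 - 10 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    return (1 - availability) * MINUTES_PER_YEAR


def series(*availabilities: float) -> float:
    # Every component must be up (e.g. the cluster AND the switch in front of it).
    return math.prod(availabilities)


def parallel(*availabilities: float) -> float:
    # At least one component must be up (e.g. redundant nodes), assuming independent failures.
    return 1 - math.prod(1 - a for a in availabilities)


if __name__ == "__main__":
    nodes = parallel(0.99, 0.99)        # two redundant nodes at two nines each
    system = series(nodes, 0.999)       # both behind a single switch at three nines
    print(f"nodes alone:   {nodes:.6f}")
    print(f"behind switch: {system:.6f}")
    print(f"downtime:      {downtime_minutes_per_year(system):.0f} min/year")
```

Under these assumptions the redundant nodes reach roughly four nines on their own, but the whole system ends up slightly below the single switch's three nines: the shared component caps what the software can deliver.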
Indeed, that’s the answer to the “What’s the cost of reducing the ‘how often will it happen’?” question I posed. The heart of the matter is that you start with a money/engineering analysis, not by writing Elixir code.
You are right. I guess I let my bias creep in and did not read your post as carefully as I should have. I just see too many cases where people try to solve things at the application layer that should be solved by proper hardware/infrastructure choices.
I apologize, I only came back to this post today. Thank you all for your replies and suggestions. I read my original question several times and I am not sure I deserve the beating. In any case, my question is aimed at exploring theoretical possibilities, to understand how the design could be done better.