Need help setting up a master/slave Elixir service architecture

nikhilbelchada · September 14, 2019, 4:44pm

Hello Community,

We are relatively new to Elixir. So far we have a couple of standalone services running in production, and they are performing exceptionally well.

For our next service, we have a use case that we haven't been able to figure out how to implement properly in Elixir.

Our service acts as a communicator between a third-party source and our Kafka stream: we receive data from the third party over TCP and push it to Kafka to be consumed by other services. The third-party source has two instances, a primary and a backup, and we can use the backup if we stop receiving data from the primary. The communicator service will be deployed on two different instances (the first receives from the primary, the second from the backup), so if we start both communicator machines, they will both emit duplicate data to Kafka, which we don't want. We want only one of them to push to Kafka, and if it fails, the second should take over pushing data to Kafka.

In other words, we want something like a master/slave cluster setup for our service: the master pushes data to Kafka while the slave stays paused, and as soon as the master goes down, the slave starts pushing data to Kafka.
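A minimal Elixir sketch of the "master pushes, slave pauses" idea, assuming both instances keep reading from their own upstream source and only the active one hands messages to Kafka. The module Communicator.Gate, its push/2 and promote/1 functions, the 1,000-message replay buffer, and the publish function (a stand-in for whatever Kafka client call you use) are all hypothetical.

```elixir
defmodule Communicator.Gate do
  # Hypothetical sketch: every instance keeps reading from its own upstream
  # source and calls push/2; only an active gate hands messages to Kafka.
  # A standby gate keeps the last @keep messages and replays them on promotion,
  # so a switch-over produces some duplicates rather than a gap.
  use GenServer

  @keep 1_000

  # `publish` is a 1-arity function standing in for the real Kafka producer call.
  def start_link(publish) when is_function(publish, 1),
    do: GenServer.start_link(__MODULE__, publish)

  def push(gate, msg), do: GenServer.cast(gate, {:push, msg})
  def promote(gate), do: GenServer.call(gate, :promote)

  @impl true
  def init(publish),
    do: {:ok, %{publish: publish, active: false, recent: :queue.new(), size: 0}}

  @impl true
  def handle_cast({:push, msg}, %{active: true} = state) do
    state.publish.(msg)
    {:noreply, state}
  end

  def handle_cast({:push, msg}, state), do: {:noreply, remember(state, msg)}

  @impl true
  def handle_call(:promote, _from, state) do
    # Replay the buffered tail, oldest first, then switch to direct publishing.
    state.recent |> :queue.to_list() |> Enum.each(state.publish)
    {:reply, :ok, %{state | active: true, recent: :queue.new(), size: 0}}
  end

  # Bounded buffer: once full, drop the oldest message before adding the new one.
  defp remember(%{size: size} = state, msg) when size >= @keep do
    {_dropped, rest} = :queue.out(state.recent)
    %{state | recent: :queue.in(msg, rest)}
  end

  defp remember(state, msg),
    do: %{state | recent: :queue.in(msg, state.recent), size: state.size + 1}
end
```

Replaying the buffered tail trades a few duplicates for no gap at switch-over; if duplicates are worse than gaps for your consumers, the standby can simply drop messages instead.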

Possible solutions we have considered:

keepalived, to get an active/passive setup
Keeping state in Redis and using it to decide whether a service should push data onward or not (this is a very bad solution, though)

We think there must be a better way to handle this clustering use case using Erlang's distributed architecture.

Thanks!

tty · September 15, 2019, 9:30pm

I can think of two possible solutions, both variations on your keepalived idea:

Distributed Applications. IMHO this is misnamed, as it describes a classic active/passive failover.
Global GenServer, i.e. registering the GenServer under a global name.

In your Application's start/2 callback, pattern match on the start type ({:takeover, node} or {:failover, node}) to determine which source to pull from.
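A minimal sketch of the distributed-application variant, assuming a release whose sys.config lists the communicator app in the kernel distributed setting. The node names a@host1/b@host2, the module names, and Communicator.Pipeline are hypothetical. OTP runs the app on one node at a time and passes the reason for starting in start_type:

```elixir
defmodule Communicator.Application do
  # Hypothetical application callback. Assumes the release's sys.config contains
  # something like (Erlang syntax, node names are placeholders):
  #
  #   [{kernel, [
  #      {distributed, [{communicator, 5000, ['a@host1', 'b@host2']}]},
  #      {sync_nodes_optional, ['a@host1', 'b@host2']},
  #      {sync_nodes_timeout, 30000}]}].
  #
  # a@host1 is preferred; 5000 ms is how long OTP waits after a node goes down
  # before restarting the app on the next node in the list.
  use Application

  @impl true
  def start(start_type, _args) do
    children = [
      # Hypothetical worker that reads from the chosen source and pushes to Kafka.
      {Communicator.Pipeline, source_for(start_type)}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: Communicator.Supervisor)
  end

  # :normal           -> ordinary start on the preferred node (a@host1)
  # {:failover, node} -> `node` went down and OTP restarted the app here (b@host2)
  # {:takeover, node} -> the preferred node is back and takes over from `node`
  def source_for(:normal), do: :primary
  def source_for({:failover, _dead_node}), do: :backup
  def source_for({:takeover, _old_node}), do: :primary
end
```

One caveat: if only b@host2 is up at boot, it gets a plain :normal start as well, so deriving the source from node-local config or node() is more robust than relying on the start type alone.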

Likewise, for the global GenServer solution you can determine the source via sys.config or the node name.
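A minimal sketch of the global-registration variant, assuming all communicator nodes are connected in one Erlang cluster. Communicator.Election and the :communicator_active name are hypothetical; :global.register_name/2, :global.whereis_name/1 and Process.monitor/1 are standard. Whichever process registers the name first is active; the others monitor it and try to register again when it goes away (a crash, or its node disconnecting):

```elixir
defmodule Communicator.Election do
  # Hypothetical sketch: each communicator node runs one Election process. The
  # process holding the :global name is the active instance (allowed to push
  # to Kafka); the others monitor it and retry when it goes away.
  use GenServer

  @name :communicator_active

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil)

  def active?(pid), do: GenServer.call(pid, :active?)

  @impl true
  def init(nil), do: {:ok, claim(%{active: false, ref: nil})}

  @impl true
  def handle_call(:active?, _from, state), do: {:reply, state.active, state}

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # The active instance died or its node became unreachable: try to take over.
    {:noreply, claim(%{state | ref: nil})}
  end

  def handle_info(_other, state), do: {:noreply, state}

  defp claim(state) do
    case :global.register_name(@name, self()) do
      :yes ->
        %{state | active: true}

      :no ->
        case :global.whereis_name(@name) do
          # The holder vanished between the two calls; try again.
          :undefined -> claim(state)
          pid -> %{state | active: false, ref: Process.monitor(pid)}
        end
    end
  end
end
```

The pushing code would then check active?/1 (or be started only by the winner). During a network partition, :global on each side sees the name as free, so both sides become active; when the partition heals, the default conflict resolution (:global.random_exit_name/3) kills one of the two registered processes, and if it sits under a supervisor it restarts and rejoins as standby.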

al2o3cr · September 16, 2019, 3:30am

Nodes don’t usually send out an orderly notification of their demise; what would the system use to detect “liveness”? What happens to messages that arrive before the master is declared “down”?

What happens if the system goes split-brain (due to weird local networking failures, for instance) and both instances are pushing events into Kafka?

Deciding when & how to handle failover is the hard part, the core “receive TCP and put in Kafka” logic is straightforward. You’ll want to think carefully about exactly what the goal is - keep the data flowing when the upstream primary goes down? Keep the data flowing when the primary instance of the communicator service goes down? Both? - and what guarantees that goal requires from the system.
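On the liveness question, Erlang distribution itself reports unreachable nodes. A minimal sketch of a standby-side watcher, assuming a known peer node name: it subscribes with :net_kernel.monitor_nodes/1, promotes itself on {:nodedown, peer}, and steps back on {:nodeup, peer}. Communicator.PeerWatcher and the peer name are hypothetical, and, as the split-brain question points out, :nodedown only means the peer is unreachable from here, not that it has stopped pushing.

```elixir
defmodule Communicator.PeerWatcher do
  # Hypothetical standby-side watcher. Erlang distribution reports
  # {:nodedown, node} when a connected node stops answering within the
  # net_ticktime window; this process treats that as "the active peer is gone".
  # Caveat: after a network split both sides see each other as down.
  use GenServer

  def start_link(peer) when is_atom(peer), do: GenServer.start_link(__MODULE__, peer)

  def active?(pid), do: GenServer.call(pid, :active?)

  @impl true
  def init(peer) do
    # Subscribe this process to {:nodeup, node} / {:nodedown, node} messages.
    :ok = :net_kernel.monitor_nodes(true)
    {:ok, %{peer: peer, active: false}}
  end

  @impl true
  def handle_call(:active?, _from, state), do: {:reply, state.active, state}

  @impl true
  def handle_info({:nodedown, peer}, %{peer: peer} = state) do
    # The active peer is unreachable: start pushing to Kafka from this node.
    {:noreply, %{state | active: true}}
  end

  def handle_info({:nodeup, peer}, %{peer: peer} = state) do
    # The peer is reachable again: step back to standby.
    {:noreply, %{state | active: false}}
  end

  def handle_info(_other, state), do: {:noreply, state}
end
```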

nikhilbelchada · September 16, 2019, 4:53am

Thanks!!
I was not aware of Erlang distributed applications or global GenServers. Will give them a try.

nikhilbelchada · September 16, 2019, 4:53am

You make a valid point. We will think about that as well.

For us, when a switch-over happens, it is fine if both the master and slave instances send data to Kafka for some time (maybe 3 to 5 minutes, but no longer). So we want a mechanism that reliably completes the switch-over within 5 minutes at most.
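A rough way to check a setup against the 5-minute budget, assuming the distributed-application approach: worst-case switch-over is roughly failure-detection time plus the :distributed timeout plus application startup. Per the kernel documentation, a node that stops responding is detected after between 0.75 and 1.25 times net_ticktime (default 60 s, settable with -kernel net_ticktime in vm.args). The module name and the 10-second startup figure are assumptions.

```elixir
defmodule Communicator.FailoverBudget do
  # Rough worst-case switch-over time for the distributed-application setup:
  # failure detection + :distributed timeout + application startup.
  # Detection: the kernel docs state an unresponsive node is noticed after
  # between 0.75 * net_ticktime and 1.25 * net_ticktime; we take the upper bound.
  def worst_case_ms(net_ticktime_s, dist_timeout_ms, startup_ms) do
    detection_ms = div(net_ticktime_s * 1_000 * 5, 4)
    detection_ms + dist_timeout_ms + startup_ms
  end

  # The 5-minute limit from the post, in milliseconds.
  def within_budget?(estimate_ms, budget_ms \\ 5 * 60 * 1_000),
    do: estimate_ms <= budget_ms
end

# Default net_ticktime (60 s), a 5 s :distributed timeout, an assumed 10 s startup.
worst = Communicator.FailoverBudget.worst_case_ms(60, 5_000, 10_000)
# worst is 90_000 ms: 75 s detection + 5 s timeout + 10 s startup
true = Communicator.FailoverBudget.within_budget?(worst)
```

With these numbers the worst case is 90 seconds, and the same arithmetic shows how far net_ticktime or the timeout could be raised before the 5-minute limit is at risk.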