We have a stateful Elixir service at work running in a Kubernetes cluster which keeps some user-related ephemeral data in processes (think of it as a session process). At the moment we lose this state on every redeployment, which hasn’t been an issue yet. We do plan to keep additional information in there, which is why we now want to investigate how to keep this state around beyond a deployment.
The current idea is to run multiple instances of the app and also sync the process state into an in-memory data store. If a node then goes down - for example due to a deployment - new processes can be spun up on other nodes which rehydrate their state through this synced state.
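To make the idea concrete, here is a minimal sketch of a session process that writes its state through to a store and rehydrates from it on start. `SessionStore` and `Session` are hypothetical names, and the store is just a local ETS table standing in for whatever synced store we end up with, so this only illustrates the process side, not the replication.

```elixir
defmodule SessionStore do
  # Hypothetical stand-in for the synced in-memory store (mnesia, riak_core, redis, ...).
  # Here it is only a named public ETS table, local to one node.
  @table :session_store

  def init_table do
    :ets.new(@table, [:named_table, :public, :set])
    :ok
  end

  def put(session_id, state), do: :ets.insert(@table, {session_id, state})

  def get(session_id) do
    case :ets.lookup(@table, session_id) do
      [{^session_id, state}] -> {:ok, state}
      [] -> :error
    end
  end
end

defmodule Session do
  use GenServer

  def start_link(session_id), do: GenServer.start_link(__MODULE__, session_id)
  def put(pid, key, value), do: GenServer.call(pid, {:put, key, value})
  def get(pid, key), do: GenServer.call(pid, {:get, key})

  @impl true
  def init(session_id) do
    # Rehydrate from the store if an earlier process left state behind.
    data =
      case SessionStore.get(session_id) do
        {:ok, saved} -> saved
        :error -> %{}
      end

    {:ok, %{id: session_id, data: data}}
  end

  @impl true
  def handle_call({:put, key, value}, _from, state) do
    data = Map.put(state.data, key, value)
    # Write through on every change so a replacement process can pick it up.
    SessionStore.put(state.id, data)
    {:reply, :ok, %{state | data: data}}
  end

  def handle_call({:get, key}, _from, state) do
    {:reply, Map.get(state.data, key), state}
  end
end
```

Writing through on every change is the simplest policy; batching or periodic snapshots would trade durability for fewer writes.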
I’ve spent some time looking into our options:
- mnesia
- riak_core
- redis
mnesia and riak_core both have the charm that they run alongside our application without needing to spin up something separate like in the case of redis. Now I have some thoughts on each of these options:
mnesia
From what I’ve read so far, mnesia seems like a solid choice as it comes with OTP, but it has one caveat: it has no built-in support for handling split-brain scenarios.
Since we’re running this app in a Kubernetes cluster and not on a telephone switch with a shared backplane, split-brain scenarios are not something we can ignore. This is not necessarily a deal breaker for us but it makes the next contender much more interesting.
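For what it’s worth, mnesia does at least report a partition after the fact: a process that subscribes to system events receives an `inconsistent_database` event. Below is a rough sketch of a watcher that records those events. `PartitionWatcher` is a hypothetical name, it assumes mnesia is already running on the node, and deciding what to do about the split (which side wins, how to merge) is left entirely open.

```elixir
defmodule PartitionWatcher do
  use GenServer
  require Logger

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Returns the partition events seen so far, oldest first.
  def events, do: GenServer.call(__MODULE__, :events)

  @impl true
  def init(_opts) do
    # Requires mnesia to be started on this node.
    {:ok, _node} = :mnesia.subscribe(:system)
    {:ok, []}
  end

  @impl true
  def handle_info({:mnesia_system_event, {:inconsistent_database, context, node}}, seen) do
    # context is e.g. :running_partitioned_network or :starting_partitioned_network.
    Logger.error("mnesia partition detected (#{inspect(context)}) with #{inspect(node)}")
    # Resolution is application specific and intentionally not implemented here.
    {:noreply, [{context, node} | seen]}
  end

  def handle_info(_other_event, seen), do: {:noreply, seen}

  @impl true
  def handle_call(:events, _from, seen), do: {:reply, Enum.reverse(seen), seen}
end
```

This only detects the problem; it does not heal anything, which is exactly the part we would have to build ourselves.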
riak_core
My current understanding is that riak_core is what powers riak KV, riak TS, etc. It’s also better equipped to handle a split-brain scenario than mnesia is.
What has been confusing me though is: how the hell do I use it?
- there is the “official” riak_core repository which seems to be under active development but can’t be found on hex
- there’s also a fork (riak_core_ng) which can be found on hex but development seems to be stalled
I’ve tried to install riak_core_ng but immediately ran into issues as it depends on a fork of poolboy (version spec ~> 0.8.4) and we’re using the “real” poolboy at version 1.5.2. We’re also running on OTP 23, which I expect to give us trouble as the latest release of riak_core_ng was in mid-2018.
redis
Last but not least, Redis is a pretty straightforward choice but kinda makes me sad. We can’t run it inside the same BEAM instance and it would require encoding and decoding our state (which should be manageable with erlang:term_to_binary/1 but still).
Right now it seems like the simple choice though, and I like simple.
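The encoding part could look roughly like this. `SessionCodec` is a hypothetical module name and no Redis client is involved; it only shows the round trip through `:erlang.term_to_binary/1` and `:erlang.binary_to_term/2`.

```elixir
defmodule SessionCodec do
  # Serialize any Erlang/Elixir term into a binary that can be stored as a Redis string value.
  def encode(state), do: :erlang.term_to_binary(state)

  # The :safe option makes decoding fail instead of creating new atoms,
  # which matters once the bytes come from outside the VM.
  def decode(binary) when is_binary(binary), do: :erlang.binary_to_term(binary, [:safe])
end
```

One thing to keep in mind: pids, references and funs inside the state will round-trip as terms but won’t point to anything useful after a restart, so the stored state should be plain data.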
I’d be interested to hear your perspective on this, big bonus if you have actual prod experience to back it up.