Replicated Cache/DB in elixir/erlang? What's the best solution? Use case inside

The short of what I want is a replicated cache that anytime a new node joins the cluster (via libcluster), data from the existing cache on another node is copied to the new node.

Node A has the data.
Node B joins the cluster.
Data is copied from A to B.
Node A leaves the cluster.
Node C joins the cluster.
Data is copied from B to C.

Also if C joins before A leaves, I don’t really care if the data comes from A or B.

This is not mission critical data. It’s just caching a temporary check-out session, so I don’t need any crazy consistency.

Does this exist? What tools would be best to use?

I am not sure if @sb8244 handles dynamic nodes added or removed from the cluster:

Same caveat as above, but do not think these 2 libs handle replication out of the box:

If you have a limited number of nodes, the nodes don't change that often, it's really just a cache, you don't mind lost writes, and you have a limited amount of data to replicate, then you can probably do something naive. The most basic approach would be to listen for node up and down events. When you detect that a new node has joined, you can replicate your data to it. You can use pg2 to maintain a registry of cache processes, replicate to those processes, and have them write into an ETS table or something similar. This approach is very naive and overly chatty, and it'll definitely result in lost writes eventually. But it may be OK for your use case.
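A minimal sketch of that naive approach might look like the following. The module and group names are made up for illustration, and note that `:pg2` was removed in OTP 24 in favour of `:pg`, so adjust for your OTP version:

```elixir
defmodule NaiveCache do
  @moduledoc """
  Naive replicated cache: every write is fanned out to every cache
  process in a :pg2 group, and on :nodeup the whole table is re-sent.
  Chatty, racy, and will lose writes eventually -- a sketch only.
  """
  use GenServer

  @group :naive_cache_group  # hypothetical :pg2 group name

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def put(key, value) do
    # Fan the write out to every cache process in the cluster, ours included.
    for pid <- :pg2.get_members(@group) do
      GenServer.cast(pid, {:replicate, key, value})
    end
    :ok
  end

  def get(key) do
    case :ets.lookup(__MODULE__, key) do
      [{^key, value}] -> {:ok, value}
      [] -> :not_found
    end
  end

  @impl true
  def init(_opts) do
    :ets.new(__MODULE__, [:named_table, :public, read_concurrency: true])
    :ok = :pg2.create(@group)
    :ok = :pg2.join(@group, self())
    # Subscribe to {:nodeup, node} / {:nodedown, node} messages.
    :ok = :net_kernel.monitor_nodes(true)
    {:ok, %{}}
  end

  @impl true
  def handle_info({:nodeup, _node}, state) do
    # A node joined: push our whole table to every remote cache process.
    # (The new node's process may not be in the group yet -- a real
    # implementation would retry or have the newcomer pull instead.)
    for pid <- :pg2.get_members(@group), node(pid) != node() do
      for {key, value} <- :ets.tab2list(__MODULE__) do
        GenServer.cast(pid, {:replicate, key, value})
      end
    end
    {:noreply, state}
  end

  def handle_info({:nodedown, _node}, state), do: {:noreply, state}

  @impl true
  def handle_cast({:replicate, key, value}, state) do
    :ets.insert(__MODULE__, {key, value})
    {:noreply, state}
  end
end
```

The nodeup handler illustrates the race Chris alludes to: there is no guarantee the new node's cache process has joined the group by the time you see `:nodeup`, which is one of the ways writes get lost.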

If any of those caveats are deal breakers then you may want to use an external process like Redis or memcached, or pull in a more complete solution like Lasp or erleans. Otherwise you're going to find yourself solving some pretty gnarly distributed systems problems :stuck_out_tongue_winking_eye:.


This solution I presented is definitely on the naive side. The advantage of it is that it's easy to grok and debug, but Chris's points are very true.

I found that the more complex tools were too much for my simple cache case and went with pg2 + Cachex. This does handle dynamic node membership, because the cache process sends out a request for a cache load when it starts. Lots of room for improvement, but a decent starting point for the naive solution.
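The "pull on start" side of that idea could be sketched like this. The group and cache names are hypothetical, and it assumes Cachex's `export/1` and `import/2` functions for dumping and loading entries (check the Cachex docs for your version):

```elixir
defmodule CacheReplicator do
  @moduledoc """
  On startup, ask an existing member of a :pg2 group (on another node)
  for a dump of its Cachex cache and load it locally, then join the
  group so future newcomers can ask us. Sketch only -- no retry logic,
  no handling of writes that race with the initial load.
  """
  use GenServer

  @group :cache_replicators  # hypothetical :pg2 group name
  @cache :session_cache      # hypothetical Cachex cache name

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    :ok = :pg2.create(@group)

    # Find any member running on a different node and pull its entries.
    case Enum.find(:pg2.get_members(@group), &(node(&1) != node())) do
      nil ->
        # First node in the cluster: nothing to load.
        :ok

      pid ->
        {:ok, entries} = GenServer.call(pid, :dump)
        Cachex.import(@cache, entries)
    end

    :ok = :pg2.join(@group, self())
    {:ok, %{}}
  end

  @impl true
  def handle_call(:dump, _from, state) do
    {:ok, entries} = Cachex.export(@cache)
    {:reply, {:ok, entries}, state}
  end
end
```

Because the newcomer pulls the data itself, this avoids the nodeup race of the push-based version: by the time `init/1` runs, the process knows exactly which existing members it can ask.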


So you only want to use this as an ephemeral cache?

It might be worth looking into Phoenix Presence, because it also fills this need somewhat.

That feels like it would be a misuse of Phoenix.Tracker (you wouldn’t want presence in this case). Tracker is great for monitoring processes across the cluster and can accept metadata, but isn’t the best choice if you don’t need process tracking.

Presence uses CRDTs, which are available standalone as well, although they can have performance trade-offs, so make sure to throw your scenarios at them before committing.
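For a standalone CRDT, the `delta_crdt` library (which Horde builds on) exposes replicated maps directly. A rough sketch with two replicas in one VM, assuming the `put`/`to_map`/`set_neighbours` API from recent versions of the library; in a real cluster each replica would live on a different node and be wired up on nodeup events:

```elixir
# Two replicas of an add-wins, last-write-wins map.
{:ok, crdt_a} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap)
{:ok, crdt_b} = DeltaCrdt.start_link(DeltaCrdt.AWLWWMap)

# Tell each replica about the other so deltas propagate between them.
DeltaCrdt.set_neighbours(crdt_a, [crdt_b])
DeltaCrdt.set_neighbours(crdt_b, [crdt_a])

# Write on one replica...
DeltaCrdt.put(crdt_a, "session:123", %{cart: ["item_a"]})

# ...and after the sync interval the write shows up on the other.
Process.sleep(200)
DeltaCrdt.to_map(crdt_b)
```

Replication here is eventually consistent and writes are merged rather than lost on concurrent updates, which is exactly the performance/consistency trade-off worth benchmarking against your cache workload.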

I was following the work being done on Firenest, which had a branch with some work on seemingly generic state replication. Unfortunately that project has now been archived. Hopefully we'll see some movement in the replication arena from the core team again.

From the latest commit there:

Work on firenest has been halted. We plan to incorporate some of those features in Phoenix.PubSub instead.

Hi, any new experience with this issue after two years, or ideas about a good solution for this kind of use case? Have you considered Mnesia? For example, Memento is an Elixir wrapper around Mnesia.
Did you use Cachex? Any feedback?
I have a similar issue and I'm trying to gather feedback and experience from users of these solutions.

FWIW, I used the Cachex + pg2 solution and had 0 issues with it in production since it shipped. I’ve since left that company, but my understanding is that it’s just churning along doing its thing.

So that’s a good endorsement for me, although it depends on your use case.


To be fair I’d just go for memcached or Redis and forget about it. They both serve this goal excellently.

If there is no distribution involved I'd go 100% Elixir, but when it gets to 3 or more machines my priorities change.

I'm talking about 4-6 apps, pods in k8s. Yes, a 3rd-party/external service is a good solution for this, but it has to be fast, and we have had issues with these external services. For example, my use case is about losing the dependency on the DB (Postgres), so the most important features stay available even when the DB is not. We are preparing for a new client who will use the API and some web endpoints at a rate of about 300 r/s, on top of other clients, so even a few seconds of downtime can be a huge problem for us. But I don't have the budget for some huge solution, and I don't think I need black-box solutions when we have the power of the BEAM.


I agree that an in-process cache (:ets) is the super-performant solution, but I'm not sure how good the BEAM is if you want a distributed ETS cache. Hence why I proposed other programs.

Redis is rock-solid, what problems did you have with it?

Additionally, be extremely careful with k8s. If you don't have 10+ different apps and storage services, with management of private networks and specially tailored infrastructure, then the chances are 99% that you don't need k8s.

You would likely be better served by Nomad or Docker Swarm.

Redis is rock-solid, what problems did you have with it?

Redis and other tools are good, but the cloud service where we have to run them has some issues. K8s works, let's say, without huge issues, but the other managed services have outages. And running our own service in k8s is expensive, first because you need bigger resources for k8s itself, and it still doesn't resolve the whole issue.

There is only one reason why we use k8s: we have 2 envs (prod, non-prod) and together we run about 30-40 Elixir apps there, so we use it only for orchestration. We use Flux, so GitOps resolves a lot of issues for us. But I really hate it. We used Docker Swarm 4 years ago and it was bad, and I've never used Nomad, so I'm not sure about it. We don't have our own SysOps department, so we use a managed solution from IBM, which is the only cloud provider allowed by our regulator. We could hire some SysOps company/contractor to create some infra, but it would have to be maintained and we don't want to care about that. So it looks like we don't have other options. We cannot use AWS, btw.