Internal Elixir Caching Alternatives to Redis in Multi-Node Setups

keatz55 · November 12, 2024, 3:47pm

For Elixir users who replaced or opted out of Redis caching in a multi-node setup, what internal Elixir-based caching approach did you use?

dimitarvp · November 12, 2024, 4:21pm

Do you mean simply a single-node solution enabled by libraries, or do you mean whether we rolled our own multi-node solution?

keatz55 · November 12, 2024, 4:29pm

@dimitarvp Thanks for the response, and apologies for the lack of clarity. I mean that you removed Redis, or never used it, and rolled that functionality into your existing Elixir nodes via Nebulex, Riak Core, or your own solution.

dimitarvp · November 12, 2024, 4:52pm

Thank you for following up. Here’s my take, which has proven unpopular at times:

We simply gave up the idea of a distributed cache altogether. To me and the several teams I’ve been a part of that needed this (or more accurately: believed they needed it), it turned out to be simply wasteful. We viewed it as replacing one DB with another, so we just opted for per-node caches. Solutions varied from naked ETS to Cachex to ane, and even to our own Agents / GenServers holding on to specific data states.

Interestingly enough, this barely changed our cache hit / miss stats. It did in places, but there we opted for pre-warming caches on each node (based on statistics about which keys were hit the most), and after that we practically could not find a difference in the hit / miss ratios with Redis versus after removing it.
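For readers who want to see what a per-node cache with pre-warming can look like, here is a minimal sketch on naked ETS. It is not the code from the projects described above; the module name `LocalCache`, the table name, the TTL handling, and the `loader` functions are all hypothetical and chosen only for illustration.

```elixir
defmodule LocalCache do
  @moduledoc "Per-node cache sketch: one named ETS table per BEAM node, no cross-node sync."

  # Hypothetical table name, used only in this sketch.
  @table :local_cache

  # Create the table once at startup. Call this from a long-lived process,
  # because an ETS table is destroyed when its owner process exits.
  def init do
    if :ets.whereis(@table) == :undefined do
      :ets.new(@table, [:set, :public, :named_table, read_concurrency: true])
    end

    :ok
  end

  # Store a value with a time-to-live in milliseconds.
  def put(key, value, ttl_ms) do
    expires_at = System.monotonic_time(:millisecond) + ttl_ms
    :ets.insert(@table, {key, value, expires_at})
    :ok
  end

  # Return {:ok, value} on a hit, :miss when the key is absent or expired.
  def get(key) do
    now = System.monotonic_time(:millisecond)

    case :ets.lookup(@table, key) do
      [{^key, value, expires_at}] when expires_at > now -> {:ok, value}
      _ -> :miss
    end
  end

  # Read-through helper: on a miss, compute the value and cache it.
  def fetch(key, ttl_ms, loader) when is_function(loader, 0) do
    case get(key) do
      {:ok, value} ->
        value

      :miss ->
        value = loader.()
        put(key, value, ttl_ms)
        value
    end
  end

  # Pre-warm: eagerly load the keys that statistics say are hit most often.
  def prewarm(hot_keys, ttl_ms, loader) when is_function(loader, 1) do
    Enum.each(hot_keys, fn key -> put(key, loader.(key), ttl_ms) end)
  end
end

# Usage: each node runs this at boot with its own list of hot keys.
LocalCache.init()
LocalCache.prewarm(["user:1", "user:2"], 60_000, fn key -> "value for " <> key end)
LocalCache.get("user:1")
```

Expired entries are only ignored on read here, never swept, so a real setup would likely add periodic cleanup or use a library such as Cachex instead.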

So if I were your consultant and you asked me, 99% of the time I’d tell you “You don’t need a distributed cache, use in-process per-node caches”.

Disclaimer: I worked on projects with substantial traffic but not on the scale of, say, Amazon or Walmart. At those scales actual single-source-of-truth caches probably make sense. But I found that for everything below that they don’t.

keatz55 · November 12, 2024, 6:06pm

@dimitarvp Thanks for the context. Agreed, definitely not needed in most cases. However, there are still cases that do warrant it. I’ve used Riak Core to replace Redis-specific functionality in Elixir nodes in the past. It worked great, but rolling deploys are slow/brittle. Really just trying to have an open discussion about whether anyone else has done this and what their setups look like.

cevado · November 12, 2024, 6:15pm

my go-to solution for an in-memory key-value store is nebulex. since it’s a multi-node solution, you’d be required to cluster your nodes.
if you use redis for pubsub, i’d rather use phoenix pubsub with pg, for example.
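To illustrate the pubsub point, here is a rough sketch that uses OTP's built-in `:pg` process groups directly (available since OTP 23). This is not the Phoenix.PubSub API; the module name `ClusterPubSub`, the `{:topic, topic}` group naming, and the message shape are assumptions made for the example. On clustered nodes that all run the default `:pg` scope, group membership is shared across the cluster.

```elixir
defmodule ClusterPubSub do
  # Start the default :pg scope if it is not already running.
  def start do
    case :pg.start_link() do
      {:ok, _pid} -> :ok
      {:error, {:already_started, _pid}} -> :ok
    end
  end

  # Subscribe the calling process to a topic.
  def subscribe(topic), do: :pg.join({:topic, topic}, self())

  # Send a message to every subscriber of the topic, on any connected node.
  def broadcast(topic, message) do
    for pid <- :pg.get_members({:topic, topic}) do
      send(pid, {:pubsub, topic, message})
    end

    :ok
  end
end
```

A typical use with per-node caches would be broadcasting something like `{:invalidate, key}` so every node can drop its local copy.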

but it’s good to keep in mind that “caching” is a broad umbrella term that can mean a lot of things depending on context and solution that you want.

cevado · November 12, 2024, 6:24pm

i’d really avoid riak-core-lite if partitioning is not a requirement for you. for cachex and nebulex you don’t need rolling deployment if you’re replicating the entire cache on all nodes.
afaik you’re only going to need rolling deployment if you’re dealing with partitions, because every new node requires the partitions to redistribute data between them, and that is a slower process compared with just dumping the entire ets table to another node.

keatz55 · November 12, 2024, 7:25pm

@cevado Appreciate the response and thanks for sharing your setup. Though rare, I am referring to partitioning scenarios for caching/KV purposes. Nebulex is great because it does have a partitioning adapter. However, if a node goes down, those key/values are gone. Riak avoids this with a replication factor, which allows for failover while still minimizing memory usage with partitioning. Riak uses Paxos, which has limitations that add to rolling deploy slowness/fragility. ScyllaDB (not a cache, but still a distributed system) opted for Raft in order to increase consistency/performance. I’ve been tempted to develop a distributed KV/cache implementation that uses Raft leader election and measure the differences. Still at the mercy of how fast state can rebalance though, like you said. Ultimately just curious what others are doing, so thanks.
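As a rough illustration of partitioning combined with a replication factor, the sketch below picks several owner nodes per key using rendezvous (highest-random-weight) hashing. This is a swapped-in technique for brevity, not how Riak Core or Nebulex place data; the module name `ReplicaPlacement` and the node names are hypothetical.

```elixir
defmodule ReplicaPlacement do
  # Rank all nodes for a key by the hash of {key, node} and take the top
  # `replicas` as owners. The node itself is the tie-breaker, so the
  # result does not depend on the order of the input list.
  def owners(key, nodes, replicas) do
    nodes
    |> Enum.sort_by(fn node -> {:erlang.phash2({key, node}), node} end, :desc)
    |> Enum.take(replicas)
  end
end

# Usage: with a replication factor of 2, each key lives on two nodes.
nodes = [:"a@host", :"b@host", :"c@host", :"d@host"]
[primary, secondary] = ReplicaPlacement.owners("user:1", nodes, 2)

# If the primary goes down, the old secondary is ranked first among the
# remaining nodes, so the key is still served by a node that holds a copy.
[^secondary, _new_replica] = ReplicaPlacement.owners("user:1", nodes -- [primary], 2)
```

Each node only stores the keys it owns, so memory use stays roughly at (replication factor / node count) of the full data set, but a new replica still has to be filled after a failure, which is the rebalancing cost mentioned above.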

cevado · November 12, 2024, 7:57pm

one that i know also uses raft, though i personally have never used it, is khepri, but their docs don’t mention partitioning data.
i think that for your scenario riak-core-lite is most probably the best solution to have inside your own application. maybe a solution is to have riak only on specific nodes (reducing the size of the riak cluster, so only a few nodes would require a rolling deploy) and to get data from them using rpc in the cluster (this would save some encoding and network time compared with using an external tool like scylladb).

but at this point, given your constraints, it might be time to really consider an external tool like scylladb. i think that if you feel like you’re fighting with the solution all the time, that’s a signal that the solution is not working in your favor anymore.

keatz55 · November 12, 2024, 8:21pm

Apologies for the lack of clarity, I wouldn’t use Scylla. I’m just referencing how they opted for Raft for consistency/performance instead of the gossip/Paxos implementation Cassandra uses. For greater context, Scylla is a rewrite of Cassandra in C++. I was just mentioning how I’m tempted to develop an Elixir solution that tries to get more performance using a different consensus algorithm. Also, this isn’t mission critical. This is more for fun and for seeing what other folks do or are experimenting with.

Khepri looks interesting, thanks for the share. Also, agreed separation of concerns in a cluster can further increase performance. Thanks for the feedback.

dimitarvp · November 13, 2024, 9:03am

Somebody just posted findings about ETS vs. Redis caching: Elixir Blog Posts - #1183 by ananthakumaran