Best options for Elixir-only persistent distributed cache?

budgie · July 27, 2025, 1:00pm

Building something that uses realtime features with LiveView. There’s state that’s managed and pushed to call participants. Right now I’m using Cachex, but if I do a deploy while someone is in a call, all the state will be lost?

What’s the best way to have temporary in-memory storage that persists across restarts without resorting to an external service, e.g. memcached or redis?

dimitarvp · July 27, 2025, 1:06pm

Got no answer for you at the moment but why are you using an ephemeral state that can be lost at any time (cache) as a source of truth / database?

budgie · July 27, 2025, 1:14pm

It’s not a source of truth really. It’s for temporary call/session data that doesn’t really make sense to store after the call finishes.

Schultzer · July 27, 2025, 1:20pm

Mnesia is the answer.

lud · July 27, 2025, 1:23pm

Depends on how you deploy.

You can build a distributed cache that synchronizes aggressively. When you shut down a node to roll out a new one, the stopping node stops accepting external input but waits until all the cache items it is currently processing are synced to the other nodes, and the new node does not start accepting input until it is in sync with all other nodes.

k8s can help with that, but it’s probably a lot of pain to do properly.

Otherwise you can use PubSub: all cache writes go to PubSub and are distributed to all other nodes. When a new node starts, it subscribes to PubSub without processing messages yet, selects another node, gets a full copy of the cache to bootstrap its local cache, and then processes the incoming messages. Writes must be timestamped so that old writes can be ignored, because the node you are copying from is already listening to messages sent during the bootstrap sequence and the new node will receive them too. (If you start listening only after copying the cache, you will miss some messages.)
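To make the timestamp rule concrete, here is a minimal sketch of the merge step only. The module and function names (`LwwCache`, `apply_write/2`, `bootstrap/2`) are made up for illustration, the PubSub wiring is left out, and it assumes writes carry comparable timestamps (for example from `System.system_time(:millisecond)`), ignoring clock skew between nodes.

```elixir
defmodule LwwCache do
  @moduledoc "Hypothetical last-write-wins map; all names here are illustrative."

  # Entries are stored as key => {value, timestamp}.
  def new, do: %{}

  # Apply a write only if it is newer than the entry we already hold.
  def apply_write(cache, {key, value, ts}) do
    case cache do
      %{^key => {_old_value, old_ts}} when old_ts >= ts -> cache
      _ -> Map.put(cache, key, {value, ts})
    end
  end

  # Bootstrap: start from a snapshot copied from a peer, then replay the
  # writes that were buffered while the copy was in flight. Stale or
  # duplicate writes are dropped by apply_write/2.
  def bootstrap(snapshot, buffered_writes) do
    Enum.reduce(buffered_writes, snapshot, &apply_write(&2, &1))
  end
end
```

Because older writes are ignored, replaying a message that the snapshot already contains is harmless, which is what lets the new node subscribe before it copies.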

All of this looks hard to get right, are you sure you cannot use Redis?

nerdyworm · July 27, 2025, 3:48pm

The route I would go is redis, memcached, or postgres. The alternatives are actually much more of a pain in comparison. They are quite fun though, so if your project needs a little pizzazz then read on.

If you are running more than one node:
Khepri (GitHub: rabbitmq/khepri), a tree-like replicated on-disk database library for Erlang and Elixir. It is Raft-replicated, happily syncs with all nodes, and handles cluster restarts because it persists the data to disk. If the library’s tree style doesn’t suit your problem then you can use the Raft library directly: Ra (GitHub: rabbitmq/ra), a Multi-Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system. There is a simple key/value store example in the docs.

The next option is mnesia. This option works well if you have like two nodes and are not doing rolling k8s-style deploys. It’s easy to get started with, but I’d put it among the options that are harder to handle when real-world disasters hit. See rabbitmq’s docs/history/issues on how mnesia fails in production to get a better understanding of the… rabbit hole.

Another option is to just dump your cache to disk or s3 storage before the machine stops and then restore it from disk/s3 when a new machine starts. Simple, but the downside is that you’ll eventually have some inconsistent data to deal with. I’d consider this route if your cache size is small.
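A rough sketch of the dump/restore idea, using a plain ETS table as a stand-in for the cache. `CacheSnapshot` and the file path are hypothetical; `:ets.tab2file/2` and `:ets.file2tab/1` are the standard ETS calls. It assumes the old and new machine can see the same file (or that you copy it to and from s3 yourself).

```elixir
defmodule CacheSnapshot do
  # Hypothetical helper; the module name and paths are illustrative.

  # Write the whole table to disk, e.g. from a shutdown hook.
  # Returns :ok or {:error, reason}.
  def dump(table, path) do
    :ets.tab2file(table, String.to_charlist(path))
  end

  # Recreate the table from the file on boot.
  # Returns {:ok, table} or {:error, reason}.
  def restore(path) do
    :ets.file2tab(String.to_charlist(path))
  end
end
```

Usage would look something like `CacheSnapshot.dump(table, "/tmp/call_state.ets")` before shutdown and `{:ok, table} = CacheSnapshot.restore("/tmp/call_state.ets")` on startup. Anything written between the dump and the actual stop is lost, which is where the inconsistent data comes from.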

It could also be possible to just use Phoenix.Presence (Phoenix v1.7.21) or Phoenix.Tracker (Phoenix.PubSub v2.1.3). Pretty sure if you do a rolling deploy you’ll be able to maintain any active cache state there… but you’d have to see if that works for your use case.

If you are running a single server or the caches don’t really need to be synced between two physically different nodes then sqlite may be a good fit for surviving restarts.

This is actually one of those “elixir/erlang should make this problem simple” cases, but what happens is that elixir just fast-forwards us to the realization that it’s a very difficult problem with no one-size-fits-all solution.

Let me know if this was helpful. I’m throwing a bunch of stuff at the wall to see what sticks.

Cheers,
Ben

budgie · July 28, 2025, 10:57am

@nerdyworm @lud

Yeah thanks for the reality check guys. I think redis is ultimately what I’ll go with. Love the idea of having them replicated cluster-internally but don’t want to have to add persistent block storage or a volume mount to the app containers. Keep Cachex for read-through queries to the DB probably.