How to use ETS as a global cache?

Background

I have an Elixir application that runs on 2 machines. Each machine receives requests on a round-robin basis.

Both machines are connected to a single Redis instance, which works as a global cache.
If machine A receives a request, it caches the request/response in Redis. Should machine B get the same request, it won’t need to re-calculate the answer.
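
For reference, the read-through flow is roughly this (a sketch only, assuming the Redix client; `compute_response/1` is a made-up placeholder for the real work):

```elixir
defmodule MyApp.RedisCache do
  # Look the request up in Redis; on a miss, compute the answer and cache it
  # with a TTL so both machines can reuse it.
  def fetch(conn, request_key) do
    case Redix.command!(conn, ["GET", request_key]) do
      nil ->
        response = compute_response(request_key)
        Redix.command!(conn, ["SET", request_key, response, "EX", "60"])
        response

      cached ->
        cached
    end
  end

  defp compute_response(_key), do: "expensive result"
end
```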

Questions

I know that ETS is usually used for caching.

However, such a cache is local to machine A and local to machine B.

My goal here would be to replace Redis with ETS and achieve a global cache for my Elixir applications. So far I have not been able to find any article detailing such a setup, so I wonder:

  • Is it possible to use ETS as a global cache via HTTP, the same way Redis is being used in the example above?
  • If so, how can I do it?
  • If it is possible, is it worth the effort, or does the community advise using Redis instead (maybe because it is easier to set up, for example)?

It’s possible. You could always build an HTTP endpoint around ETS and serve HTTP requests that way. But none of that comes “built in” and I doubt it’s worth it in the general case.
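
For illustration, here is roughly what such an endpoint could look like with Plug (everything here is a sketch: the table name and routes are made up, and you would still need to create the ETS table and start an HTTP server such as Bandit or Cowboy):

```elixir
defmodule CacheEndpoint do
  use Plug.Router

  plug :match
  plug :dispatch

  # GET /cache/:key -> look the key up in a named, public ETS table created
  # elsewhere with :ets.new(:global_cache, [:named_table, :public, :set]).
  get "/cache/:key" do
    case :ets.lookup(:global_cache, key) do
      [{^key, value}] -> send_resp(conn, 200, value)
      [] -> send_resp(conn, 404, "not found")
    end
  end

  # PUT /cache/:key -> store the raw request body under the key.
  put "/cache/:key" do
    {:ok, body, conn} = Plug.Conn.read_body(conn)
    :ets.insert(:global_cache, {key, body})
    send_resp(conn, 204, "")
  end
end
```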

On the BEAM you also have Mnesia, which lets you share something like an ETS table across the cluster, but it’s not quite straightforward to use. That’s local state synchronized across the cluster, not access to one shared resource.

Without more detail on why you need a global cache, I’d probably suggest staying with Redis, since there’s no reasonable way to weigh the trade-offs against other options.

2 Likes

Yes, for sure.

If machine A and machine B are both Elixir machines, you could use either Cachex or Nebulex; both take advantage of a cluster of machines to distribute ETS tables across nodes.

Personally, I think that if you have an Elixir application, it’s simpler and cheaper to cluster your nodes and use ETS than to run Redis.

https://hexdocs.pm/cachex/distributed-caches.html
https://hexdocs.pm/nebulex/Nebulex.Cache.html#module-distributed-topologies
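
To give an idea of the setup, a replicated Nebulex cache is roughly this much code (a sketch based on the linked docs; adapter and option names can differ between Nebulex versions):

```elixir
defmodule MyApp.Cache do
  use Nebulex.Cache,
    otp_app: :my_app,
    adapter: Nebulex.Adapters.Replicated
end

# Add MyApp.Cache to your supervision tree and cluster the nodes
# (e.g. with libcluster). Then, on any node:
#
#   MyApp.Cache.put("request-123", response)  # write is replicated to all nodes
#   MyApp.Cache.get("request-123")            # read is served from the local copy
```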

If you don’t want to bring in an external dep, the BEAM ships with Mnesia, which sits on top of ETS and adds a consistent distributed database.

However, when two nodes stop communicating due to a network fault, Mnesia can be tricky to recover.
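
For the record, a minimal Mnesia-as-cache sketch looks like this (assuming the nodes are already connected; table and key names are made up, and this ignores schema and recovery concerns):

```elixir
# Run once, with all nodes connected: an in-memory table replicated on every node.
:mnesia.start()

{:atomic, :ok} =
  :mnesia.create_table(:cache,
    attributes: [:key, :value],
    ram_copies: [node() | Node.list()]
  )

# Write on one node...
:ok = :mnesia.dirty_write({:cache, "request-123", "cached response"})

# ...and read it from any node in the cluster.
[{:cache, "request-123", value}] = :mnesia.dirty_read(:cache, "request-123")
```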

This is what http_cache_store_disk and http_cache_store_memory do when cluster_enabled is set to true: they exchange cached responses using distributed Erlang and store them either in an ETS table or on disk (in the latter case, metadata is still kept in memory).

They take into account that nodes can have different constraints, such as available disk space or memory, and therefore each node manages its cached responses autonomously. A cached response can be discarded from one node and still be available on another.

If you want to cache HTTP responses from Phoenix / Plug, you can take a look at plug_http_cache.

4 Likes

Is fetching a cached value from another node really worth it? Is it crushingly faster?

I have my doubts. Would love to see actual metrics.

2 Likes

So your hypothesis is that a global Redis instance will always be faster than a Cachex cluster, correct?
I am not sure how the cluster works, but if data is replicated on all the cluster machines, reads should be faster, although you now have other problems to deal with, such as split brain.

1 Like

What about running separate caches on both instances? You will waste a little more RAM, but you will get rid of a lot of complexity, which IMO is perfect, as RAM is dirt cheap these days.
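
For what it’s worth, a per-node read-through cache is tiny; a sketch with plain ETS (the table name and the compute function are made up):

```elixir
defmodule LocalCache do
  # Create the table once, e.g. from your application's start/2 callback.
  def init do
    :ets.new(:local_cache, [:named_table, :public, :set, read_concurrency: true])
  end

  # Return the cached value, or compute and store it on a miss.
  def fetch(key, compute_fun) do
    case :ets.lookup(:local_cache, key) do
      [{^key, value}] ->
        value

      [] ->
        value = compute_fun.()
        :ets.insert(:local_cache, {key, value})
        value
    end
  end
end
```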

3 Likes

No, my hypothesis is that having a separate local cache for each node is going to work best. People really over-optimize for the wrong things, I feel (and some of my practice has proven this hypothesis correct; accentuating the “some” here). Having 3 separate nodes each keeping its own DB query cache is honestly not a big deal at all, especially if the cached query result should last at least 10 seconds; even under those conditions a local cache is more than enough.


To me the whole idea of “fetch cache from the network” is just hilarious in general, even though I’ve witnessed cases where it was still worth it (we’re talking results from SQL materialized views that took 10+ seconds to calculate; in all other cases, however, a distributed cache is just technology triumphing over itself, and over common sense as well).

1 Like

No need to fetch from the network upon user request if each node has its own local cache and preemptively fills it with data from other nodes :wink: The only role of a cache, after all, is to be full, and full of the “hottest” objects (HTTP responses, …).

This is what http_cache_store_* do when clustering is enabled: each node aggressively downloads HTTP responses it doesn’t know about, to fill up quicker and hold the latest objects. Nodes also request 1_000 objects (by default) from other nodes at startup to warm up.

So how does that work exactly? At app startup one node warms up its cache and all other nodes download that warmed cache from it before they start?

1 Like

In the first 20 seconds after an http_cache_store_{disk,memory} store starts, whenever another node shows up, a warm_me_up message is sent to that node, and the latter sends back the n hottest stored HTTP responses.

The result will depend on your deployment type. If you’re doing rolling deployments, the nodes still up will feed new nodes with HTTP responses, and those nodes will benefit from the same mechanism once they are eventually restarted, receiving back the HTTP responses they sent to the first restarted nodes.

2 Likes

This is a super useful explanation, thank you.

Last question: do the nodes sync caches after the warm_me_up message has been processed and the hottest requests cached?

BTW I’d definitely put such a higher-level and very accessible explanation somewhere in the docs. Very often caching libraries / programs don’t make these things explicit and I usually end up skipping them for that exact reason.

1 Like

They constantly do that: each node caching a new HTTP response sends the {CacheKey, Timestamp} to all other nodes, which request the full HTTP response if they don’t already have it.
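
This is not the actual http_cache_store_* code, but the notification pattern described above boils down to something like the following over distributed Erlang (module and message names are made up):

```elixir
defmodule CacheNotifier do
  # After caching a response locally, tell every other node about it.
  # A process registered under __MODULE__ on each node would receive the
  # message and request the full response if it doesn't already have the key.
  def broadcast_new_entry(cache_key, timestamp) do
    for node <- Node.list() do
      send({__MODULE__, node}, {:cached, cache_key, timestamp, node()})
    end

    :ok
  end
end
```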

The thing is that the docs are scattered across different libs (http_cache, plug_http_cache and http_cache_store_disk for instance), but I’ve described the overall capabilities in a blog post.

1 Like

Hm, here is where I question the usefulness of this. It’s a full HTTP request to warm up a cache entry that might take less time to fetch from the DB, so to me it gets a bit crazy at this point.

Admittedly I’ve rarely worked on systems that truly derived value from caches – still, almost never in my career were distributed caches useful. It was at least as efficient to have 3+ nodes each having a completely local RAM cache and hitting the DB and/or 3rd party APIs to populate those local caches. We even had metrics dashboards proving it.

This is why I am super curious about people using distributed caches: do you have metrics showing “time used to cache the thing” vs. “time to distribute the thing in caches” vs. “time to access the cached thing”?

Just to make things clear: if the response is not cached on nodeA, it’ll hit the backend (DB, etc.) and not request the response from other nodes.

Sharing of cached HTTP responses is done behind the scenes, between the stores. HTTP response sharing is best effort, like everything else (there’s no guarantee a cacheable response will be cached; see the blog post).

Now, clustering is disabled by default, and indeed I’ve had no feedback from it being used in prod, so I don’t know for sure that this response sharing is really useful. My experience in HTTP caching is with a very slow MongoDB/Elasticsearch backend, where the more responses are cached, the better.

2 Likes

I guess the best thing out of this is that you can quickly warm up a node that was restarted/redeployed, which can be quite useful if you are doing periodic small deploys (as long as the redeploys don’t happen at the same time on both nodes).

2 Likes

Weird, the documentation of both Cachex and Nebulex explains exactly what to expect from their distribution strategies. Cachex has a detailed explanation of what happens locally and what needs to go over the network. Nebulex includes different strategies to accommodate different needs.

For example, here is the explanation of the replicated cache from Nebulex; it clearly states that distribution happens on writes and that all values are replicated:
https://hexdocs.pm/nebulex/Nebulex.Adapters.Replicated.html

I’m not sure if you’re talking about caches in Elixir or caches in general. :thinking:

2 Likes

Oh I meant distributed caches in general, plus @tangui’s libraries. Thanks for the link, I’ll read through it.