Caching: ETS, Mnesia, Redis

mykewould · June 23, 2017, 8:58pm

Hello,

I wanted to walk through the my caching experience in Elixir and I’m hoping for any general feedback from the community around their experiences or thoughts on these systems.

My initial use case was pretty simple. We were running a single instance Phoenix application that was hitting a bottleneck on the database. The biggest load issue we were having came when we did complex searching on the database with 1000+ req/s, a lot of these searches returned similar data so we knew we could cache that data once it got returned and save the database subsequent hits. I think this is a pretty common use case. You have a database, you don’t want to do thousands of round-trips to your db in under a second, you want to cache as much as you can within a second, minute, hour, etc (however long makes sense). I found ETS, specifically ConCache, to be a great tool for the job here. I was able to cache significant amounts of data and eventually the load was CPU bound on the Phoenix server, rather than throughput on the DB.

ETS solved the problem as long as we ran a single instance of the Phoenix server. The issue was that we didn’t always run a single instance, the company I worked for used a Docker swarm system to manage and deploy multiple microservices (using different languages) and they preferred to scale the system by increasing the instances of the Phoenix application. Once this happened each instance had its own ETS caching, and depending on which instance was hit the data could be cached or not. The results were still better than no caching, but not nearly as good as they could be if they shared the same cache. The answer here, and especially since other teams (using different languages) wanted to share the cache, was to switch to something like Couchbase or Redis. Which is what we ended up doing.

Putting aside the different languages problem for a second, I’m curious if anyone has done shared caching with ETS or Mnesia specifically around the typical database use case? At that point is it better to reach for a tool like Mnesia? I know Mnesia is built for distributed caching, but it comes packed with so much functionality I wasn’t sure if I was reaching for a gun in a knife fight. I’d love to hear any general thoughts/musings/approaches on this.

If anyone has any experience with keeping shared caching in the BEAM ecosystem I’d like to hear about it. Also curious to know if anyone has done shared caching with BEAM tooling allowing access to other programming languages in a microservice framework. Seems like it would be a lot of work, when something like Redis would be easy to implement out of the box.

OvermindDL1 · June 23, 2017, 10:33pm

In my general usage I have 2 cache’s:

Local Cache: This is like ConCache like you are doing.
Remote Cache: Materialized views in PostgreSQL tuned for simple key lookup

Beyond that I’m not sure what is the point in adding something else like Redis. Either you want data pretty often in which case a local cache is fine, or you are grabbing something expensive that you’d prefer be cached, which a materialized view is awesome for. Something like redis is like a local cache (I.E. something you want often), and yet you still have a network connection, which seems to defeat the point to me…

minhajuddin · June 23, 2017, 11:01pm

We use redis to cache a lot of data from searches. And have been running into issues because of it. The main reason being that we end up calling a lot of redis commands to load the data and the serialization/deserialization takes a good chunk of time too. This even after using Redix’s pipeline feature. My advice would be to use ets tables, it is fine if the data is duplicated (that is if you can afford the memory).

Another thought that I have toyed with is connecting the phoenix nodes and sharding the cached data on ets. This would highly depend on how you structure your data. But the general idea would be:

Shard on some key in ets, which gives you a node reference
Dispatch an MFA to that node so that the actual computation happens there
Get back the result from the above computation and use it

mykewould · June 24, 2017, 12:02am

Ok, so you do something close to this (links below) approach to create and leverage this functionality within your Elixir applications? This assuming you are using a data store that has materialized view functionality.

https://medium.com/@kaisersly/callbacks-in-the-database-postgres-5bb4ae3f4fd9
https://medium.com/@kaisersly/materialized-views-in-ecto-8887bc89efa5

How are you accounting for typical data expiration issues? For instance if I have data cached on instance y, and then I update the data through instance x depending on my caching strategy I either have to have x tell instance y or I have to wait until instance y renews cache. I may be missing something. This is where shared caching was helpful. In our specific case we weren’t using a datastore that allowed for materialized views, although I can see how that would be ideal.

This is really interesting. What use case do you have in mind for this? Have you looked into Mnesias table fragmentation for this implementation?

jeremyjh · June 24, 2017, 12:09am

Its surprising to me that there is no killer solution native to the Erlang ecosystem already. We should be able to crush this without resorting to memcached or redis. The closest thing I found (I was looking more for distributed session store) is libm_kv - but it does not have TTL functionality presently.

mykewould · June 24, 2017, 12:31am

I agree. Looks like that libm_kv uses Mnesia under the hood. I’d love for someone with experience around Mnesia to jump in here. I keep thinking that the answer to some of these issues is Mnesia. For instance I found this Mnesia distributed caching pattern implemented in Erlang and I implemented a similar solution in Elixir here. It’s very simple, but I’d like to know if Mnesia is even the right direction to go in for something like this. What are its limitations compared to something like memcache?

cmkarlsson · June 24, 2017, 2:19am

Hi Michael,

We’ve used mnesia as a distributed cache quite a lot and it is good at
it. However this is from an erlang code base and not elixir. I’ve heard
mnesia is not quite as natural to use in elixir due to use of records.
In the simplest case like a cache you can use it fine
with just tuples, i.e only key/value instances in the database and don’t
have to define any records.

Potential pain points with mnesia.

Adding/removing nodes are not seamless and you have to write a little
bit of logic to it.
The more nodes you have the slower writes are. If you have an almost
static number of nodes (i.e not adding/removing nodes all the time)
and not too frequent writes it is fine. Writes are not super slow but
they can become a bottle neck. I think we had something like 3000 w/s
on a three node cluster.
With distributed mnesia you run the risk of having node-splits which
sort of can be handled (have a look at uwiger/unsplit) but the best thing is
if you can afford to re-populate data from the database in case that
happens.
If you have lots of data, when starting up mnesia sometimes (or
everytime? can’t remember) and the data it holds is deemed out of
data it copies the entire data set from another node). This can lead
to bursts in traffic on startup.
Read time is blazingly fast (like ets). This can’t be achieved with
any external data cache
You need to keep the cache data-set in memory. Mnesia do support table
fragmentation so you can potentially spread it out over multiple node
but it is quite a manual task to implement.

So a quick re-cap. If your data is temporary and can be recreated from
another data-source; You don’t have lots of nodes being added/removed
and you are on a stable network to avoid netsplits mnesia shines
as a distributed cache to your database

Cheers,
Martin

jeremyjh · June 24, 2017, 3:08am

Thanks cmkarlsson. What attracted me to libm_kv is it handles a bit of the plumbing you mention - assuming you want your data replicated to all nodes it will ensure data is copied as nodes join and will re-achieve consistency after a netsplit. But it does not seem to support partitioning/sharding (even though mnesia does) and caching use cases need both TTL and LRU pruning.

It seems like there is an opportunity for a project in this space and I do think basing on mnesia is the way to go.

DianaOlympos · June 24, 2017, 7:23am

I may be completely dumb but… Why not stop thinking in term of stateful app ?

Link all your instance together inside a cluster, and the difference between local and non local cache disappear. You can use something like riak_core as you key holding cacheholder and bob’s your uncle.

Other solution : do it in the db, which is also interesting.

Final and real solution : use the db as a cold storage instead of a constant source of truth and keep the data and the work on the data in memory in the app. The db is then a snapshot/save and not your main thing anymore, so you get rid of cache invalidation problems.

At one point, once you use distributed app and you have cache invalidation problems, it probably means that your architecture is wrong…

stateless considered harmful

cmkarlsson · June 24, 2017, 7:49am

stateless considered harmful

I don’t know if you meant to say “stateful considered harmful” but
considering erlang/elixir excels at running stateful applications that
stay up you might have made up a new slogan for elixir advocates

Qqwy · June 24, 2017, 8:53am

I think @di4na is pointing at the fact that servers in many languages fire and forget all the reauests that they do, with the only way of keeping state being in the database, therefore tightly coupling your DB and your application and your application often needing to do many requests to the database over and over again for each requests.

Caching tries to somewhat reduce the symptom of your application becoming slower as a result, but it is not a treatment for the underlying cause: the database being considered the single source of truth and therefore always needing to keep the DB ‘in the loop’.

dom · June 24, 2017, 9:48am

Did you actually try this? I played with riak_core before, but found it extremely low-level, it’s more like a toolkit to implement a database from scratch than something meant to be embedded inside a regular application. It’s also unmaintained and won’t run on OTP-18, which is the minimum requirement for Elixir.

DianaOlympos · June 24, 2017, 12:57pm

There are versions by project FIFO that does run on OTP 18+

And yes if all you need is a distributed key value store in memory, it will work. Of course it can do more, but if all you need is to spread some ETS table or stuff like that, it is not hard.

OvermindDL1 · June 24, 2017, 1:56pm

I’m using PostgreSQL. I use materialized views to cache data that is expensive to calculate but is often needed, like I have a query that joins 12 tables (not my design) that takes a good 2 seconds to complete (better then the original SQL they gave me, by far!), I use a materialized view to cache it with a simple key lookup that auto-updates based on how the back-end data updates. This would not be a good use of caching in ETS or Redis as they would not auto-update based on the data changing and when it elapses it hits that 2-second call again, so I do it pre-emptively instead. PostgreSQL is awesome, I’m not sure why any other database would be used for ‘almost’ anything.

I use CacheX (it had more useful-for-me features than ConCache at the time I chose it, unsure how they both rank currently) and it seems to crush it well for me.

Mnesia is perfectly fine as a ‘distributed’ cache though, but I really just have each node manage it’s own cached data 95% of the time so I don’t have to deal with distributed netsplits like you’d have to with Mnesia or a distributed Redis cluster (and if using only a single Redis, hell just use PostgreSQL anyway).

mykewould · June 24, 2017, 6:46pm

Martin,

Thank you for sharing your experience.

Why did you or your team opt for a distributed caching system with Mnesia? What were the specifics that you wanted with Mnesia that other solutions couldn’t provide?

Do you have any examples or pointers for this? My simple lib joins nodes together, but only as nodes start up and only if it has a master list of nodes. It’s not robust, and it feels clunky/naive.

jeremyjh · June 24, 2017, 6:58pm

Yes for a lot of applications - even ones with lots of nodes - this makes sense - just let each node build and manage its own cache and you’ll still have lots of cache hits. But that isn’t a distributed cache. cachex does look like an excellent system.

My original interest in this space was a distributed session store - which is another use case where people turn to memcached and redis today and that is why there are Plugs to use both of those as backing stores. For this use case there must be strong consistency. I don’t see any great option for it. Building something backed by mnesia or ets - possibly leveraging riak_kv - does seem worthwhile to me.

OvermindDL1 · June 24, 2017, 7:18pm

PostgreSQL is what I use there. ^.^

cmkarlsson · June 24, 2017, 10:51pm

Why did you or your team opt for a distributed caching system with Mnesia? What
were the specifics that you wanted with Mnesia that other solutions couldn’t
provide?

We wanted our application to be self-contained. We ship it to customers
who sometimes have limited technical ability and the absensce of an ops
team.

Also, we wanted as fast response time as possible and if you hvae to go
outside of the erlang vm memory space it adds latency.

Do you have any examples or pointers for this? My simple lib joins nodes
together, but only as nodes start up and only if it has a master list of nodes.
It’s not robust, and it feels clunky/naive.

That is sort of the problem with mnesia and adding nodes It feels
clunky naive

We have a “cluster” which has to be initialised. To join a node a new
node tells the cluster to add them.

:rpc.call(cluster_node, :mnesia, :change_config, [:extra_db_nodes, [node()])

Then we persist the schema on the new node

{:atomic, :ok} = :mnesia.changes_table_copy_type(:schema, node(),:disc_copies)

In addition to this we have a “cluster” table in mnesia which keeps
track of the state of the cluster. It works for our use case but as you
say it is clunky and generally cluster need to be in good state to do
operations on it.

To remove a node you first need to stop mnesia on the node to remove and
then you can use:

:mnesia.del_table_copy(:schema, node_to_remove)

hwinkel · November 26, 2017, 4:51pm

does anybody ever have looked to https://github.com/NetComposer/nkbase which is a more high level DB based on riak_core.

OvermindDL1 · November 27, 2017, 5:19pm

No updates in 2 years… Is it ‘done’ or is it ‘dead’?