I have a question about a fairly low-level implementation topic regarding copy/concurrency semantics; let me explain the situation briefly first:
I have some (~10) ETS tables (set/unordered) that contain a few hundred thousand records each, where the primary keys are either plain integers or tuples of integers (mimicking compound keys)
the values in the tables are mostly HashSets of integers (“IDs”)
the tables are owned by dedicated GenServer processes, so that only the owning process can update data within its ETS table, but all other processes can read in parallel from it
the owning processes subscribe to Phoenix PubSub events in order to update individual ETS records in realtime as needed
there is a general “search” function that takes several keys to look up in the ETS tables, each lookup returning a HashSet of IDs; the final result set is the intersection of all these ID sets (i.e. the IDs that appear in every returned set) + some business logic in between
this is already a performance optimization on precomputed data in memory, since the database chokes on these calculations (we optimized as much as we could before), therefore “use SQL” is not an option here
this search function is also a hot code path throughout our whole application and must be as fast as technically possible (a few milliseconds, tops)
it worked quite well so far, but now that the number of records in the tables (and the length of the result sets) grows over time, this “algorithm” is becoming noticeably slower
we don’t want to introduce additional infrastructure components like search engines
so this is the situation right now. the precomputation of data and holding it in ETS tables already brought us a great performance boost, but it seems the “intersection” logic with custom rules is the new bottleneck now.
We already use Rustler for a different problem elsewhere in the project, where it basically “solved” another complicated computational problem. But here I think the actual problem is the “copying” of increasing amounts of data (a few dozen megabytes) between BEAM processes, plus the computation afterwards.
My idea right now is to move the precomputation code into a Rustler NIF, maybe even store it all in a single map structure instead of multiple ETS tables, and on request pass this map + the search parameters to the NIF to calculate the resulting list of IDs there as well and return it.
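The intersection step that would move into the NIF is easy to express in Rust terms. A minimal sketch (the `u64` ID type, the function name, and the smallest-set-first ordering are my assumptions, not part of the actual design):

```rust
use std::collections::HashSet;

/// Intersect several ID sets. Sorting smallest-first means the result
/// can never be larger than the first set, and each candidate ID is
/// then just membership-checked against the remaining sets.
fn intersect_ids(mut sets: Vec<&HashSet<u64>>) -> HashSet<u64> {
    if sets.is_empty() {
        return HashSet::new();
    }
    sets.sort_by_key(|s| s.len());
    let (first, rest) = sets.split_first().unwrap();
    first
        .iter()
        .copied()
        .filter(|id| rest.iter().all(|s| s.contains(id)))
        .collect()
}

fn main() {
    let a: HashSet<u64> = [1, 2, 3, 4].into_iter().collect();
    let b: HashSet<u64> = [2, 3, 4, 5].into_iter().collect();
    let c: HashSet<u64> = [0, 2, 4].into_iter().collect();
    let result = intersect_ids(vec![&a, &b, &c]);
    assert_eq!(result, [2u64, 4].into_iter().collect());
}
```

The “business logic in between” would slot into the filter step; this only shows the raw set part.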
The three questions I have right now (couldn’t find details so far):
Can I pass (and receive!) data back and forth into the NIF via pointer/reference instead of copying it to eliminate any overhead here?
Can I hold the precomputed data in a GenServer and make it accessible in a concurrent way, similar to how ETS can be configured for parallel read access? Without too much copying?
Can I modify parts of the data or replace the whole data container from within Rust to speed up things?
AFAIK a GenServer cannot handle true concurrency (it iterates over its single linear inbox), Task creates new processes that copy memory into them, and even ETS copies memory when querying, but at least updates are “in-place”.
Resource references can be shared across nodes, but they can only be used (in a NIF) on the node that created them initially, and only if it’s still alive when accessed. E.g. if you create a resource object in a process on your node, send it to another node, let the process die, and receive the answer back, the resource object will no longer “work”.
I’m currently working on refactoring resources (and now that you say it, @cpud36, the docs for this one are indeed very much lacking) as they are using a mechanism that will produce warnings and later errors in rustc.
Data is always passed by reference into NIFs, as they run in the same process environment as the calling process. You could (as an alternative to using a resource object) also store the data in a binary (e.g. using something like flatbuffers). If it’s larger than 64 bytes, its data will not be stored directly in the process heap and will not be copied either.
My Rust chops have eroded a bit since I haven’t used it actively in about half a year, but as far as I remember I have used DashMap successfully before in an application with very heavy read and write caching. It’s single-node.
Sadly this predates the time I became obsessed with telemetry, so no clue whether it does better than ETS, but since it doesn’t copy data I’d expect it to be faster.
Others have already pointed you at how you can create a resource inside the Rust code and return a reference to it in Elixir.
If you’re precomputing the data, you could look into using :persistent_term, but I’m not sure about the viability of updating that data, since you’d force a global GC on every update. How frequent are the updates?
I have used DashMap successfully before in an application that had a very heavy usage of read and write caching.
this looks nice indeed, but my problem around it remains: where do I store that data in Elixir land? a GenServer AFAIK does not allow for truly concurrent reads
Matches my understanding. Thank you for clarification.
I appreciate your work on the crate, and I had successfully used it just by looking at the signatures. Thank you.
I am willing to improve the docs once I find some free time. Is it ok to send a draft PR, so you can review for correctness?
Do you have an idea what your current bottleneck is? There are several things you can do to improve performance.
First, you can move business logic from the GenServer into the callers. The idea is to serialise your HashSet into a binary and, on request, send the relevant HashSet to the caller. This is free, since (large?) binaries are shared between processes. You can optimise even further and read the ETS table directly from the caller, though I’m not sure how well that performs. Computing in the caller processes helps utilise multiple CPU cores and reduces contention.
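For the “serialise into a binary” part, one possibility is a hand-rolled flat little-endian layout (a stand-in for flatbuffers; the encoding and the function names here are my own assumptions):

```rust
use std::collections::HashSet;
use std::convert::TryInto;

/// Pack a set of IDs into a flat little-endian byte buffer.
/// On the BEAM side such a buffer would live in a refc binary,
/// which processes share by reference instead of copying.
fn encode_ids(ids: &HashSet<u64>) -> Vec<u8> {
    let mut buf = Vec::with_capacity(ids.len() * 8);
    for id in ids {
        buf.extend_from_slice(&id.to_le_bytes());
    }
    buf
}

/// Rebuild the set from the flat buffer (8 bytes per ID).
fn decode_ids(buf: &[u8]) -> HashSet<u64> {
    buf.chunks_exact(8)
        .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
        .collect()
}

fn main() {
    let ids: HashSet<u64> = [10, 20, 30].into_iter().collect();
    let bin = encode_ids(&ids);
    assert_eq!(bin.len(), 24); // 3 IDs * 8 bytes
    assert_eq!(decode_ids(&bin), ids);
}
```

A fixed-width layout like this also lets the caller do membership checks directly on the binary without fully decoding it, if that ever becomes worthwhile.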
Second, you can optimise the speed of your business logic itself, either in Elixir land or by moving parts or all of it into Rust routines.
And lastly, if you still have a lot of contention on the GenServer mailbox (after the first idea), you can replace the whole ETS table with a Rust object. You then wrap the Rust object in a resource and just pass that reference around instead of the GenServer. But my hunch is that ETS is already doing sensible things, so you will not gain much without fancy concurrent data structures on the Rust side.
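For the last idea, the Rustler `ResourceArc` wrapping is omitted here, but a pure-std sketch of the shared object you’d wrap could look like this (the key/value types and the plain `RwLock` choice are assumptions; this is the baseline a fancier concurrent structure would replace):

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};

/// The table the resource would hold: compound integer keys mapping
/// to ID sets, behind a RwLock for many readers / one writer.
/// In a real NIF this Arc would sit inside a rustler::ResourceArc.
type SearchIndex = Arc<RwLock<HashMap<(u64, u64), HashSet<u64>>>>;

fn insert(index: &SearchIndex, key: (u64, u64), ids: HashSet<u64>) {
    index.write().unwrap().insert(key, ids);
}

fn lookup(index: &SearchIndex, key: (u64, u64)) -> Option<HashSet<u64>> {
    // Readers only clone the one set they need, not the whole table.
    index.read().unwrap().get(&key).cloned()
}

fn main() {
    let index: SearchIndex = Arc::new(RwLock::new(HashMap::new()));
    insert(&index, (1, 7), [100, 200].into_iter().collect());
    let ids = lookup(&index, (1, 7)).unwrap();
    assert!(ids.contains(&100) && ids.contains(&200));
}
```

The point is that Elixir processes would then only pass the resource reference around; reads happen inside the NIF without copying the table.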
Can you batch those?
Also, how large are your HashSets? 10k-100k, or more?
Not exactly. DashMap is just a bunch of shards with RwLocks. I assume ETS does something kinda like that (it is an embedded DB after all, so it should have locks). But to be honest, I don’t know much about ETS internals.
So my guess is that ETS scalability is on par with DashMap’s. Am I wrong about ETS?
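That shards-plus-RwLocks design can be sketched in a few lines of std Rust (the shard count, hashing, and `u64` types are my assumptions; DashMap’s real implementation is more refined):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// A toy DashMap: N independent shards, each behind its own RwLock,
/// so writers to different shards never contend with each other.
struct ShardedMap {
    shards: Vec<RwLock<HashMap<u64, u64>>>,
}

impl ShardedMap {
    fn new(n: usize) -> Self {
        Self { shards: (0..n).map(|_| RwLock::new(HashMap::new())).collect() }
    }

    /// Hash the key to pick a shard deterministically.
    fn shard_for(&self, key: u64) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        (h.finish() as usize) % self.shards.len()
    }

    fn insert(&self, key: u64, val: u64) {
        self.shards[self.shard_for(key)].write().unwrap().insert(key, val);
    }

    fn get(&self, key: u64) -> Option<u64> {
        self.shards[self.shard_for(key)].read().unwrap().get(&key).copied()
    }
}

fn main() {
    let map = ShardedMap::new(16);
    map.insert(42, 7);
    assert_eq!(map.get(42), Some(7));
    assert_eq!(map.get(1), None);
}
```

Reads on different shards (and even on the same shard, via the read lock) proceed fully in parallel; only writes to the same shard serialise.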
But in the end, only benchmarking can show what is enough.
True enough, and I said the same thing above: haven’t benchmarked either; the only thing I’d predict as a clear win is that the Rust solution wouldn’t copy data around (I think). But indeed it has to be checked.
To OP: cachex or ane might also be what you need. ane in particular makes a special effort to not copy on cache hits, and to be faster even than raw ETS.