How to avoid the performance impact of large ETS-based queries?

Cruz · April 21, 2019, 3:40pm

Hi,

I’m writing a service in which each transaction starts with the system returning a “catalog” of available products to each caller/client. There are more than 1K products and 11 different catalogs. Each product contains a series of IDs, a name and descriptions in two languages, multiple flags, and other data. In other words, the size of each catalog is considerable.

Given how frequently the catalogs are used, I thought of loading them into ETS tables for quick access. However, I found the following article:

https://medium.com/@jacob.lerche/using-constant-pools-to-speed-up-your-elixir-code-c527d533c941

The author suggests using a macro when the size of the data is too large for ETS. However, he doesn’t say how much data is too much. Also, using macros implies having to recompile every time the data changes. This might be OK, but I prefer to avoid it if possible.

Has anyone run into this issue before? Do you know when the size of the data becomes an issue for ETS?

I plan to do some load testing with the ETS-based solution. Any other suggestions?

Thank you

wmnnd · April 21, 2019, 3:48pm

@evadne did a presentation at this year’s ElixirConfEU where she benchmarked ETS.
The video isn’t up yet, but here are her slides and the demo repository:

Cruz · April 21, 2019, 4:14pm

Thank you. I didn’t see any reference to my specific case, but the set of references at the end of her presentation seems very useful. I find:

in particular, very interesting. It’s a hybrid approach that will allow me to keep using ETS but dynamically produce an Erlang module for “reads”. I’ll play with it.

Thanks again

idi527 · April 21, 2019, 4:26pm

I plan to do some load testing with the ETS based solution. Any other suggestion?

That’s probably the best approach …

Also providing a sample of your data and the planned access patterns would be useful in case you run into performance problems.
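As a starting point for that kind of measurement, here is a rough single-process sketch that times a full-table read, a single-row lookup, and a :persistent_term read. The table name, key, and product shape are made up for illustration, and a real load test would use many concurrent processes (or a dedicated tool such as Benchee) and your actual data.

```elixir
# Hypothetical data: 1,000 products stored as {id, product} rows.
products =
  for id <- 1..1_000 do
    {id, %{id: id, name: "Product #{id}", description: String.duplicate("x", 200)}}
  end

table = :ets.new(:bench_catalog, [:set, :public, read_concurrency: true])
:ets.insert(table, products)
:persistent_term.put(:bench_catalog, Enum.map(products, &elem(&1, 1)))

# Runs `fun` 1,000 times and returns the average time per call in microseconds.
time = fn fun ->
  {micros, _result} = :timer.tc(fn -> Enum.each(1..1_000, fn _ -> fun.() end) end)
  micros / 1_000
end

ets_full = time.(fn -> :ets.tab2list(table) end)
ets_one = time.(fn -> :ets.lookup(table, 500) end)
pt_full = time.(fn -> :persistent_term.get(:bench_catalog) end)

IO.puts("ETS full read:        #{ets_full} µs/op")
IO.puts("ETS single lookup:    #{ets_one} µs/op")
IO.puts("persistent_term read: #{pt_full} µs/op")
```

The absolute numbers will vary by machine and data, so treat them only as a relative comparison of the access patterns.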

benwilson512 · April 21, 2019, 4:32pm

Hey @Cruz. Does each transaction load the entire catalog or just specific items within it? Specifically, the article is talking about how the copying penalty happens when the items you want to look up are themselves very large, not necessarily when the whole table is large. If the whole ets table is large but each item is small, and you only want a few items, those are the only items that are copied.
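To make the distinction concrete, here is a minimal sketch assuming one ETS row per product. The module name, table name, and product shape are hypothetical. A single lookup copies only the matching row into the calling process, while reading the whole table copies every row.

```elixir
defmodule CatalogEts do
  # Hypothetical table name and row shape, for illustration only.
  @table :catalog_products

  def load(products) do
    # Recreate the table so load/1 can be called more than once.
    if :ets.whereis(@table) != :undefined, do: :ets.delete(@table)
    :ets.new(@table, [:set, :named_table, :public, read_concurrency: true])
    :ets.insert(@table, Enum.map(products, fn p -> {p.id, p} end))
    :ok
  end

  # Copies only the one matching row into the caller's heap.
  def fetch(id) do
    case :ets.lookup(@table, id) do
      [{^id, product}] -> {:ok, product}
      [] -> :error
    end
  end

  # Copies every row in the table into the caller's heap.
  def all do
    @table
    |> :ets.tab2list()
    |> Enum.map(fn {_id, product} -> product end)
  end
end

products = for id <- 1..1_000, do: %{id: id, name: "Product #{id}", active: true}
CatalogEts.load(products)
```

With this layout, `CatalogEts.fetch(42)` stays cheap regardless of table size, and only `CatalogEts.all()` pays for copying the full catalog.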

These days, however, if it’s a very constant sort of thing, I’d look at using :persistent_term (http://erlang.org/doc/man/persistent_term.html). You get basically the same performance and copying characteristics as the macro approach, without having to deal with macros.
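A minimal sketch of what that could look like, assuming one term per catalog. The module name and key shape are hypothetical; only `:persistent_term.put/2` and `:persistent_term.get/2` come from OTP.

```elixir
defmodule CatalogTerm do
  # Hypothetical key shape; any Erlang term can serve as a key.
  def put(catalog_id, products) do
    :persistent_term.put({__MODULE__, catalog_id}, products)
  end

  # Returns the stored catalog without copying it into the caller's heap,
  # or an empty list if nothing is stored under that key.
  def get(catalog_id) do
    :persistent_term.get({__MODULE__, catalog_id}, [])
  end
end

products = for id <- 1..3, do: %{id: id, name: "Product #{id}"}
:ok = CatalogTerm.put(:en, products)
catalog = CatalogTerm.get(:en)
```

One trade-off to keep in mind: according to the Erlang documentation, updating or erasing a persistent term triggers a scan of all processes and garbage collection for those still referencing the old value, so this fits data that changes rarely.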

Cruz · April 21, 2019, 4:45pm

Hi Ben,

Thank you. I had completely forgotten about persistent_term. That seems to be the optimal solution since I’m just trying to avoid the copy of the data.

To answer your question, I have to support both types of operations. As a first step, the appropriate catalog will go back to the client in full, that is, all 1K products. Then, the client will select some products from that catalog, and the system will have to access them individually on each subsequent transaction. For this second type of transaction ETS is great, but I was worried about the first type.

I’ll proceed with the benchmark, but I now have plan B

Many thanks!

benwilson512 · April 21, 2019, 4:53pm

@Cruz There’s a final option here, which is to treat the catalog like a file and put it on a CDN with appropriate cache headers. If it changes rarely and clients are of the type that can cache values locally, then in an optimal world they shouldn’t even be re-retrieving it all that often.

tty · April 21, 2019, 4:55pm

I’m usually the first person to discourage ETS and recommend Mnesia from the get-go, because most applications evolve toward being distributed in a later phase.

However, sticking with ETS, I have found that, all things being equal, its performance depends heavily on the OS, with Linux > Solaris > Windows (in 2014), for table sizes going past 3.5GB. Unfortunately I haven’t done any performance tests recently.

In one project we ditched ETS in favor of loading the data in modules. Our use case allowed very clear separation of the data. E.g. loading from a static data source we had ModuleBookOne, ModuleBookTwo etc which rarely changed. We then compiled and distributed the modules as per normal. Doing a distributed load meant we could preload this module across our system without stoppage.

All in all, load testing would be your best bet.

Cruz · April 21, 2019, 4:57pm

That’s also a good idea, but for the moment, I like that the Erlang team provided a solution for this. It will allow me to keep the solution self-contained.

jacoblerche · April 21, 2019, 6:32pm

Hey there, author of the article here. Ben already gave very comprehensive answers, I’ll just add a few points as to when to use constant pools:

- Large data that needs to be read by a lot of processes. If it’s just a handful of processes, you might be better off with ETS or something else, unless the data is gigantically large
- Data that still needs to be updated regularly, but infrequently compared to the reads

I should note that compilation of a module with static data is actually deceptively fast, especially if you use Module.create/3.
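For reference, a minimal sketch of generating such a module at runtime with Module.create/3. The names `CatalogCompiler` and `CatalogSnapshot` are hypothetical, and this is only one way to embed the data as a literal.

```elixir
defmodule CatalogCompiler do
  # Compiles a module whose all/0 returns `products` as a literal,
  # so callers read it from the module's constant pool.
  def compile(module_name, products) do
    contents =
      quote do
        def all, do: unquote(Macro.escape(products))
      end

    Module.create(module_name, contents, Macro.Env.location(__ENV__))
    :ok
  end
end

CatalogCompiler.compile(CatalogSnapshot, [%{id: 1, name: "Product 1"}])
snapshot = CatalogSnapshot.all()
```

Calling `compile/2` again with the same module name replaces the module with the new data, which is how updates would be applied under this approach.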

I should also point out that binary data benefits from reference counting. IIRC, if the binary is larger than 64 bytes, only a reference to it is copied over to a process. Just another thing to possibly consider if it fits your needs.

michalmuskala · April 22, 2019, 8:41pm

I saw the mention of :persistent_term, so I’d like to underline that dynamically compiling modules with static data now that we have :persistent_term is in almost all of the scenarios going to be slower and will do more operations than using :persistent_term. I’d consider that technique to be largely obsolete now.

Cruz · April 22, 2019, 9:14pm

Hi Michal,

Yes, I’ll definitely use :persistent_term if ETS becomes too slow for the “large data” query. Actually, I read somewhere that even the Discord team stopped using their FastGlobal library in favour of plain ETS. So, perhaps I don’t need any optimization either.

Thank you for your feedback