How to store lots of data in memory?

tap349 · November 26, 2017, 9:15am

Good day!

I searched the forum for similar questions but couldn’t find exactly what I need.

I use an Agent process to store shared read-only data in memory. The size of the data is ~10 MB.
Every user has their own handler process (implemented as a GenServer process) that uses this data to calculate some result. So every time a request comes in, 10 MB are copied from the Agent process (since everything must be immutable, I guess).

I’ve noticed that the 3rd or the 4th time data is read from Agent process in each user process, it’s not fully garbage collected so memory consumed by each user process grows in size.

And if I have 1000 users, total memory consumption might grow up to 10 MB x 1000 = 10 GB which seems unacceptable.

My question is how to store lots of read-only data in memory so that it’s efficiently shared among many long-running processes? Using ETS doesn’t seem to solve the problem.

P.S. data stored in memory is a parsed YML file. It occupies 10 KB in filesystem but 10 MB when read and parsed - maybe there’s a way to optimize storing it in memory as well?

Thanks.

yurko · November 26, 2017, 9:40am

P.S. data stored in memory is a parsed YML file. It occupies 10 KB in filesystem but 10 MB when read and parsed - maybe there’s a way to optimize storing it in memory as well?

There should be. It’s 1000 times more in memory than on disk, which doesn’t sound right. How did you measure this difference? If you dump it into a file, is it 10 KB again?

Using ETS doesn’t seem to solve the problem.

If you store it there “as is” as one entry it will not bring much, but if you put the parts of that (structured) data under different keys it will help with memory consumption.

See related discussion:

LostKobrakai · November 26, 2017, 12:24pm

Depending on how often your shared data does change maybe look at this one: https://github.com/discordapp/fastglobal

cdegroot · November 26, 2017, 6:55pm

I don’t understand that, because storing lots of in-memory state in ETS usually works fine. Unless of course, you still copy the data out of ETS for every request…

(also, the size does not make sense; what library are you using to parse the YAML, and what does the structure look like internally, e.g. when you inspect it?)

Ordermind · November 26, 2017, 11:38pm

Depending on your needs it might work to have a genserver that holds the data and which you can query through an api to get the parts you need. If that’s possible for you it should be able to cut down memory usage.

tap349 · November 27, 2017, 1:40pm

There should be. It’s 1000 times more in memory than on disk, which doesn’t sound right. How did you measure this difference? If you dump it into a file, is it 10 KB again?

Sorry for misleading all of you about data size - after reading file from disk I calculate and store some additional metadata. That’s why memory grows that large )

If you store it there “as is” as one entry it will not bring much, but if you put the parts of that (structured) data under different keys it will help with memory consumption.

It would be an option, but unfortunately I need all of this data to process each request.
This data (read from the YML file) contains sets of rules, all of which need to be evaluated against each request.

tap349 · November 27, 2017, 1:43pm

Depending on how often your shared data does change maybe look at this one: https://github.com/discordapp/fastglobal

That looks like exactly what I need - but I’m still a little bit surprised that Elixir/BEAM copies all the data on each read from the Agent process, and that there are no optimizations in this regard, since it must be a common case. And copying all the time for each user doesn’t sound scalable.

I’ll give this library a try )

tap349 · November 27, 2017, 1:47pm

(also, the size does not make sense; what library are you using to parse the YAML, and what does the structure look like internally, e.g. when you inspect it?)

This part of my question is no longer relevant as I wrote above - YML parser must be okay (I use a fork of yamler).

I don’t understand that, because storing lots of in-memory state in ETS usually works fine. Unless of course, you still copy the data out of ETS for every request…

Frankly speaking, I haven’t tried the ETS solution myself - I’m judging by what I’ve read somewhere on this forum: storing in-memory state in ETS is okay, but when this state is read by thousands of processes, it’s copied from ETS each time, leading to the problem I stated in the first post.

tap349 · November 27, 2017, 1:55pm

Depending on your needs it might work to have a genserver that holds the data and which you can query through an api to get the parts you need. If that’s possible for you it should be able to cut down memory usage.

As far as I understand, Agent is GenServer as well. So I must be already using it )
The point is that I need not the parts but all this data at once and on each request.
So using any kind of API wouldn’t help me.

Thanks for all answers - it looks like it’s a normal behaviour in OTP: when dealing with both Agent process and ETS, data is copied into client process each time.

So possible solutions I see now:

use fastglobal package mentioned by @LostKobrakai
tune the Erlang VM to use a more aggressive garbage collection strategy
kill long-running user handler processes when there are no messages in their mailboxes (maybe this is why not all data read from Agent process is garbage collected after processing request)
reduce the size of data (I store lots of IDs - probably using something like NatSet instead of MapSet would help)
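
The second and third items in this list can be approximated per process without touching VM-wide settings. The following is a minimal sketch, assuming a hypothetical `UserHandler` module and a stand-in rule evaluation (counting matching entries): `spawn_opt: [fullsweep_after: 0]` asks for a full-sweep collection on every GC of that process, and returning `:hibernate` makes the process garbage-collect and shrink its heap while it waits for the next message. Whether this is enough for a given workload would need to be measured.

```elixir
defmodule UserHandler do
  use GenServer

  # Hypothetical handler; `agent` is the pid of the Agent holding the rules.
  def start_link(agent) do
    # fullsweep_after: 0 requests a full-sweep collection on every GC of this process.
    GenServer.start_link(__MODULE__, agent, spawn_opt: [fullsweep_after: 0])
  end

  def handle(pid, payload), do: GenServer.call(pid, {:handle, payload})

  @impl true
  def init(agent), do: {:ok, agent}

  @impl true
  def handle_call({:handle, payload}, _from, agent) do
    # The whole rule set is copied into this process here.
    rules = Agent.get(agent, & &1)
    # Stand-in for real rule evaluation: count the entries equal to the payload.
    result = Enum.count(rules, fn rule -> rule == payload end)
    # :hibernate garbage-collects and compacts the heap until the next message arrives.
    {:reply, result, agent, :hibernate}
  end
end
```

Hibernation has a cost of its own (the process has to be woken up again), so it fits handlers that sit idle for long periods better than busy ones.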

cdegroot · November 27, 2017, 2:02pm

Compile the set of rules. Clearly it’s code, not data. Problem solved.

Another way - depending on the problem - would be to have a pool of workers that each have a copy of the ruleset as state and something like poolboy to route requests; that would also get rid of the copying.
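
A rough sketch of the pool idea, written without poolboy so that it stays self-contained. `RuleWorker`, `RulePool`, the keyword-list rule format and the "payload >= minimum" evaluation are all hypothetical stand-ins, and there is no supervision here. Each worker keeps its own copy of the rules in its state, so only the payload and the result are copied between processes.

```elixir
defmodule RuleWorker do
  use GenServer

  def start_link(rules), do: GenServer.start_link(__MODULE__, rules)

  def evaluate(worker, payload), do: GenServer.call(worker, {:evaluate, payload})

  @impl true
  def init(rules), do: {:ok, rules}

  @impl true
  def handle_call({:evaluate, payload}, _from, rules) do
    # Stand-in for real rule evaluation: keep the rules the payload satisfies.
    matched = Enum.filter(rules, fn {_name, min} -> payload >= min end)
    {:reply, Keyword.keys(matched), rules}
  end
end

defmodule RulePool do
  # Starts `size` workers, each with its own copy of `rules`; returns a tuple of pids.
  def start(rules, size) do
    1..size
    |> Enum.map(fn _ ->
      {:ok, pid} = RuleWorker.start_link(rules)
      pid
    end)
    |> List.to_tuple()
  end

  # Picks a worker by hashing the user id into the range 0..size-1.
  def evaluate(pool, user_id, payload) do
    index = :erlang.phash2(user_id, tuple_size(pool))
    RuleWorker.evaluate(elem(pool, index), payload)
  end
end
```

For example, `RulePool.start([adult: 18, senior: 65], 3)` keeps three copies of the rules in memory regardless of how many user processes call `RulePool.evaluate/3`.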

cdegroot · November 27, 2017, 2:12pm

It becomes less of a surprise if you think through the implications of doing that. Erlang does garbage collection on the process level, which is very simple in multiple regards: process memory is small, and processes are interruptible so a small pause to do a quick GC is acceptable. This keeps GC simple. Now, think of the case when you would get data by reference out of a process (an Agent is just a process) - suddenly the GC has to keep track of pointers globally in the VM and simplicity gets tossed out of the window.

(the actual details are, of course, a bit more complicated than I just said. This seems to be a decent quick overview with pointers to further reading. Erlang will ask you to open the hood and look at how the engine works a bit sooner than other systems, but the pay-off is good performing stable code and the investment isn’t that high (compared to, say, learning the Java Memory Model). Well worth it.)

dom · November 27, 2017, 3:11pm

Have you tried processing the request in the agent process rather than the user handler process? If the request is small but the rules are large, it would require a lot less data transfer between processes. The downside is the agent becomes a bottleneck, but it’s very easy to create a pool of them as suggested above.
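
As an illustration of this suggestion: the function passed to `Agent.get/2` runs inside the agent process, so the rules never leave it. The rule format and the age check below are made-up placeholders, not the original poster's data.

```elixir
{:ok, agent} = Agent.start_link(fn -> [min_age: 18, max_age: 65] end)

# Hypothetical small request.
request = %{age: 30}

# The anonymous function is sent to the agent and executed there. Only the
# closure (carrying the small request) and the boolean result are copied.
allowed =
  Agent.get(agent, fn rules ->
    request.age >= rules[:min_age] and request.age <= rules[:max_age]
  end)
```

While the function runs, the agent cannot serve other callers, which is the bottleneck mentioned above.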

peerreynders · November 27, 2017, 3:34pm

You may have your reasons for not divulging any details about your processing requirements - so it is possible that you are absolutely correct that you need a separate copy of the data for each user - and then again that may simply be a superficially convenient choice.

How many simultaneous requests do you reasonably expect to be running against the data set? How large is the result that the requestor is going to get back?

In the BEAM environment it may make more sense to separate the “request” logic from the user’s client code and instead use that logic to build a short-lived process that runs the logic against its own copy to produce the result. While that may lead to more copying, it may actually require far fewer simultaneous copies of the data in memory, and process termination makes GC extremely simple.

To cut down on the copying, processes could be reused a finite number of times or indefinitely (i.e. process pools as already suggested).
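
A minimal sketch of the short-lived process variant, assuming the rules live in an Agent and using a placeholder computation (counting the rule values divisible by the payload). The copy of the rules lands on the task's heap and is reclaimed as a whole when the task exits.

```elixir
# Stand-in for the shared rule set.
{:ok, agent} = Agent.start_link(fn -> Enum.to_list(1..1_000) end)

payload = 10

task =
  Task.async(fn ->
    # The rules are copied into the task process, not into the caller.
    rules = Agent.get(agent, & &1)
    # Placeholder for the real rule evaluation.
    Enum.count(rules, fn rule -> rem(rule, payload) == 0 end)
  end)

# Only the small result is sent back; the task exits after replying.
result = Task.await(task)
```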

Copying could also be reduced/eliminated by partitioning the data in some logical way so that it could be efficiently accessed and shared (as already mentioned via ETS and/or distributing data dependent processing among multiple processes).

Shouldn’t be a surprise:

Erlang Programming 2e: Introduction: p. xiii:

Erlang belongs to the family of functional programming languages. Functional programming forbids code with side effects. Side effects and concurrency don’t mix. In Erlang it’s OK to mutate state within an individual process but not for one process to tinker with the state of another process. Erlang has no mutexes, no synchronized methods, and none of the paraphernalia of shared memory programming.

Processes interact by one method, and one method only, by exchanging messages. Processes share no data with other processes. This is the reason why we can easily distribute Erlang programs over multicores or networks.
When we write an Erlang program, we do not implement it as a single process that does everything; we implement it as large numbers of small processes that do simple things and communicate with each other.

The same essentially applies to Elixir. Sharing is convenient but that convenience comes at a cost - it’s all about tradeoffs. Furthermore some find the utility of Agents questionable while most see them as limited, see this recent topic.

Agent is a specialization that focuses entirely on state. GenServer embodies the more general notion of a process minding its own state and maintaining full control over access (via messaging) and mutation of that state.
In Elixir the Task is often used for short-lived processes but GenServers will still be used for one-off processing when multiple processes have to coordinate processing in the pursuit of a common objective.

GenServer is the fundamental building block, Task and Agent are mere convenience specializations that are typically only useful under the most trivial of circumstances.

rvirding · November 29, 2017, 1:50am

ETS keeps its data entirely separate from all processes, so when you access an ETS table you copy the data between the table and the process. HOWEVER, you do not copy the whole table each time, only the elements you actually access. This does mean you should not store all the data in one element but in multiple elements, and use the key to select which element.

Actually there is no way to avoid copying data if you are sharing data between processes, keeping it in an ETS table or storing it somewhere “outside” the erlang/elixir system. It’s a fact of life. Just make sure you can access it in small chunks.
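
The per-key access described here might look like the following sketch. The table name, the rule groups and the check names are invented for illustration.

```elixir
table = :ets.new(:rules, [:set, :protected, read_concurrency: true])

# Hypothetical rule set, split into logical groups.
rules = %{"signup" => [:check_email, :check_age], "payment" => [:check_card]}

# One row per rule group instead of a single row holding the whole map.
Enum.each(rules, fn {group, checks} -> :ets.insert(table, {group, checks}) end)

# Only the matching row is copied into the calling process.
[{"payment", payment_checks}] = :ets.lookup(table, "payment")
```

Storing the whole map under a single key would instead copy everything on each `:ets.lookup/2`.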

tap349 · November 29, 2017, 9:06pm

The set of rules is loaded when the application starts - the rules are stored in YML because they are meant to be edited by ordinary users.

Correct me, please, if I’m wrong - by compilation you mean hard-coding this data somewhere in a config file? In this case it wouldn’t work because rules should be stored in a user-friendly format (YML is user-friendly enough IMO).

So a limited number of processes (workers) have their own copies of data while user processes just pass request payload into workers to calculate the result? Well, that sounds reasonable - I’ll need to move all calculations from user processes into workers but that should not be a problem.

The only downside I see here is that I’ll be able to process max <pool_size> requests concurrently - but it must be okay for me too.

tap349 · November 29, 2017, 9:09pm

AFAIK agents are used for storing state only - and putting business logic inside them would break this contract. But moving logic and data access into dedicated processes (not necessarily agents) looks promising.

tap349 · November 29, 2017, 10:11pm

Thanks for detailed answer.

I don’t have strict requirements. I can’t say that I need a separate copy for each user - this is just how it’s currently implemented (maybe because it’s a ‘convenient choice’ or due to lack of knowledge of how to do it efficiently).

My application is a kind of microservice that receives requests (each having a user_id and some payload) and calculates the result using this payload and the whole set of rules (the very data).
When a request for a new user_id is received, a separate user handler process (GenServer) is created for this user - this process reads data from the agent and calculates the result. While there are not many simultaneous requests, there might be lots of long-running user handler processes (1k-10k). The result the requestor is going to get back is not large (about 1 KB).

To summarize (for myself and for future readers) I have the following options:

spawn a short-lived process (say, Task) from inside user handler process - this task reads data from agent, calculates the result, returns it to user process and dies

(+) when the task is terminated, it’s easy for GC to clean the copy of data read from agent inside terminated task
(-) data is still copied a lot which has its own impact on performance (copying 10-20 MB in memory is cheap but still not free)

use a limited set (pool) of long-running processes (workers) each having their own copy of data (just like @cdegroot suggested) in their states

(+) data is copied (and duplicated in memory) N times only (where N is a pool size)

This resembles solution suggested by @dom but here another abstraction (worker) is introduced between user handler process and agent.

partition data so that it’s accessed ‘in small chunks’ (either from agent or ETS)

(+) amount of copied data is reduced - this alleviates the problem but doesn’t remove it completely
(-) in my case that wouldn’t work since I cannot partition data - I need to access it fully in each request

use fastglobal package

(+) viable alternative if the problem can’t be solved using standard OTP tools
(-) it’s an external dependency

Taking all pros and cons into consideration, the 2nd solution looks like the best one for me now.

peerreynders · November 29, 2017, 10:47pm

This seems to be the most reasonable approach (in my opinion). Ditch the agent and just initialize each worker with its own copy of the data. Have a look at elixir_poolboy_example.

dom · November 30, 2017, 12:50am

I was suggesting that you can prototype that approach with very little change to your code, by doing the processing directly in the agent. Then if you’re happy with the performance you can ditch the agent and replace it with a (pool of) GenServer workers. You don’t need both agent and worker.

After that you can make all your workers join a process group (with pg2 for instance) and broadcast a message to that group when the config is updated.

cdegroot · November 30, 2017, 2:04am

Compilation means transforming stuff from user-friendly to machine-executable format. Read the Yaml, convert it to Erlang/Elixir/LFE code, then compile that source, and you have executable code - which seems to be what you want.
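
A deliberately simplified sketch of the idea: it only embeds already-parsed data into a generated module as a literal at runtime, rather than translating each rule into executable clauses as described above. `CompiledRules` and the rule map are hypothetical. Terms returned this way are module literals, so reading them does not copy them onto the caller's heap, which is roughly the trick fastglobal relies on.

```elixir
# Stand-in for the parsed YAML.
rules = %{"min_age" => 18, "countries" => ["DE", "FR"]}

# Build the body of a module whose function returns the rules as a literal.
contents =
  quote do
    def all, do: unquote(Macro.escape(rules))
  end

# Compiles and loads the module at runtime; calling this again with new
# contents replaces the module, e.g. after the YAML file has been edited.
Module.create(CompiledRules, contents, Macro.Env.location(__ENV__))

compiled = CompiledRules.all()
```

Recompiling is comparatively expensive, so this suits data that changes rarely.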

Usually, when you find you’re writing an interpreter, you’re doing it wrong. See also: Ruby