How to store lots of data in memory?

peerreynders · November 27, 2017, 3:34pm

You may have your reasons for not divulging any details about your processing requirements, so it is possible that you really do need a separate copy of the data for each user. Then again, that may simply be a superficially convenient choice.

How many simultaneous requests do you reasonably expect to be running against the data set? How large is the result that the requestor is going to get back?

In the BEAM environment it may make more sense to separate the “request” logic from the user’s client code and instead use that logic to build a short-lived process that runs the logic against its own copy to produce the result. While that may lead to more copying, it may actually require far fewer simultaneous copies of the data in memory, and process termination makes GC extremely simple: the whole process heap is reclaimed at once.
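As a rough sketch of that idea (the module name `ShortLived`, the function `run/2`, and the sample data are hypothetical and not taken from the original question), the request logic can be passed as a function and executed in a one-off process via `Task.async/1`. The data captured by the closure is copied into the new process, and that copy disappears when the process exits:

```elixir
defmodule ShortLived do
  # Hypothetical helper: runs `request_fun` against `data` in a one-off process.
  # `data` is captured by the closure, so it is copied into the task's heap;
  # the copy is reclaimed as a whole when the task process terminates.
  def run(data, request_fun) when is_function(request_fun, 1) do
    task = Task.async(fn -> request_fun.(data) end)
    # Only the (hopefully small) result is sent back to the caller.
    Task.await(task, 5_000)
  end
end

# Illustrative data set and request: sum all even values.
data = Enum.to_list(1..1_000)

result =
  ShortLived.run(data, fn rows ->
    rows |> Enum.filter(&(rem(&1, 2) == 0)) |> Enum.sum()
  end)
```

Whether this beats per-user copies depends on the answers to the questions above: it pays off when few requests run at once and the results are small compared to the data set.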

To cut down on the copying, processes could be reused a finite number of times or indefinitely (i.e. process pools as already suggested).

Copying could also be reduced/eliminated by partitioning the data in some logical way so that it could be efficiently accessed and shared (as already mentioned via ETS and/or distributing data dependent processing among multiple processes).
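A minimal sketch of the ETS variant, assuming a small key/value data set (the table name `:dataset` and the rows are made up for illustration): one process owns the table, and other processes read from it directly. Each lookup copies only the matching rows into the reader's heap rather than the whole data set.

```elixir
# The calling process owns the table; :protected (the ETS default) lets any
# process read while only the owner may write.
table = :ets.new(:dataset, [:set, :protected, read_concurrency: true])

# Illustrative rows keyed by an integer id.
:ets.insert(table, [{1, "alpha"}, {2, "beta"}, {3, "gamma"}])

# Two concurrent readers, each in its own process.
lookups =
  [1, 3]
  |> Enum.map(fn key -> Task.async(fn -> :ets.lookup(table, key) end) end)
  |> Enum.map(&Task.await/1)
```

How the data is keyed is the real design decision here: ETS only helps if requests can be answered from a few rows rather than a scan of the entire table.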

Shouldn’t be a surprise:

Programming Erlang 2e: Introduction: p. xiii:

Erlang belongs to the family of functional programming languages. Functional programming forbids code with side effects. Side effects and concurrency don’t mix. In Erlang it’s OK to mutate state within an individual process but not for one process to tinker with the state of another process. Erlang has no mutexes, no synchronized methods, and none of the paraphernalia of shared memory programming.

Processes interact by one method, and one method only, by exchanging messages. Processes share no data with other processes. This is the reason why we can easily distribute Erlang programs over multicores or networks.
When we write an Erlang program, we do not implement it as a single process that does everything; we implement it as large numbers of small processes that do simple things and communicate with each other.

Essentially the same applies to Elixir. Sharing is convenient, but that convenience comes at a cost; it’s all about tradeoffs. Furthermore, some find the utility of Agents questionable while most see them as limited, see this recent topic.

Agent is a specialization that focuses entirely on state. GenServer embodies the more general notion of a process minding its own state and maintaining full control over access (via messaging) and mutation of that state.
In Elixir the Task is often used for short-lived processes but GenServers will still be used for one-off processing when multiple processes have to coordinate processing in the pursuit of a common objective.
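To make the GenServer point concrete, here is a small sketch (the module `DataOwner` and its functions are hypothetical names chosen for illustration). The process holds the data as its state, callers can only reach it through messages, and a query sends back just the computed result instead of the data itself:

```elixir
defmodule DataOwner do
  use GenServer

  # Client API (hypothetical names)
  def start_link(rows), do: GenServer.start_link(__MODULE__, rows)
  def add(pid, row), do: GenServer.cast(pid, {:add, row})
  def count_where(pid, pred), do: GenServer.call(pid, {:count_where, pred})

  # Server callbacks: the list of rows is the process state.
  @impl true
  def init(rows), do: {:ok, rows}

  @impl true
  def handle_call({:count_where, pred}, _from, rows) do
    # The query runs inside the owning process; only the count is replied.
    {:reply, Enum.count(rows, pred), rows}
  end

  @impl true
  def handle_cast({:add, row}, rows), do: {:noreply, [row | rows]}
end

{:ok, pid} = DataOwner.start_link([1, 2, 3, 4])
:ok = DataOwner.add(pid, 5)
count = DataOwner.count_where(pid, &(&1 > 2))
```

The tradeoff is that a single process serializes all requests, which is why partitioning the data across several such processes was suggested earlier.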

GenServer is the fundamental building block; Task and Agent are mere convenience specializations that are typically only useful under the most trivial of circumstances.