Worker with state vs Cache + Worker

Hello!

I’m wondering about the best approach to use for data coming from worker processes under a supervision tree.

Is it better to keep state per worker (i.e. GenServer, :gen_statem) and restart them when they fail (maybe with a mechanism to save the data when they go down, e.g. ETS), or to use a cache process that is filled by worker processes (e.g. Task)?

The latter approach is more centralized, while the first is more distributed/decentralized.

Thanks

I found this article and have read a lot about GenStage and OTP in general, and it seems that Elixir Agents could be a good fit for your requirement, especially if you have a lot of concurrency.

Check this article for more info and clarity: https://medium.com/scientific-breakthrough-of-the-afternoon/elixir-agent-vs-genserver-ef443aa4a441

Also this one: https://blog.codeship.com/concurrency-abstractions-in-elixir/

I would not depend on Agents for a production use-case, particularly compared against a purpose-built GenServer or gen_statem module. The latter can have fully custom/arbitrary lifecycle behavior, error handling, proper supervision, etc.

For OP’s actual question: can you say more about the nature and use of the data? What’s its volatility? What’s its source of truth? How expensive is it to rebuild from nothing? Are concurrent writes necessary? What’s the appetite for eventual consistency? How dangerous is it for multiple BEAM nodes to have disjoint views of the information?

2 Likes

From what I can say: a BEAM node will receive a request and must run concurrent jobs to build a context for a further computation. The node needs to keep the result of each job.
Some jobs are really fast to rebuild and some are really expensive (such as those involving multiple network calls).
Concurrent writing does not really matter, because the jobs will produce independent data.

The other question that interests me: is it better to keep multiple processes each holding small or medium data, or only one process holding a lot of data? (even if ETS can be used to keep data off the process heap)

My current approach is:

  1. A top supervisor including: a Registry and a DynamicSupervisor.
  2. The DynamicSupervisor spawns, for each specific request, its own supervisor.
  3. The latter spawns the jobs (currently :gen_statem, to simplify state identification and processing).
  4. Each job registers itself in the Registry and keeps the job result.

When I want to retrieve all the job data, I use Registry.dispatch to broadcast the retrieval of the state and data from the jobs.
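A minimal sketch of what that tree could look like (untested; MyApp.JobRegistry, MyApp.RequestSupervisor, and RequestSup are placeholder names I made up):

defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      # :duplicate keys let several jobs register under the same request id,
      # which is what Registry.dispatch broadcasts over
      {Registry, keys: :duplicate, name: MyApp.JobRegistry},
      {DynamicSupervisor, strategy: :one_for_one, name: MyApp.RequestSupervisor}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

# Per request:
#   DynamicSupervisor.start_child(MyApp.RequestSupervisor, {RequestSup, request_id})
# Retrieval broadcast:
#   Registry.dispatch(MyApp.JobRegistry, request_id, fn entries ->
#     for {pid, _value} <- entries, do: send(pid, :report_state)
#   end)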

Another approach, especially regarding 3. and 4., would be to insert a cache inside that per-request supervisor and run each job as a Task, where each fills the cache and then dies (maybe freeing some memory).
The data would then be retrieved directly from this cache (but the cache’s memory will grow).
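A rough sketch of that variant (untested; module and names are placeholders I made up):

defmodule MyApp.RequestCache do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # A public, unnamed table: its tid is handed to each Task so the
    # short-lived jobs can insert their result directly and then exit.
    table = :ets.new(:request_cache, [:set, :public])
    {:ok, table}
  end

  @impl true
  def handle_call(:table, _from, table), do: {:reply, table, table}
end

# In each Task: :ets.insert(table, {job_name, result})
# Later reads:  :ets.lookup(table, job_name)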

1 Like

Thanks @shanesveller, for explaining why you wouldn’t use Elixir Agents in this situation.

1 Like

This sounds like a job for Task.Supervisor.async_stream to me. I use it a lot to “fan out” jobs into concurrent pieces while still collecting their results. Do review all the options for that function, though; I use ordered: false quite often.
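Roughly like this (untested; MyApp.TaskSupervisor and run_job/1 are placeholders):

# Assumes {Task.Supervisor, name: MyApp.TaskSupervisor} is in your supervision tree.
results =
  MyApp.TaskSupervisor
  |> Task.Supervisor.async_stream(job_specs, &run_job/1, ordered: false)
  |> Enum.map(fn {:ok, result} -> result end)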

However, if you really need :gen_statem to manage incoming messages to each of your jobs, then this approach is too simple.

1 Like

IMO you should have many processes with smaller data. Copying from ETS will be easier on the GC (after the process dies) and will also be faster.

The calculation results that are hard to recompute should go into a database. Everything else is fine in ETS.

Good point @gregvaughn, I didn’t think about Task.Supervisor because I didn’t want to use Task because I need to keep state somewhere and I do not want to retrieve the data when they are finished but when I will receive another request to check this state.

And with the :gen_statem approach, the jobs will not really receive messages; it is just a way to simplify the state identification and transitions. Only at the end will I request the process state and data, for example using :sys.get_state. But maybe there is a better way to do it.

Thanks @dimitarvp, I was thinking the same; I just didn’t know whether keeping multiple processes with state impacts memory more than one process with a much bigger state.

No, please, a thousand times no. Do not use :sys.get_state in production code; it is intended only for debugging purposes. You speak a lot about managing state carefully, but then you want to use this brutal approach. :frowning:

My suggestion wasn’t so much about Task.Supervisor as it was about async_stream, in which case the results come to you when they complete. There is no need to reach into another process’ state (which, to me, would be akin to grabbing money out of your wallet because you owe me).

And gen_statem seems to be overkill for your purposes too. The point of a “gen” style server is to be able to receive messages from other processes. You can use a basic state machine approach with Enum.reduce and an appropriate accumulator map/struct.

4 Likes

Thanks @gregvaughn.

Do you have an example of your idea of using a state machine with a simple Enum.reduce?
Because in the end I also need to keep the state of the data, even if the only message received will be “get_job_result”. So I’m wondering whether simple tasks + ETS would be sufficient.

1 Like

I’m sorry, I don’t have an example to share, but I view it as some basics of functional programming. Here’s some pseudocode (caution: I have not executed it) to think about:

Enum.reduce(data_list, {:initial_state, nil}, &transition/2)

def transition(data_element, {:initial_state, accumulated_data}) do
  # do something and return {new_state, new_accumulated_data}
end

def transition(data_element, {:state2, accumulated_data}) do
  # do something else and return {new_state, new_accumulated_data}
end

# ... plus as many more transition/2 function clauses as you need

Now the result of the Enum.reduce is {final_state, final_accumulated_data}, and if you use the async_stream approach I suggested, it will be sent to the calling process automatically. There is no need to store it (in process state or ETS) and later retrieve it with some :get_job_result message. That is the whole advantage of the async_stream approach.
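Putting the two together might look like this (again untested, same placeholder names as before):

results =
  MyApp.TaskSupervisor
  |> Task.Supervisor.async_stream(jobs, fn job ->
    Enum.reduce(job.data_list, {:initial_state, nil}, &transition/2)
  end, ordered: false)
  |> Enum.map(fn {:ok, {_final_state, final_data}} -> final_data end)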

1 Like

Thank you very much @gregvaughn.
Nice way to do it :slight_smile:

But in my use case, I don’t know if async_stream is the right thing to do, because I need to retrieve the results asynchronously later: not right after the jobs complete, but when I receive a request to check and compare the job results.

I would say to use gen_statem when:

  1. You’re modeling an external “real thing” that you have limited control over, or a stateful communications protocol.
  2. There are recurrent events that must be asynchronous (e.g. you have to check on your external thing and you don’t know when the response will come back).
  3. Your model is not transient (if your process crashes you want to recover it, and you can’t just throw it away and start from scratch).

If none of these apply, do not use gen_statem.
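For illustration, here is a minimal shape of case 1, a hypothetical (untested) connection manager:

defmodule MyApp.Connection do
  @behaviour :gen_statem

  def start_link(peer), do: :gen_statem.start_link(__MODULE__, peer, [])

  @impl :gen_statem
  def callback_mode, do: :state_functions

  @impl :gen_statem
  def init(peer) do
    # Start disconnected and schedule an immediate connection attempt.
    {:ok, :disconnected, %{peer: peer}, [{:state_timeout, 0, :connect}]}
  end

  # One function per state; the external thing decides when we move on.
  def disconnected(:state_timeout, :connect, data) do
    # A real module would attempt the network call here and branch on the result.
    {:next_state, :connected, data}
  end

  def connected({:call, from}, :status, data) do
    {:keep_state, data, [{:reply, from, :connected}]}
  end
end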