Using Registry as a counter? (or better alternatives?)

I have many actor processes in my application with event counts associated with each actor.

I’d like to have a live sum of the total actor events across the node (basically map-reduce across the processes).

I was thinking that I could register each process in the Registry with a count – and store a separate sum in the Registry that gets incremented with new events and decremented when actors crash (using trap_exit?)

Does that sound like an appropriate solution in Elixir or is there a more canonical OTP way to handle this?

Thanks,
Scott S.

2 Likes

Hey Scott! I’d probably use ETS for this, that should get you started down a more idiomatic path.

1 Like

ETS has a function for atomically incrementing a single counter value.

It's called from Elixir like this: :ets.update_counter.
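
For example, something like this (just a sketch; the :event_counts table name and :total key are placeholders I made up):

:ets.new(:event_counts, [:named_table, :public, :set])

# atomically bump the total; the {:total, 0} tuple is the row that gets
# inserted if the key doesn't exist yet
:ets.update_counter(:event_counts, :total, 1, {:total, 0})

# read the current total back
[{:total, total}] = :ets.lookup(:event_counts, :total)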

1 Like

Thanks, update_counter looks good!

Should I make a GenServer that monitors every actor and decrements counts when they crash?
Or is there a better way of handling that part?

If your workers implement OTP behaviours, they should have terminate callbacks you could decrement from.

With the warning that the terminate callback is not always called and cannot be relied upon to run.

4 Likes

+1. The terminate callback is not the appropriate place to clean up counters of processes that are terminated. You need another GenServer to monitor these processes and perform the cleanup.
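
A rough sketch of what such a monitor could look like, assuming the ETS counter idea above, where each actor bumps both its own per-pid row and a :total row in the same table (all names here are illustrative, not from any library):

defmodule EventCountMonitor do
  use GenServer

  def start_link(_opts), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  # each actor asks to be watched right after it starts
  def watch(actor_pid), do: GenServer.call(__MODULE__, {:watch, actor_pid})

  @impl true
  def init(nil) do
    # the monitor owns the counter table in this sketch
    :ets.new(:event_counts, [:named_table, :public, :set])
    {:ok, %{}}
  end

  @impl true
  def handle_call({:watch, pid}, _from, refs) do
    ref = Process.monitor(pid)
    :ets.insert(:event_counts, {pid, 0})
    {:reply, :ok, Map.put(refs, ref, pid)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, pid, _reason}, refs) do
    # subtract the dead actor's contribution from the running total
    case :ets.lookup(:event_counts, pid) do
      [{^pid, count}] ->
        :ets.update_counter(:event_counts, :total, -count, {:total, 0})
        :ets.delete(:event_counts, pid)

      [] ->
        :ok
    end

    {:noreply, Map.delete(refs, ref)}
  end
end

Each actor would then bump both rows on every event, e.g. :ets.update_counter(:event_counts, self(), 1) and :ets.update_counter(:event_counts, :total, 1, {:total, 0}), and the monitor keeps the total honest when an actor dies.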

Instead of rolling your own, you could use gproc aggregated counters.

To make that work, you need to register the aggregated counter, e.g. in your application start callback, or in some top-level process:

:gproc.add_local_aggr_counter(:my_counter)

Now, in every process, you initialize the local counter when the process starts:

# invoke in each actor process
:gproc.add_local_counter(:my_counter, 0)

Where 0 is the initial count for that process.

When you want to change the counter value, you can use update_counter:

# invoke in each actor process
:gproc.update_counter({:c, :l, :my_counter}, increment)

The :c and :l indicate that you’re updating a local counter which is tied to the current process. If the process terminates, its count will be removed from the aggregated count.

To get the aggregate value (sum of all counters), you need to invoke:

:gproc.lookup_local_aggr_counter(:my_counter)

Demo:

:gproc.add_local_aggr_counter(:my_counter)

:gproc.lookup_local_aggr_counter(:my_counter)
# 0

# start one agent and bump its count by 1
{:ok, agent1} = Agent.start_link(fn -> :gproc.add_local_counter(:my_counter, 0) end)
Agent.update(agent1, fn _ -> :gproc.update_counter({:c, :l, :my_counter}, 1) end)

# The aggregated count is now 1
:gproc.lookup_local_aggr_counter(:my_counter)
# 1

# start another agent and bump its count by 2
{:ok, agent2} = Agent.start_link(fn -> :gproc.add_local_counter(:my_counter, 0) end)
Agent.update(agent2, fn _ -> :gproc.update_counter({:c, :l, :my_counter}, 2) end)

# the aggregated count is now 3 (1 from agent1 and 2 from agent2)
:gproc.lookup_local_aggr_counter(:my_counter)
# 3

# stop agent2
Agent.stop(agent2)

# The aggregated count is now 1 (1 from agent1)
:gproc.lookup_local_aggr_counter(:my_counter)
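# 1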
5 Likes

But is there any way to decide whether using the terminate callback is appropriate or not? For instance: will it always be called on a clean exit, but only sometimes on a crash? Does it depend on whether you're using distributed Erlang? What are the guidelines for deciding whether terminate is an appropriate choice?

For instance, I have a GenServer responsible for processing files. Assuming it doesn't crash, is it OK to, say, remove the file from the terminate callback?

It just seems that if you can “never” rely on it under any circumstances, it doesn't make sense for it to even exist? Or am I missing something about its implementation and use cases?

1 Like

This is a very good question. I personally mostly avoid terminate, because it won’t be invoked if the process crashes, or if it’s forcefully terminated (killed) from the outside. Thus, if some cleanup code must be executed, I prefer having another process to do it.

However, using a cleanup process isn't synchronous, since the cleanup code will run after the “main” process has terminated. Therefore, there are some special cases where terminate works better. For example, a supervisor terminates its children from its own terminate callback. This ensures that when the supervisor goes down, its complete subtree is already down. I can't think of a way to ensure such synchronism using a separate cleanup process.

However, this approach suffers from potential theoretical issues. If a supervisor process is brutally killed or if it crashes, then this guarantee doesn’t hold. The child processes will still be taken down eventually, but not immediately. If some descendant is trapping exits and ends up in an infinite loop, it might never happen. This in turn could prevent the restart of the crashed supervisor, which could take down the entire system, or it could lead to duplicate processes running, which could cause some strange behaviour of the system.

However, we can assume that the supervisor process is thoroughly tested and hopefully free from unexpected crashes. In addition, in a properly constructed supervision tree, a supervisor is never brutally killed (because :shutdown of a supervisor is :infinity by default), so I’d say that these issues are theoretical, and very unlikely to occur in practice.

So my take would be to use terminate to implement a synchronous cleanup (things are cleaned up before the process terminates). For example, I use it in Parent to terminate children, similarly to supervisor. In such cases, you probably want to keep the logic of the process simple, to reduce the chance of it crashing. You may also consider setting its shutdown option to :infinity to prevent its parent from brutally killing it.

If the synchronism is not required, and a cleanup code must be executed when a process goes down, regardless of how/why it goes down, I’d suggest using a separate process.

7 Likes

Thank you @sasajuric
So the takeaway is that usually creating the process (i.e. a GenServer) and monitoring it from another process (i.e. another GenServer) is a sturdier solution for effectively dealing with cleanups?
And the monitor would usually sit under the application's root supervisor, since it only deals with monitoring/cleanup, while the processes executing the “work” would be part of their own subtree/supervisor?

I might be wrong, but while developing some GenServers that held “game state” in the past, I recall sometimes crashing them while working on the code, and yet terminate was still called?

Yes, it’s a bit more nuanced. A crash in init/1 won’t lead to terminate being invoked, while an exception raised by handle_* will. However, a linked error (e.g. a child task crashes) won’t lead to terminate, because the exit signal will take the process down (unless it’s trapping exits). Moreover, if a parent supervisor decides to stop the server, terminate is invoked only if the server is trapping exits.
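
For illustration, here is a sketch of a server that traps exits so terminate is at least invoked on a supervisor-initiated shutdown (the FileWorker name and the file-removal cleanup are made up, picking up the earlier file-processing scenario):

defmodule FileWorker do
  use GenServer

  def start_link(path), do: GenServer.start_link(__MODULE__, path)

  @impl true
  def init(path) do
    # without trapping exits, a :shutdown exit from the parent supervisor
    # kills the process immediately and terminate/2 is skipped
    Process.flag(:trap_exit, true)
    {:ok, path}
  end

  @impl true
  def handle_call(:process, _from, path) do
    # ... do the actual work with the file here ...
    {:reply, :ok, path}
  end

  @impl true
  def terminate(_reason, path) do
    # best-effort cleanup: still skipped if the process is brutally killed
    File.rm(path)
    :ok
  end
end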

As you can see, there are all sorts of edge cases here, but the main point is that, no matter what you do, you can’t be completely sure that the termination logic is invoked, so if you want stronger guarantees, it’s IMO better to use a separate process. Of course, not even that will ensure that the cleanup code is invoked, e.g. if the BEAM OS process is killed, or someone pulls the power plug :slight_smile:, but within the BEAM instance, you have more guarantees than when using terminate.

4 Likes

You can either have one “monitor” process for multiple workers, or have one companion process for each worker.

A downside of the former is that a single monitor might be a bottleneck if workers are frequently created/terminated. Moreover, a crash during the cleanup will crash all other pending cleanups (and maybe even all other workers).

A downside of the latter is process/memory overhead (for N workers, you’ll need 2N processes, or 3N if you introduce a parent supervisor for each worker-companion pair).

So, as usual, the answer is “it depends” :slight_smile:

1 Like

That is a valid concern, however the Registry documentation says:

Each entry in the registry is associated to the process that has registered the key. If the process crashes, the keys associated to that process are automatically removed.

Moreover, Registry allows duplicate keys, so you can essentially have one atom key managing a counter, akin to :ets.update_counter.

Combined with a mechanism that detects whether a shutdown was clean – e.g. a file that contains the tuple {:error, :unclean_exit} and is only changed to {:ok, :clean_exit} if the supervision tree exits properly – that should do the trick overall, or am I misunderstanding badly here?

This is my go to reference for when terminate/2 is called: https://gist.github.com/mrallen1/806fe5506132260574af33e99dadd499

10 Likes

Registry is an example of the kind of monitor process we’re discussing here, so yes, it can be used to maintain individual counts, and these counts will be properly removed when each process terminates.

However, the problem is that AFAIK Registry doesn’t support aggregate (total) counts, so you’d need to compute this manually, each time you need it. Perhaps that’s good enough.
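
For illustration, the manual version could look roughly like this with a unique Registry (the registry name and keys are made up for the example):

# started once, e.g. under the application supervisor
{:ok, _} = Registry.start_link(keys: :unique, name: MyApp.EventCounts)

# each actor registers itself with an initial count of 0...
{:ok, _} = Registry.register(MyApp.EventCounts, self(), 0)

# ...and bumps its own entry whenever it records an event
Registry.update_value(MyApp.EventCounts, self(), &(&1 + 1))

# the total has to be recomputed from all entries on every read
total =
  MyApp.EventCounts
  |> Registry.select([{{:_, :_, :"$1"}, [], [:"$1"]}])
  |> Enum.sum()

Note that the total is recomputed from all entries on every read, whereas gproc keeps the aggregate up to date and reads it back with a single lookup.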

If not, the alternative I proposed is :gproc which is an older, singleton version of Registry, and it supports aggregated counts out of the box. Each process manages its own count, while you can get the aggregated count via a simple lookup. From what I understand, :gproc takes care of keeping this aggregated count in sync whenever an individual count is modified.

2 Likes

Thank you for clarifying. I admit I never even tried :gproc.

Since the OP also wanted processes, maybe DynamicSupervisor.count_children would give them all they want in a single package? On one hand, processes that are dynamically spawned but still supervised, and on the other, an out-of-the-box counting mechanism?

Sorry if I am asking naive questions. I have used OTP several times very successfully but I don’t feel I am on an advanced level yet.

The way I understand the original problem, each actor process can have some count of events, and the OP needs to compute the total count of events. So, for example, we could have process A with 3 events, and process B with 5 events, and the total count is then 8 events. Counting processes won’t work here, because each process has a “weight” (event count) associated with it.

If I misunderstood the problem, then my proposed solution is possibly not suitable, or if it is, then it’s by accident :slight_smile:

2 Likes

Oh I see now, I misread the OP earlier, my bad. Serves me right for trying to rewatch Guardians of the Galaxy 2 and participate here. :003:

Well, since the people in this thread enumerated several very viable solutions, I’d say the OP has to adapt their case and code to what’s available if they really want a reliable, real-time counter. I’d probably either use :gproc as you suggested, or spawn one process per event and use DynamicSupervisor.count_children – but the other solutions should also work.

Thanks for the discussion, I learned a few things. :slight_smile:

4 Likes

:gproc looks really interesting!

I’m going to do some experiments with it and with Registry.

Thanks!

1 Like

Thanks for further explaining. So (caveats may apply, but usually) the decision between using terminate to handle cleanups and using an additional monitor comes down mostly to the types of crash that can occur inside the GenServer itself. If we’re spawning/starting linked processes from inside the GenServer, or doing an init that isn’t purely functional (not required, but a good sign) and may crash, a crash in those linked processes will bring the calling GenServer down without triggering terminate and without any way of ensuring cleanup, unless the calling GenServer is itself trapping exits.

If there’s a crash inside the GenServer’s handle_* callbacks themselves (i.e. outside init), it should in theory always call terminate.

That’s a very good link, thanks for sharing.

@darkmarmot sorry for sidetracking your thread ^

1 Like