GenServer timeout

I use a GenServer (restart: :temporary) for a module that manages some state and handles access to it. Its init function and its handle_* callbacks return the same timeout value. The GenServer-based state-managing processes get started by invoking GenServer.start_link with a :via tuple for the name.

I also have another module that uses DynamicSupervisor to supervise the aforementioned processes, with a function to start them as its children.
When a process with a certain name already exists (because it has not timed out by the time its pid is fetched), the code pattern-matches the GenServer.start_link return value against {:error, {:already_started, pid}} and returns the pid of the process that is still alive. If the process does not exist, the DynamicSupervisor starts a new one and returns its pid.
This all works perfectly fine.
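For concreteness, the fetch-or-start pattern described above might look roughly like this. This is a hedged sketch, not the poster's actual code: StateManager, MyApp.Registry and MyApp.ManagerSup are placeholder names.

```elixir
defmodule ManagerStarter do
  # Placeholder :via naming scheme for the managed processes.
  def via(id), do: {:via, Registry, {MyApp.Registry, {:manager, id}}}

  # Returns the pid of a live manager for `id`, starting one if none exists.
  def fetch_pid(id) do
    case DynamicSupervisor.start_child(MyApp.ManagerSup, {StateManager, id}) do
      {:ok, pid} -> pid
      {:error, {:already_started, pid}} -> pid
    end
  end
end

defmodule StateManager do
  use GenServer, restart: :temporary

  def start_link(id) do
    GenServer.start_link(__MODULE__, id, name: ManagerStarter.via(id))
  end

  @impl true
  def init(id) do
    # Third tuple element is the idle timeout described in the post.
    {:ok, %{id: id}, :timer.minutes(5)}
  end
end
```

Calling `ManagerStarter.fetch_pid("e1")` twice returns the same pid, since the second `start_child` fails with `{:already_started, pid}`.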

Now, the question part. The client calls this pid-fetching function, which calls GenServer.start_link, which in turn constructs the process name and fetches the pid (of an existing process or a newly started one). The next line in the client code uses this pid to invoke a business function (via GenServer.call) that ends up modifying the managed state. Since an existing process can have an arbitrary amount of time left until its timeout, it is possible that it times out precisely between those two lines of code (an extremely rare situation, but it may happen), in which case the call will exit with:

** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started.

How should I deal with this probability in general? Should I trap this exit? What is the best practice to handle this?

Thanks

There are multiple solutions to this problem, but it depends on what you want to do.

You could simply call GenServer.call, and if you catch a "no process" error, then start_link and call again.
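A minimal sketch of that catch-and-retry idea, assuming the app already has some lookup/start function (passed in here as fetch_fun, a placeholder, not anything from the original post):

```elixir
defmodule SafeCall do
  # fetch_fun stands in for the existing registry-lookup/start function.
  # Catches the :noproc exit once and retries with a freshly fetched pid.
  def call(fetch_fun, id, request, retried? \\ false) do
    pid = fetch_fun.(id)

    try do
      GenServer.call(pid, request)
    catch
      :exit, {:noproc, _} when not retried? ->
        # The process timed out between the lookup and the call:
        # fetch (and thereby restart) it once more, then retry.
        call(fetch_fun, id, request, true)
    end
  end
end
```

If the second attempt also hits a dead process, the exit propagates normally, so a persistent failure still crashes the caller.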

We need more information on your state: why is it temporary, and why will you let it time out at some point?

Note that a timeout will not terminate your process unless you do it explicitly.

I am aware the process wouldn't terminate by itself. That is why I have a handle_info function that receives :timeout and returns {:stop, :normal, state}.
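That timeout-to-stop behavior can be sketched as follows (a minimal illustration, not the poster's code; the timeout is shortened here so the effect is visible):

```elixir
defmodule IdleServer do
  use GenServer, restart: :temporary

  # Short timeout for illustration; the real value would be minutes.
  @idle_timeout 100

  @impl true
  def init(state), do: {:ok, state, @idle_timeout}

  # The timeout must be returned from every callback, otherwise the
  # inactivity clock is cancelled after the first handled message.
  @impl true
  def handle_call(:ping, _from, state), do: {:reply, :pong, state, @idle_timeout}

  # Fires only when no message has arrived within @idle_timeout ms.
  @impl true
  def handle_info(:timeout, state), do: {:stop, :normal, state}
end
```

Any call resets the clock; once the mailbox stays empty past the timeout, the process stops normally, which is exactly the window that creates the race discussed above.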

I use a GenServer to achieve sequential access for all change requests over a single instance of a certain data structure ("class") that otherwise gets persisted (by this very same module) in the database. A "manager" process gets instantiated for each such data structure instance, and more than one user (and more than one use case) can thus modify the state it controls. All state-modifying logic is thus centralized in a single module, and all access to it is sequential.

The process is made temporary because I don't want it restarted in case of a real error (such as a failure writing to the db). It times out because it is set to keep this state in server-side memory for "some" time, so that it doesn't have to constantly load and convert tens or hundreds of KB from JSONB to embedded schemas; the state is dropped only if the user is inactive for a while.

And yes, I would like to have the client code call the pid-fetching function once more if the process is terminated (in between the two lines of code), but I needed to know if somebody knew something I don't (for instance, I searched for a way to retrieve the time remaining until a process's timeout, but found nothing). I mean, this problem is so obvious it should be a frequent issue for anyone who uses a temporary GenServer with timeouts. Maybe there is already some design pattern to handle this.

I mean this problem is so obvious

I agree, but the problem exists because we use a Registry for our processes. I think there is no way to ensure that your process cannot stop between a fetch on the registry and a call to the process.
I believe you can only mitigate the problem, and that people either just go optimistic (ignore the problem because the timeframe is so small that it virtually never happens) or just try/catch the :noproc error.

Possible solutions

You could create a fetch_pid that executes a lookup in the registry and then asks (calls) the process whether it is alive (which will refresh the timeout), and finally returns the pid. Not very satisfying, and it adds more load on the system.
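One possible sketch of that lookup-then-ping approach (registry and message names are placeholders; the managed server is assumed to answer a :ping call, which, per the discussion above, also refreshes its idle timeout):

```elixir
defmodule LiveLookup do
  # Looks the key up in the registry, then confirms the process is
  # alive with a call (which resets its timeout) before returning it.
  def fetch_live_pid(registry, key) do
    case Registry.lookup(registry, key) do
      [{pid, _value}] ->
        try do
          :pong = GenServer.call(pid, :ping)
          {:ok, pid}
        catch
          # The process died between the lookup and the call.
          :exit, {:noproc, _} -> :error
        end

      [] ->
        :error
    end
  end
end
```

Note this only narrows the window rather than closing it: the process can still time out right after replying to :ping, which is why the answer calls it "not very satisfying".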

If you do not have many users yet, you could say it is premature optimisation and just reload your data. Or use an ETS cache without concurrency, so the data either exists or it doesn't, period. I personally use a queue for that instead of a dedicated process for each entity, but the data is copied back and forth between ETS and the worker.

Another solution, which is technically correct but heavy, is to use a synchronous registry and have the registry handle stopping your temporary processes: on timeout, the temporary process sets a flag idle: true and tells the registry "I am idle" (with a cast). Then the registry will call the process, asking "still idle?", and if yes, terminate it. But it is so heavy I would not do that.

Use a pool: instead of temporary processes that each handle one entity, have N workers that handle multiple entities, hash your entity to a worker with erlang:phash2/1, and have each worker manage the memory cleanup for its own entities.
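The hashing step of the pool idea can be sketched like this (a minimal illustration with placeholder names; it uses the two-argument :erlang.phash2/2, which hashes directly into a 0..N-1 range):

```elixir
defmodule EntityPool do
  @pool_size 8

  # The same entity id always maps to the same worker index, so all
  # requests for one entity are serialized through one worker.
  def worker_for(entity_id) do
    index = :erlang.phash2(entity_id, @pool_size)
    {:via, Registry, {MyApp.Registry, {:pool_worker, index}}}
  end

  # The worker receives the entity id alongside the request, since it
  # multiplexes many entities and manages their cleanup itself.
  def call(entity_id, request) do
    GenServer.call(worker_for(entity_id), {entity_id, request})
  end
end
```

Because the N workers are permanent, the "process died between lookup and call" race disappears entirely; the trade-off is that each worker must now evict idle entity state on its own.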

I also have an Elixir library called mutex. There is a feature I would add to it: when a process locks a key that was previously locked, the new process can inherit data from the previous owner. I can add the feature if you want. So you lock the key, inherit the data (or load it from the database if it expired), do your work, and release the key with the new data. But again, your hundreds of KB of data would be copied multiple times in memory.

In the end, a try/catch is the simple thing to do.

Thanks for the elaborate reply. I will use try/catch because it's the lightest of all the approaches, and I don't want to take a path that partly replicates the original intent of the whole Erlang multiprocessing paradigm.
I just needed to know if I was missing something (I am new to Elixir and Erlang, but am a seasoned software architect; I used to build enterprise systems on the then state-of-the-art middleware in the late 90s and early 00s), and I find both Erlang (from an architectural perspective) and Elixir (from a language perspective) fantastic. Actually, when I first laid my hands on them a couple of months ago, it took me about a week to realize I would probably never use anything else (ok, maybe Rust for real-time stuff). And with LiveView and everything, the prospects of getting rid of JS bloatware are getting brighter each day.
Thanks again.

Glad that you found a new home !

Also, I forgot to mention: if the code that calls your registry and then your worker is supervised, and could be restarted issuing the same command, you could also just let the no-process error crash :slight_smile:

Ah yes, the famous "let it fail" paradigm (which I've long been in favor of).
But in this case, I am not sure whether that will always hold for the client code, and I like to patch the "holes" as I go, especially the ones I can tell upfront will be difficult to hit by accident (except for the end users, who will, by Murphy's law, be getting plenty of them). I believe this really is a case for try/catch and retry once more (naturally, just for the :noproc exit).

I feel it is the same: try/catch-and-retry is the same as crashing the process with :noproc and having it be restarted. In both cases your code will fetch the pid again and call your worker (all at once using :via).

The latter will obviously re-run more code, but very rarely, and it keeps the codebase clean.

If you are, for example, in a controller, you'd rather try/catch, because the process will not be restarted and would just send a 500 back. But if you are, for example, in a queue worker, the worker would be restarted on failure, and there I'd let it crash. It depends on where you are :slight_smile:

Exactly