DaAnalyst
GenServer timeout
I use GenServer (restart: :temporary) for a module that manages some state and handles access to it. Its init function and its handle_* functions return the same timeout value. The GenServer-based state-managing processes get started by invoking GenServer.start_link and a :via tuple for the name.
I also have another module that uses DynamicSupervisor to supervise the aforementioned processes and a function to start them as its children.
When a process with a certain name already exists (because it has not timed out at the time of fetching its pid), the code pattern-matches the GenServer.start_link return value against {:error, {:already_started, pid}} and returns the pid of the process that is still alive. If the process does not exist, the DynamicSupervisor starts a new one and returns its pid.
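For readers following along, the setup described above could look roughly like the sketch below. The module names (EntityServer, EntitySupervisor, EntityRegistry), the 5-second idle timeout and the {:add, n} request are all made up for illustration; the original code was not posted.

```elixir
defmodule EntityServer do
  # Hypothetical state-holding process; restart: :temporary as in the post.
  use GenServer, restart: :temporary

  # Assumed idle timeout, returned from init and every handle_* callback.
  @idle_timeout 5_000

  def start_link(id), do: GenServer.start_link(__MODULE__, id, name: via(id))

  # Name registered through a Registry called EntityRegistry (assumed name).
  def via(id), do: {:via, Registry, {EntityRegistry, id}}

  @impl true
  def init(id), do: {:ok, %{id: id, value: 0}, @idle_timeout}

  @impl true
  def handle_call({:add, n}, _from, state) do
    state = %{state | value: state.value + n}
    {:reply, state.value, state, @idle_timeout}
  end

  # No message arrived within @idle_timeout: stop normally.
  @impl true
  def handle_info(:timeout, state), do: {:stop, :normal, state}
end

defmodule EntitySupervisor do
  use DynamicSupervisor

  def start_link(opts), do: DynamicSupervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts), do: DynamicSupervisor.init(strategy: :one_for_one)

  # Returns the pid of a live process for `id`, starting one if needed.
  def fetch_pid(id) do
    case DynamicSupervisor.start_child(__MODULE__, {EntityServer, id}) do
      {:ok, pid} -> pid
      {:error, {:already_started, pid}} -> pid
    end
  end
end
```

The Registry itself has to be running first, for example via Registry.start_link(keys: :unique, name: EntityRegistry) or as a child in the application's supervision tree.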
This all works perfectly fine.
Now, the question part. The client calls this pid-fetching function, which calls GenServer.start_link, which in turn constructs the process name and returns the pid (of an existing process or a newly started one). The next line in the client code uses this pid to invoke a business function (GenServer.call) that ends up modifying the managed state. Since an existing process can have an arbitrary amount of time left until its timeout, it is possible that it times out precisely between those two lines of code (an extremely rare situation, but it may happen), in which case the call will exit with:
** (EXIT) no process: the process is not alive or there’s no process currently associated with the given name, possibly because its application isn’t started.
How should I deal with this possibility in general? Should I trap this exit? What is the best practice to handle this?
Thanks
Marked As Solved
lud
I mean this problem is so obvious
I agree, but the problem exists because we use a Registry for our processes. I think that there is no way to ensure that your process cannot stop in between a fetch on the registry and a call to the process.
I believe you can only mitigate the problem and that people either just go optimistic (ignore the problem because the timeframe is soooo small that it virtually never happens) or just try/catch a no proc error.
Possible solutions
You could create a fetch_pid that executes a lookup in the registry, then asks (calls) the process if it is alive (which will refresh the timeout), and finally returns the pid. Not very satisfying, and it adds more load on the system.
If you do not have many users yet you could say it is premature optimisation and just reload your data. Or use an ETS cache without concurrency so the data either exists or not, period. I personally use a queue for that instead of a dedicated process for each entity, but the data is copied back and forth between ETS and the worker.
Another solution, which is technically correct but heavy, is to use a synchronous registry and have the registry handle stopping your temporary processes: on timeout, the temporary process sets a flag idle: true and tells the registry “I am idle” (with a cast). Then the registry calls the process, asking “still idle?”, and if yes, terminates it. But it is so heavy I would not do that.
Use a pool: instead of temporary processes that each handle one entity, have N workers that handle multiple entities, hash your entity to a worker with erlang:phash2/1, and have each worker manage the memory cleanup for its own entities.
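The hashing part of the pool idea could be sketched like this. The pool size of 8 and the entity_worker_N naming scheme are assumptions for the example. It uses :erlang.phash2/2, the two-argument variant that takes a range, rather than phash2/1 followed by rem.

```elixir
defmodule EntityPool do
  # Assumed pool size; tune to your workload.
  @default_pool_size 8

  # Maps any term to a stable index in 0..pool_size-1.
  def worker_index(entity_id, pool_size \\ @default_pool_size) do
    :erlang.phash2(entity_id, pool_size)
  end

  # Hypothetical naming scheme for the long-lived workers. Only pool_size
  # distinct atoms are ever created, so this does not leak atoms.
  def worker_name(entity_id, pool_size \\ @default_pool_size) do
    :"entity_worker_#{worker_index(entity_id, pool_size)}"
  end
end
```

Because the workers are permanent and named up front, a caller can GenServer.call(EntityPool.worker_name(id), request) without a lookup step, and the idle-timeout race goes away. The cost is that each worker has to expire stale entities itself.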
I also have an Elixir library called mutex. There is a feature that I could add to it: when a process locks a key that was previously locked, the new process can inherit data from the previous owner. I can add the feature if you want. So you lock the key, inherit the data (or load it from the database if it expired), do your work, and release the key with the new data. But again, your 100K of data will be copied multiple times in memory.
In the end, a try/catch is the simple thing to do.
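A minimal sketch of that try/catch, with one retry. SafeCall and its fetch_pid argument are hypothetical names; fetch_pid stands for whatever zero-arity function starts or finds the process. The one thing relied on here is that GenServer.call to a dead pid exits with {:noproc, _}.

```elixir
defmodule SafeCall do
  # Fetches a pid and calls it. If the process died between the fetch and
  # the call, fetches again, up to `retries` extra attempts.
  def call(fetch_pid, request, retries \\ 1)
      when is_function(fetch_pid, 0) and retries >= 0 do
    pid = fetch_pid.()

    try do
      {:ok, GenServer.call(pid, request)}
    catch
      # Only the "no process" exit is handled; other exits still propagate.
      :exit, {:noproc, _} when retries > 0 ->
        call(fetch_pid, request, retries - 1)

      :exit, {:noproc, _} ->
        {:error, :noproc}
    end
  end
end
```

Usage would be something like SafeCall.call(fn -> MySupervisor.fetch_pid(id) end, {:add, 1}), where MySupervisor.fetch_pid/1 is your own start-or-fetch function. On the retry that function starts a fresh process, so the second call normally succeeds.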
Also Liked
lud
Glad that you found a new home !
Also, I forgot to mention: if the code that calls your registry and then your worker is supervised, and could be restarted issuing the same command, you could also just let the no-process error crash.
lud
I feel it is the same: try/catch and retry is the same as crashing the process with noproc and having it be restarted. In both cases your code will fetch the pid again and call your worker (all at once using :via).
The latter will repeat more code obviously, but very rarely, and keep the codebase clean.
If you are, for example, in a controller, you’d rather try/catch, because the process will not be restarted and will just send a 500 back. But if you are, for example, in a queue worker, the worker would be restarted on failure, and there I’d let it crash. Depends on where you are.