Our Elixir application has a particular GenServer that all requests to particular API endpoints go through. Given that, we carefully designed the GenServer to ensure that all of its operations are very fast. We limited it to simple operations that, from what I understand, are all constant-time:
- `Map.pop`, `Map.get`, `Map.put`
- `Process.monitor`
- `Supervisor.start_child` (I know this could be a long operation if the child's `init/1` callback takes a long time, but it only initializes a state struct, sends itself a message for follow-up init, and sets `Logger` metadata)
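For reference, the hot path is shaped roughly like this (a sketch — the module, function, and supervisor names here are made up for illustration, not our actual code, and the supervisor is assumed to be `:simple_one_for_one`):

```elixir
defmodule MyApp.RequestTracker do
  use GenServer

  # Client API — every request to the relevant endpoints goes through here.
  def track(server, id), do: GenServer.call(server, {:track, id})

  @impl true
  def init(_opts), do: {:ok, %{by_ref: %{}, by_id: %{}}}

  @impl true
  def handle_call({:track, id}, _from, state) do
    # Start a per-request worker, then monitor it so we can clean up on exit.
    {:ok, pid} = Supervisor.start_child(MyApp.WorkerSup, [id])
    ref = Process.monitor(pid)

    {:reply, {:ok, pid},
     %{state | by_ref: Map.put(state.by_ref, ref, id), by_id: Map.put(state.by_id, id, pid)}}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    # Constant-time cleanup when a monitored worker goes down.
    {id, by_ref} = Map.pop(state.by_ref, ref)
    {:noreply, %{state | by_ref: by_ref, by_id: Map.delete(state.by_id, id)}}
  end
end
```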
Recently, we’ve been noticing that calls to this GenServer can sometimes be quite slow, to the point where they sometimes time out (with the default 5-second timeout). We’ve been trying to identify the source of the slowness for over a week (by adding additional logging and by inspecting a live node using the observer and recon) and thus far have been stumped. Here’s what we’ve learned so far:
- The GenServer’s message queue is not getting backed up. At one point I put something in place that was logging the queue length once per second and we never saw it go above 3. According to our logs, this GenServer only handles about 10-20 messages/second at peak, which is in line with what we would expect given our level of traffic (which is not particularly high).
- The GenServer’s `handle_call` and `handle_info` callbacks generally return pretty fast (mostly < 1 ms), although sometimes they take 100-200 ms.
- When calls to the GenServer get slow or time out, there’s usually a multi-second delay between the client sending the message and the server’s `handle_call` function getting called.
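One way we’ve been thinking about measuring that delay directly (a sketch, assuming we can change the call sites — `track/2` is a hypothetical client function): embed a monotonic timestamp in the call message and log the delta when the server dequeues it.

```elixir
require Logger

# Client side: attach the send time to the request message.
def track(server, id) do
  GenServer.call(server, {:track, id, System.monotonic_time(:millisecond)})
end

# Server side: log how long the message sat between send and handle_call.
def handle_call({:track, id, sent_at}, _from, state) do
  delay_ms = System.monotonic_time(:millisecond) - sent_at

  if delay_ms > 100 do
    Logger.warning("call for #{id} waited #{delay_ms} ms before handle_call")
  end

  {:reply, :ok, state}
end
```

This at least confirms where the time goes, though it doesn’t by itself explain why.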
That delay seems to be the source of our problem, but since the message queue isn’t getting backed up I don’t really understand what could be causing it. Two ideas I’ve had are:
- A NIF blocking the scheduler. We were using jiffy for our JSON encoding (since it’s noticeably faster than poison for large payloads), and, AFAIK, it’s the only NIF-based package we are using. I swapped it out for poison to see if removing it would solve the issue and it did not.
- Garbage collection. I’m not sure how to measure this to confirm or rule it out, but I’d be very surprised if it were the root cause. This GenServer only deals with thousands (< 10K) of 15-character string IDs, pids, and refs, plus the maps that hold them. So there aren’t any really big data structures to get GC’d.
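One thing we’re considering for checking both hypotheses at once: `:erlang.system_monitor/2` can report long GC pauses and long scheduler occupation (the latter would catch a misbehaving NIF or BIF) for any process on the node. A minimal sketch, with guessed 100 ms thresholds:

```elixir
defmodule SysMon do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil)

  @impl true
  def init(nil) do
    # Ask the runtime to send us a message whenever any process spends
    # more than 100 ms in GC, or occupies a scheduler for more than 100 ms.
    :erlang.system_monitor(self(), [{:long_gc, 100}, {:long_schedule, 100}])
    {:ok, nil}
  end

  @impl true
  def handle_info({:monitor, pid, event, info}, state) do
    # `event` is :long_gc or :long_schedule; `info` includes timings.
    IO.inspect({pid, event, info}, label: "system_monitor")
    {:noreply, state}
  end
end
```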
Beyond that, we don’t have any ideas for what could be going on. Does anyone have suggestions? What additional troubleshooting steps can we take to identify the delay between a client sending a message and the GenServer acting on it?
Thanks!