Troubleshooting a slow GenServer

Our Elixir application has a particular GenServer that all requests to certain API endpoints go through. Given that, we carefully designed the GenServer to ensure that all of its operations are very fast. We limited it to simple operations that, from what I understand, are all constant-time (a rough sketch of the kind of callback we mean follows the list):

  • Map.pop, Map.get, Map.put
  • Process.monitor
  • Supervisor.start_child (I know this could be a long operation if the child’s init/1 callback takes a long time, but it only initializes a state struct, sends itself a message for follow-up init, and sets Logger metadata)
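For context, here is a rough sketch of the kind of callback we mean. It is illustrative only: MyApp.Gateway, MyApp.WorkerSupervisor, the :checkout message, and the :workers/:monitors state keys are stand-in names, not our actual code.

```elixir
# Illustrative only: module, message, and state-key names are stand-ins.
defmodule MyApp.Gateway do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts), do: {:ok, %{workers: %{}, monitors: %{}}}

  @impl true
  def handle_call({:checkout, id}, _from, state) do
    case Map.get(state.workers, id) do
      nil ->
        # Assumes a :simple_one_for_one supervisor; the child's init/1 only
        # builds a struct, sends itself a follow-up message, and sets Logger
        # metadata, so this call should return quickly.
        {:ok, pid} = Supervisor.start_child(MyApp.WorkerSupervisor, [id])
        ref = Process.monitor(pid)

        state = %{
          state
          | workers: Map.put(state.workers, id, pid),
            monitors: Map.put(state.monitors, ref, id)
        }

        {:reply, {:ok, pid}, state}

      pid ->
        {:reply, {:ok, pid}, state}
    end
  end
end
```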

Recently, we’ve been noticing that calls to this GenServer can sometimes be quite slow, to the point where they sometimes time out (with the default 5-second timeout). We’ve been trying to identify the source of the slowness for over a week (by adding additional logging and by inspecting a live node using observer and recon) and thus far have been stumped. Here’s what we’ve learned so far:

  • The GenServer’s message queue is not getting backed up. At one point I put something in place to log the queue length once per second (roughly the sketch after this list) and we never saw it go above 3. According to our logs, this GenServer only handles about 10-20 messages/second at peak, which is in line with what we would expect given our level of traffic (which is not particularly high).
  • The GenServer handle_call and handle_info callbacks generally return pretty fast (mostly < 1 ms), although sometimes they take 100-200 ms.
  • When the calls to the GenServer get slow or timeout, there’s usually a multi-second delay between the client sending the message and the server’s handle_call function getting called.
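A once-per-second sampler like the one we used can be as simple as this (a sketch, not our exact code; MyApp.QueueLengthLogger and MyApp.Gateway are stand-in names):

```elixir
defmodule MyApp.QueueLengthLogger do
  @moduledoc "Logs the target process's message queue length once per second."
  use GenServer
  require Logger

  def start_link(target), do: GenServer.start_link(__MODULE__, target)

  @impl true
  def init(target) do
    schedule()
    {:ok, target}
  end

  @impl true
  def handle_info(:sample, target) do
    # Silently skips the sample if the target isn't registered/alive right now.
    with pid when is_pid(pid) <- Process.whereis(target),
         {:message_queue_len, len} <- Process.info(pid, :message_queue_len) do
      Logger.info("#{inspect(target)} message_queue_len=#{len}")
    end

    schedule()
    {:noreply, target}
  end

  defp schedule, do: Process.send_after(self(), :sample, 1_000)
end

# e.g. MyApp.QueueLengthLogger.start_link(MyApp.Gateway)
```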

That delay seems to be the source of our problem, but since the message queue isn’t getting backed up I don’t really understand what could be causing it. Two ideas I’ve had are:

  • A NIF blocking the scheduler. We were using jiffy for our JSON encoding (since it’s noticeably faster than poison for large payloads), and, AFAIK, it’s the only NIF-based package we are using. I swapped it out for poison to see if removing it would solve the issue and it did not.
  • Garbage collection. I’m not sure how to measure this to confirm or rule it out (one idea is sketched after this list), but I’d be very surprised if it was the root cause. This GenServer only deals with thousands (< 10K) of 15-character string IDs, pids, and refs, plus the maps that hold them. So there aren’t any really big data structures to get GC’d.
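Here’s the idea, for the record: the VM’s system monitor can flag long garbage collections (and long schedules, which would also catch a blocking NIF). The 100 ms thresholds and the MyApp.Gateway name below are just illustrative.

```elixir
# From an attached console: ask the VM to message us whenever any process
# spends > 100 ms in a single GC, or is scheduled out for > 100 ms (the latter
# would also catch a NIF blocking a scheduler). Thresholds are arbitrary.
:erlang.system_monitor(self(), [{:long_gc, 100}, {:long_schedule, 100}])

receive do
  {:monitor, pid, event, info} ->
    IO.inspect({pid, event, info}, label: "system_monitor event")
after
  60_000 -> IO.puts("no long_gc/long_schedule events in the last minute")
end

# Per-process GC settings/stats for the GenServer itself
# (MyApp.Gateway stands in for the real registered name):
MyApp.Gateway
|> Process.whereis()
|> Process.info(:garbage_collection)
```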

Beyond that, we don’t have any ideas for what could be going on. Anyone have any ideas? What are some additional troubleshooting steps we can take to identify the delays between the clients sending the message and the GenServer acting on it?

Thanks!


Is the client local or talking over the network?
What are you running this service on: a VPS, dedicated hardware, an EC2 instance?


Are you measuring the run_queue as well? :erlang.statistics(:run_queue). If the run_queue is a high number, it means many processes want to run, so it takes time for this particular GenServer to have its turn. You can also measure the number of reductions in the process in case it is doing something that is computationally expensive.
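For example, from an attached console (where pid is bound to the GenServer’s pid; purely illustrative):

```elixir
# How many processes are currently waiting in the schedulers' run queues.
:erlang.statistics(:run_queue)

# Reductions are the VM's unit of work; sampling them twice gives a rough
# "work done per second" figure for this particular process.
{:reductions, r1} = Process.info(pid, :reductions)
Process.sleep(1_000)
{:reductions, r2} = Process.info(pid, :reductions)
r2 - r1
```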

Finally, if you have a map, especially a large map, be careful if you are doing any operation that requires enumerating the map, as it currently requires the map to be converted to a list first.
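To make the distinction concrete, assuming an illustrative workers map of id => pid:

```elixir
# Single-key operations: effectively constant-time, fine on a hot path.
Map.get(workers, id)
Map.put(workers, id, pid)

# Whole-map operations: these enumerate the map (currently via an intermediate
# list), so their cost grows with the size of the map.
Map.values(workers)
Enum.filter(workers, fn {_id, pid} -> Process.alive?(pid) end)
```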


Is the client local or talking over the network?

Local on the same BEAM node.

What are you running this service on: a VPS, dedicated hardware, an EC2 instance?

On a VM on dedicated hardware in our data center.


I didn’t know about :erlang.statistics(:run_queue) before now, so nope, I haven’t been measuring that. I’ll give that a shot. Reading the docs for :erlang.statistics/1, it mentions that :erlang.statistics(:run_queue) can be computationally expensive, but that :erlang.statistics(:run_queue_lengths) and :erlang.statistics(:total_run_queue_lengths) are cheaper. Should I take that to mean that while :erlang.statistics(:run_queue) is fine for one-off checks from an attached console, I shouldn’t leave it in place as a periodic check but should instead use one of the others?
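Concretely, I’m picturing something like this (untested sketch):

```elixir
# One-off from an attached console: a consistent snapshot, but more expensive.
:erlang.statistics(:run_queue)

# Cheaper variants that seem better suited to a periodic (e.g. once-per-second) check:
:erlang.statistics(:total_run_queue_lengths)  # a single integer
:erlang.statistics(:run_queue_lengths)        # a list, one entry per run queue
```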

Finally, if you have a map, especially a large map, be careful if you are doing any operation that requires enumerating the map, as it currently requires the map to be converted to a list first.

Yep. We’re careful about that :smile:


@josevalim’s suggestion to inspect the run_queue is what got us there: it turned out our server was just overloaded.


@myronmarston I know this is an old issue, but what was your solution to this? A bigger server?

Sorry, it’s been 4 years and I haven’t worked with Elixir in 3.5 years, so I honestly don’t remember!


A bigger server or more servers to distribute the load should do it!
