Design question for maintaining state

awochna · February 21, 2017, 4:13am

So, I have built a library to make HTTP requests with a circuit breaker (https://github.com/awochna/breaker).

The question I have is about maintaining the state and configuration of that breaker. I want to do this in a way that is both efficient and idiomatic. Currently, it looks something like this:

# Builds a new request circuit breaker for making requests to "http://example.com"
iex> Breaker.new(%{url: "http://example.com"})
%Breaker{headers: [], status: #PID<0.67.0>, timeout: 3000, url: "http://example.com"}

I understand that making the argument passed to Breaker.new/1 a Keyword List is probably more idiomatic than using a Map, but is returning a %Breaker{} struct something that would be idiomatic in this case?

The status property there is an Agent process and it holds the current counts and error threshold for determining the open or closed status for the breaker. Here, I chose an agent because it seemed like an efficient interface for holding state that would be constantly changing. The information I chose to store in the state of the agent process is only what is constantly updated or required to calculate the breaker’s status. All other, more “static” information (such as headers sent with every request, the default timeout, etc) are stored in the %Breaker{} struct.

Is this separation (using an Agent for storing the necessary and changing parts of the breaker, using a map for the unchanging parts) either idiomatic or necessarily more efficient?

In a way, since the Agent seems to be the limiting component of this system, it would make sense to say that the less state it needs to hold/compute the better. Then again, having all of the information for a single breaker in separate locations seems to make things unnecessarily complicated.

Would the whole system be better served if the library simply started a new GenServer that held all of these state components? That seems to be far more idiomatic, but could that slow things down too much? Should I try to keep all unnecessary calculations out of the GenServer callbacks so that the calling process is calculating as much as it can without potentially blocking other processes that need access to the breaker’s GenServer?

Any help for some of these questions would be greatly appreciated. As would any resources (books, screencasts, etc) about how to weigh these options or about Elixir application design.

aseigo · February 21, 2017, 9:32am

Since the breaker is shared across multiple (possibly concurrent?) requests, it makes sense to have it in its own process … be it an Agent or a GenServer. Both are pretty lightweight. If you think of processes as objects (instances-of), then your process is the object that contains the shared data. Fair enough …

An interesting question perhaps is whether the operations on that shared state need to be synchronized. So, if multiple concurrent HTTP requests are going on and they both want to update the breaker’s shared state, then the Agent (or whatever) does need to process the requests one at a time to serialize those requests properly. If that needs to happen synchronously, then it needs to block. If it can happen async, then it could be done by passing messages to the shared-state-holding process and handled one at a time there.

I haven’t looked at how breaker is implemented, but I can imagine various scenarios:

- a request asks the breaker if it should go ahead with the request … this is necessarily synchronous, but should just be returning a single value (e.g. an atom) and should be FAST … so probably no issue there in terms of resource contention.
- upon completion the request updates the shared state with its results … that could be async, which means that the breaker will still be available for use until it gets to processing that message. However, that means that if the breaker should change state, it is possible for a request asking for the breaker’s status to get the “wrong” value because the async message hasn’t been processed yet. On the other hand, that would probably give you the best resource sharing.

So it comes down to deciding how updates and reads to the shared state will occur (sync-blocking / async-non-blocking; updates in the calling process or in some other process …) and how that choice relates to predictability and resource contention in the system.
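To make the two scenarios concrete, here is a minimal sketch of a breaker process with a synchronous status check and an asynchronous result update. The module name `SketchBreaker`, its function names, and the miss-ratio threshold are all hypothetical, chosen for illustration; this is not how Breaker itself is implemented.

```elixir
defmodule SketchBreaker do
  # Hypothetical module for illustration only; not the Breaker library's code.
  use GenServer

  # --- Client API (runs in the calling process) ---

  def start_link(threshold) do
    GenServer.start_link(__MODULE__, threshold)
  end

  # Synchronous: the caller blocks until the breaker replies with a boolean.
  def open?(pid), do: GenServer.call(pid, :open?)

  # Asynchronous: returns :ok immediately; the breaker handles the message later.
  def count(pid, result) when result in [:hit, :miss] do
    GenServer.cast(pid, {:count, result})
  end

  # --- Server callbacks (run in the breaker process, one message at a time) ---

  @impl true
  def init(threshold) do
    {:ok, %{hits: 0, misses: 0, threshold: threshold}}
  end

  @impl true
  def handle_call(:open?, _from, state) do
    {:reply, tripped?(state), state}
  end

  @impl true
  def handle_cast({:count, :hit}, state) do
    {:noreply, %{state | hits: state.hits + 1}}
  end

  def handle_cast({:count, :miss}, state) do
    {:noreply, %{state | misses: state.misses + 1}}
  end

  # The breaker is open once the miss ratio exceeds the threshold.
  defp tripped?(%{hits: 0, misses: 0}), do: false
  defp tripped?(%{hits: h, misses: m, threshold: t}), do: m / (h + m) > t
end

{:ok, pid} = SketchBreaker.start_link(0.5)
before_miss = SketchBreaker.open?(pid)
:ok = SketchBreaker.count(pid, :miss)
after_miss = SketchBreaker.open?(pid)
```

In this single-caller usage the cast is always handled before the following call, because messages between one pair of processes arrive in order. The “wrong” value described above only shows up when the update and the status check come from different processes.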

THAT said … you’d really be surprised how fast these things are on the BEAM. And unless you are doing a LOT of concurrent fetches, this is unlikely to be a bottleneck either way. It would make sense to measure where time is being spent, and how much of the total run time is being spent in those shared state fetch/update calls. GenServers are not so slow unless you REALLY push them, at which point you can always drop down to spawn’ing functions manually. I doubt you’ll see much of a difference between a GenServer and an Agent in terms of performance unless you really push the system pretty hard. The one possibly nicer thing about a GenServer is you can wrap all the business logic of the shared state neatly inside of it, though I assume that is already done in Breaker so probably not much to gain there.

Side note: Agent supports both “server” (in the Agent process) processing of the shared state, and “client” side (in the calling process). The difference is that the “server” calls block the Agent, while the “client-side” calls do not … BUT, the client-side functions cause a COPY of the shared data to be made to the calling client. This can be expensive with lots of data, and it can make synchronizing changes a dicey affair.
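A small sketch of that difference, using only the standard `Agent` API. The state shape (`hits`, `misses`, `window`) is made up for the example and is not taken from Breaker.

```elixir
# Hypothetical state: two counters plus a larger list to make the copy visible.
{:ok, agent} =
  Agent.start_link(fn -> %{hits: 3, misses: 1, window: Enum.to_list(1..1000)} end)

# "Server" side: the function runs inside the Agent process. Only the small
# result is copied back to the caller, but the Agent is busy while it runs.
ratio_server = Agent.get(agent, fn s -> s.misses / (s.hits + s.misses) end)

# "Client" side: fetch the whole state (copied into the calling process),
# then compute here. The Agent is released sooner, but the copy is larger.
state = Agent.get(agent, & &1)
ratio_client = state.misses / (state.hits + state.misses)
```

The “dicey” part comes in when the client-side result is written back with a separate `Agent.update/2`: another process may have changed the state between the read and the write. Doing the read-modify-write inside a single `Agent.update/2` or `Agent.get_and_update/2` function avoids that.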

So … yeah … how is the performance of it right now?

awochna · February 21, 2017, 6:06pm

Thanks for the reply, @aseigo!

Regarding which calls are sync and async, you’re pretty much spot on.

The call to the Agent to see if the breaker is open (shouldn’t continue with the request) is sync and only returns a boolean, so it shouldn’t block for long. My benchmarks for this have an average time of about 7 microseconds.

Actually making the request is handled in a separate Task, allowing for multiple concurrent requests without the breaker really caring until the request returns. At that point, the Task sends a message back to the Agent to update its counts before returning the response to the user. The processing done by the agent here is sync, but still pretty fast at about 9 - 9.5 microseconds. You hit on the pros/cons I thought of in your second bullet.

Thanks for the analysis and explaining the performance differences I’d probably see between GenServer and Agent (probably none). And as for your side note, I didn’t think about having the client do the processing for updating the Agent’s state.

I decided to create some benchmarks for the full round trip using a local HTTParrot. Successive GET requests against a healthy site take about 574 microseconds. The same request against an unhealthy site takes about 25 microseconds.

You’re right, that’s a lot faster than I thought it would be. I think that might just be something I have to get used to, coming from the PHP/Ruby/Node.JS world.

It sounds like, then, using an Agent or a GenServer isn’t likely to make much of a difference. Is it more idiomatic to return a pid from the example above than a Map containing a pid for the purposes of tracking/supervising/naming? I think that’s the direction I’ll end up going for the 1.0.0 release as it seems to make more sense.

BogdanHabic · February 22, 2017, 10:21am

Hey,

@aseigo said it all, but I would like to add one more thing.

Instead of returning a pid to the caller, you can maybe return a binary() or even better, use the url (or the authority/domain) when calling the agent, and then use the Registry module (with :via tuples) to do the bookkeeping of the pids.

With this you don’t have to worry about your clients having a “stale” pid.
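A minimal sketch of that idea, assuming one Agent per URL and a unique-key `Registry`. The module `NamedBreaker`, the registry name, and the state shape are hypothetical; only `Registry`, `Agent`, and the `{:via, Registry, {registry, key}}` tuple form come from Elixir itself.

```elixir
defmodule NamedBreaker do
  # Hypothetical sketch: one Agent per URL, looked up by name through Registry.
  @registry NamedBreaker.Registry

  def start_registry do
    Registry.start_link(keys: :unique, name: @registry)
  end

  # The Agent registers itself under the URL instead of handing out its pid.
  def start_link(url) do
    Agent.start_link(fn -> %{misses: 0} end, name: via(url))
  end

  # Callers only ever pass the URL; the Registry resolves it to the current pid.
  def count_miss(url), do: Agent.update(via(url), &%{&1 | misses: &1.misses + 1})
  def misses(url), do: Agent.get(via(url), & &1.misses)

  defp via(url), do: {:via, Registry, {@registry, url}}
end

{:ok, _registry} = NamedBreaker.start_registry()
{:ok, _agent} = NamedBreaker.start_link("http://example.com")
:ok = NamedBreaker.count_miss("http://example.com")
miss_count = NamedBreaker.misses("http://example.com")
```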

aseigo · February 22, 2017, 12:12pm

Passing PIDs around is pretty common if the process does not otherwise have a name (aka is registered, as @BogdanHabic notes above), and quite safe unless that PID is held onto and it crashes … but that can also be OK, and even desired, in the case that you want a crash in that process to “bubble up” to its users. So, if your breaker process dies and that should cause the Tasks using that breaker to crash when it has crashed, then a PID is a fine choice.

If, however, the process will be restarted by a supervisor on crash, then it is indeed much safer (and even idiomatic) to refer to that process by a registered name.
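The restart case can be sketched like this, assuming a supervised Agent registered under the hypothetical name `:sketch_breaker`. The `RestartDemo` helper exists only to wait for the supervisor to finish the restart.

```elixir
defmodule RestartDemo do
  # Hypothetical helper: poll until the name points at a different, live pid.
  def wait_for_new_pid(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        wait_for_new_pid(name, old_pid)
    end
  end
end

children = [
  %{id: :breaker, start: {Agent, :start_link, [fn -> 0 end, [name: :sketch_breaker]]}}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

old_pid = Process.whereis(:sketch_breaker)
:ok = Agent.update(:sketch_breaker, &(&1 + 1))

# Simulate a crash; the supervisor starts a fresh process under the same name.
Process.exit(old_pid, :kill)
new_pid = RestartDemo.wait_for_new_pid(:sketch_breaker, old_pid)

# The name still works, although the state has been reset by the restart.
# The old pid is now stale and calls against it would exit the caller.
count_after_restart = Agent.get(:sketch_breaker, & &1)
```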

HTH

awochna · February 23, 2017, 6:40pm

Those do help, thanks guys!

I think that the GenServer route might be the way to go. I think people are going to want to supervise their breakers and give them names and it seems much easier to do that when all of the state is wrapped up in a single process.

I went ahead and converted Breaker on a different branch and ran some benchmarks to determine the speed difference between the two.

## BreakerAgentBench
manually recalculate the circuit status      0.11
count a timeout                              0.19
roll the health window                       0.2
count a miss                                 0.21
count a hit                                  0.22
ask if open                                  0.8
manually trip circuit                        0.83
manually reset circuit                       0.83

## BreakerBench
create a breaker                             1.03

## BreakerHealthyRequestBench
get request with breaker                     2.44
get request without breaker                  2.8

## BreakerTimeoutRequestBench
get request without breaker                  1.0
get request with breaker                     1.32

## BreakerUnhealthyRequestBench
get request with breaker                     1.54
get request without breaker                  3.35

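The first run was using Agent and the second was using GenServer. Each number is the second run’s time relative to the first, so values below 1.0 mean the GenServer version was faster.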

A lot of the lower level API calls (counting a response, rolling the health window, recalculating the status) benefited a lot from now being casts and happening async.

What’s surprising though, is that the lower level calls that still remained sync (asking if open, tripping, resetting) are at about 80% of the time it would take the Agent to do the same thing. However, making an actual HTTP request got slower, even though not a whole lot of that logic has really changed. I haven’t been able to figure out why yet and I’ll have to do some more digging and debugging, but I don’t think that’s part of this topic.

It’s certainly strange though, that even the requests without using the Breaker (just using HTTPotion directly) are still slower. The requests were made against a local, fresh HTTParrot to help mitigate network latency issues.