Process Monitor in Fargate

Hello, everybody. (It’s my first post!)

I have a Plug Cowboy (no Phoenix) application with two endpoints: one receives a POST, builds a payload, and sends an HTTP request to a service; the other receives a POST back from that service with the results. I’d like to figure out a way to monitor each request I make, so that if the service never responds, I can handle it. What’s the best way to do that? A GenServer that holds the request id? Then when I receive the postback, I can check it against the pending requests and remove the one with the associated id, or, if the service never responds, make a repeat call?
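Roughly what I have in mind, as an untested sketch (the module name, timeout, and retry handling are all placeholders):

```elixir
defmodule RequestMonitor do
  @moduledoc "Tracks outstanding requests by id and times out ones that never get a postback."
  use GenServer

  @timeout :timer.minutes(5)

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, Keyword.put_new(opts, :name, __MODULE__))
  end

  # Called when we fire off the outbound request.
  def track(request_id), do: GenServer.call(__MODULE__, {:track, request_id})

  # Called by the postback endpoint when the service responds.
  def resolve(request_id), do: GenServer.call(__MODULE__, {:resolve, request_id})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:track, id}, _from, pending) do
    # Schedule a timeout message for this request id.
    timer = Process.send_after(self(), {:expired, id}, @timeout)
    {:reply, :ok, Map.put(pending, id, timer)}
  end

  def handle_call({:resolve, id}, _from, pending) do
    case Map.pop(pending, id) do
      {nil, pending} ->
        {:reply, :unknown, pending}

      {timer, pending} ->
        Process.cancel_timer(timer)
        {:reply, :ok, pending}
    end
  end

  @impl true
  def handle_info({:expired, id}, pending) do
    # The service never called back; retry or alert here.
    {:noreply, Map.delete(pending, id)}
  end
end
```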

Also: I am deploying this as a Docker image on Fargate, with at least five instances running at once.
What’s the best way to then “share the monitoring” from instance to instance?

Thanks!


If I understand correctly, what you are looking for is distributed tracing. There are a few libraries for that. I am working on OpenCensus, and there is an opencensus_plug integration already available for you.


To be honest, I have never heard of distributed tracing, so I’ll start reading… Thank you for the links!

First a question, then a suggestion:

  • Do you have to organize things this way? Attempting to coordinate multiple HTTP requests across multiple nodes introduces complexity that could be avoided simply by having the second server respond to the request with the result directly, rather than via an HTTP callback.
  • If you do, then you’re going to need some sort of coordination mechanism. The simplest is probably a shared queue of some kind where you dump the responses from the second POST; each server then looks at them all and either handles the ones tied to a client it is connected to or drops the rest.

Unfortunately I do have to organize it that way. The service I call doesn’t belong to me and only uses postbacks.

My question is exactly what you’re saying: how do I create that “coordination mechanism”, and what does it look like? Setting up a whole DB just for that seems like overkill. I would like to take advantage of Elixir’s process monitoring, but I don’t know an elegant way to make that work across instances.

Your problem is a many-to-many mapping problem since you’ve got n servers waiting for callbacks that may come in to any of those n servers.

There are various ways to solve it by distributed communication and distributed state management, but I find them to be more trouble than they’re worth for simple cases (others may disagree). I find this to be particularly true in containerized environments where ephemeral servers are the norm because such mechanisms rely on server identity and this bumps up against the pets-vs-cattle model that is prevalent in the container world.

That was why I suggested a shared queue that all the clients listen to, simply dropping messages they don’t care about. You can keep some node-specific state that maps request ids to clients to decide whether you care; perhaps an ETS table or a GenServer that holds the connections. This approach only scales so far, but up to double-digit servers it should be fine.
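A minimal sketch of that node-local filter, assuming an ETS table keyed by request id (the names and message shape are made up for illustration):

```elixir
defmodule PostbackFilter do
  @table :my_requests

  # Run once at startup on each node.
  def init, do: :ets.new(@table, [:set, :public, :named_table])

  # Record ownership when this node makes an outbound request.
  def claim(request_id, client_pid), do: :ets.insert(@table, {request_id, client_pid})

  # Every node calls this for every queue message; non-owners drop it.
  def handle(%{"request_id" => id} = message) do
    case :ets.lookup(@table, id) do
      [{^id, client_pid}] ->
        :ets.delete(@table, id)
        send(client_pid, {:postback, message})
        :ok

      [] ->
        :ignored
    end
  end
end
```

Whichever node claimed the request handles the postback; every other node just ignores it.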

I’m assuming that you’re on Amazon since you mentioned Fargate, so I might look at Amazon SQS for a dead-simple queue that has an HTTP API.
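If you go that route, a rough polling loop using the ex_aws / ex_aws_sqs packages might look like the following. This is untested; the queue name and payload shape are assumptions, and note that SQS delivers each message to one consumer at a time, so a node that doesn’t own the request should return the message to the queue rather than delete it:

```elixir
defmodule QueuePoller do
  @queue "postback-results"  # hypothetical queue name

  # Run this in its own process; it long-polls forever.
  def poll do
    {:ok, %{body: %{messages: messages}}} =
      ExAws.SQS.receive_message(@queue, wait_time_seconds: 20, max_number_of_messages: 10)
      |> ExAws.request()

    for msg <- messages do
      payload = Jason.decode!(msg.body)

      case PostbackFilter.handle(payload) do
        :ignored ->
          # Not ours: make it visible again so another node can claim it.
          ExAws.SQS.change_message_visibility(@queue, msg.receipt_handle, 0)
          |> ExAws.request()

        :ok ->
          ExAws.SQS.delete_message(@queue, msg.receipt_handle) |> ExAws.request()
      end
    end

    poll()
  end
end
```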

I totally understand where you’re coming from, I really do. I was just hoping there would be a distributed communication system that wasn’t more trouble than it’s worth. I am actively trying to reduce our reliance on outside systems such as SQS.
If you have any ideas on distributed communication and distributed state management, I would most certainly appreciate them; to be clear, this is the direction in which I am taking the project.

I don’t have a good answer for you, but the first problem you might want to look at is dynamic node discovery and tracking inside Fargate. Nodes will come and go, and you’ll need some mechanism for connecting them with each other. The ECS Service Discovery mechanism [ https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-discovery.html ] might help there. I don’t know of anything that will do that for you OOTB, but others might have better suggestions.
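If Service Discovery gives you DNS records for your tasks, the libcluster library has a DNS polling strategy that can form the Erlang cluster from them. A sketch, where the service name and node basename are assumptions:

```elixir
# In your application's start/2 callback. The query is whatever
# Route 53 name ECS Service Discovery registers for the service,
# and nodes must be started with names like myapp@<ip>.
topologies = [
  fargate: [
    strategy: Cluster.Strategy.DNSPoll,
    config: [
      polling_interval: 5_000,
      query: "myapp.local",
      node_basename: "myapp"
    ]
  ]
]

children = [
  {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]}
]

Supervisor.start_link(children, strategy: :one_for_one)
```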

Once you can reliably track node addition and removal, you might be able to use that to dynamically add nodes to, and remove them from, a shared Mnesia cluster, which has an activity callback interface if memory serves. That might be all you need to respond to messages directed at a given node.
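Tracking membership from the BEAM side is the easy part. A minimal sketch using :net_kernel.monitor_nodes/1; what you do on nodeup/nodedown (Mnesia schema changes, state handoff, etc.) is up to you:

```elixir
defmodule NodeWatcher do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_) do
    # Subscribe to cluster membership changes.
    :ok = :net_kernel.monitor_nodes(true)
    {:ok, nil}
  end

  @impl true
  def handle_info({:nodeup, node}, state) do
    # e.g. add the node to the Mnesia cluster here.
    IO.puts("node joined: #{node}")
    {:noreply, state}
  end

  def handle_info({:nodedown, node}, state) do
    # e.g. reassign any requests that node owned.
    IO.puts("node left: #{node}")
    {:noreply, state}
  end
end
```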

Elixir is a very fine, sharp and shiny knife. I wish you well in your quest to eat soup with your knife. :smile:

I just watched this video [ https://www.youtube.com/watch?v=nLApFANtkHs ] which covers a lot of this area. Hopefully it helps. :+1:t2:


That talk hit so close to home! To be utterly imprecise about it, it just feels wrong that BEAM clustering and the container world haven’t yet found a happy medium.

For fear of sharing an idea before it’s finished incubating: I think I’m going to explore using a small EC2 instance to hold my process monitor, and Fargate for the application code itself. I can still scale the Fargate instances, but the GenServer that tracks the requests won’t need to be as elastic. That way the monitor can have a static hostname, and I can just send it messages from the various instances.

Now I just need to make it, and make it work.
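Roughly, I’m picturing the instances addressing a named GenServer on that static node, something like this (the hostname and module names are placeholders, and the nodes would need to share an Erlang cookie):

```elixir
defmodule MonitorClient do
  @monitor_node :"monitor@monitor.internal.example.com"  # assumed static hostname

  # Connect once at startup; returns true on success.
  def connect, do: Node.connect(@monitor_node)

  # Address the named RequestMonitor GenServer on the monitor node directly.
  def track(request_id),
    do: GenServer.call({RequestMonitor, @monitor_node}, {:track, request_id})

  def resolve(request_id),
    do: GenServer.call({RequestMonitor, @monitor_node}, {:resolve, request_id})
end
```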

What you’re describing is, basically, ZooKeeper [ https://zookeeper.apache.org/ ]. ZK, however, uses multiple nodes to provide resilience and continuity of service in the event of node failure. Without that, you end up injecting a single point of failure into your system, whereby loss of the single coordinator node leaves your other nodes unable to find each other.

Other things you might want to look at if you’re going to try to build this yourself include HashiCorp’s Serf, Memberlist, and Consul [ https://github.com/hashicorp/serf ] [ https://github.com/hashicorp/memberlist ] [ https://github.com/hashicorp/consul ].
