Phoenix Channels in Distributed Systems

Hi,

I am currently testing a Phoenix application running on a system with multiple connected nodes. There are some observations I’d like to confirm and some questions I’d like to ask that I could not find documented anywhere.

Consider the case where requests are load balanced between multiple Phoenix applications on different (BEAM) nodes (via an external load balancer, e.g. Docker Swarm ingress load balancing or HAProxy):

  1. When using websocket transport, channel communication is performed via one persistent HTTP connection which the load balancer routes to one of the Phoenix instances. The life span of a “channel session” (which begins with join and ends with terminate) is identical to the life span of that connection. Correct? Does Phoenix persist any channel related state that spans multiple such connections?

  2. When using long polling, the channel communication is divided into many requests. If the load balancer routes these to different Phoenix instances, I see errors in my JS client. Once the Phoenix instances run on connected nodes, I have so far not been able to reproduce these problems. Does Phoenix delegate channel traffic to the correct process across node boundaries? So far it looks like it does, but I could not find this documented anywhere.

  3. Assuming 2. is the case, what happens when one of the nodes disconnects because it is stopped (e.g. during a rolling update deploy)? Is there anything my application needs to do to ensure connections terminate gracefully (particularly in the long polling case) so that the client can establish a new connection?

Any insights on these topics or links to resources covering these questions would be greatly appreciated.


:wave:

Please take these answers with a grain of salt since it’s been some time since I used Phoenix. But I thought I’d write them down before someone more knowledgeable comes around.


HTTP connection?

Just to be clear, WebSocket connections are initiated with an HTTP upgrade request, but they then operate over a plain TCP connection (though the same is true of HTTP itself).

The channel process lifespan (the “channel session”) is shorter than or equal to the lifespan of the Phoenix.Socket, which, as far as I know, corresponds to the lifespan of the Cowboy process handling the TCP connection. I’m not aware of any channel-related state that Phoenix persists across connections by default. You can persist the channel/socket state yourself, though.
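If you do want some channel state to survive across sessions, a sketch along these lines could work. MyApp.SessionStore here is a made-up storage module (ETS, Redis, the database, whatever fits your case), and I’m assuming a :user_id assign was set when the socket connected:

```elixir
# Sketch: persisting channel state yourself across channel sessions.
# MyApp.SessionStore is a hypothetical store, not part of Phoenix.
defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel

  def join("room:" <> room_id, _params, socket) do
    # Every join starts a fresh channel process, so previous state has to be
    # loaded explicitly here (assumes :user_id was assigned in UserSocket.connect/2).
    state = MyApp.SessionStore.get({socket.assigns.user_id, room_id}) || %{}

    socket =
      socket
      |> assign(:room_id, room_id)
      |> assign(:state, state)

    {:ok, socket}
  end

  def handle_in("update", payload, socket) do
    {:noreply, assign(socket, :state, Map.merge(socket.assigns.state, payload))}
  end

  def terminate(_reason, socket) do
    # Runs when the channel process shuts down (client leaves, socket closes, ...).
    # It is not guaranteed on brutal kills, so don't rely on it for critical data.
    MyApp.SessionStore.put(
      {socket.assigns.user_id, socket.assigns.room_id},
      socket.assigns.state
    )

    :ok
  end
end
```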


When using long polling, the channel communication is divided into many requests. If the load balancer routes these to different Phoenix instances, I see errors in my JS client. Once the Phoenix instances run on connected nodes, I have so far not been able to reproduce these problems. Does Phoenix delegate channel traffic to the correct process across node boundaries? So far it looks like it does, but I could not find this documented anywhere.

If the TCP connection is kept open (which is the default in HTTP/1.1), then no additional routing decision is required from the load balancer and all of your long-polling requests go to the same node; it works just like WebSockets. If the connection is closed (e.g. when there is a Connection: close HTTP header), then yes, the next request might be routed to another node in the cluster by the load balancer. Phoenix PubSub can route messages within the cluster with the help of pg2, though, which is probably what is happening when you say:

Once the Phoenix instances run on connected nodes, I have so far not been able to reproduce these problems.
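As far as I understand, the mechanism behind that is the regular PubSub broadcast, i.e. the same thing you use to push a message to a topic from anywhere in the cluster. A tiny sketch with the module names a freshly generated app would have (adjust to yours):

```elixir
# Broadcasting through the endpoint goes through Phoenix.PubSub (the pg2
# adapter in Phoenix versions of that era), so it reaches channel processes
# subscribed to "room:42" on every connected node, not just the local one.
MyAppWeb.Endpoint.broadcast("room:42", "new_msg", %{body: "hello"})
```

So even when an individual long-polling request lands on a different node than the process holding the session, the message can, I believe, still travel across the cluster this way, which would explain why the errors disappeared once your nodes were connected.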


Assuming 2. is the case, what happens when one of the nodes disconnects because it is stopped (e.g. during a rolling update deploy)? Is there anything my application needs to do to ensure connections terminate gracefully (particularly in the long polling case) so that the client can establish a new connection?

The client automatically establishes a new connection after a disconnect (at least with phoenix.js). Whether the node that the new connection ends up on has the previous channel state is mostly up to you, I think.
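For the rolling-update case, one thing you can do yourself is give the socket an id and broadcast the special "disconnect" event to it before a node goes down; phoenix.js then reconnects and the load balancer should send the new connection to a node that is still up. A rough sketch, assuming (purely for illustration) that the client passes a user_id in its connect params:

```elixir
defmodule MyAppWeb.UserSocket do
  use Phoenix.Socket

  channel "room:*", MyAppWeb.RoomChannel

  # Illustration only: a real app would verify a signed token here instead
  # of trusting a raw user_id from the client.
  def connect(%{"user_id" => user_id}, socket) do
    {:ok, assign(socket, :user_id, user_id)}
  end

  def connect(_params, _socket), do: :error

  # Giving the socket an id makes it addressable for broadcasts like the one below.
  def id(socket), do: "user_socket:#{socket.assigns.user_id}"
end

# E.g. from a deploy hook or an iex session on the node about to stop;
# the "disconnect" event closes the matching socket connections cluster-wide,
# and the clients then reconnect on their own.
MyAppWeb.Endpoint.broadcast("user_socket:123", "disconnect", %{})
```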


via an external load balancer, e.g. Docker Swarm ingress load balancing or HAProxy

I like HAProxy :+1:
