Folks,
My Phoenix LiveView app is meant to scale horizontally and keep access latency down by running distributed across as many Kubernetes clusters as need be. So far load balancing at the Kubernetes level has been enough, but courtesy of libcluster I’m ready to jump in with tweaks to make sure LiveView sessions get back to the same node where possible.
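For context, the in-cluster node discovery itself is the easy part; per cluster it’s essentially just a libcluster topology along these lines (the service and application names below are placeholders for my own) plus a Cluster.Supervisor child in the application’s supervision tree:

```elixir
# config/runtime.exs — a sketch only; "myapp-headless" and "myapp" are placeholder names
config :libcluster,
  topologies: [
    k8s: [
      strategy: Cluster.Strategy.Kubernetes.DNS,
      config: [
        service: "myapp-headless",   # headless Service that resolves to the pod IPs
        application_name: "myapp",   # node basename, i.e. myapp@<pod-ip>
        polling_interval: 10_000
      ]
    ]
  ]
```

That handles nodes finding each other *within* one cluster; my question is about everything between clusters.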
But when it comes to how my regional clusters should contact each other, there’s a whole array of ostensibly competing concepts and tools to consider, and it’s very confusing. Some of the terms, packages and complications in this space are:
- Service mesh tools: Istio, Linkerd, Meshx. What is a service mesh in an Elixir/Phoenix world anyway? I get services and microservices at the HTTP(S) API level, sure, even what José refers to in his article about Kubernetes and the Erlang VM, but how do they play out in practice in a Phoenix setting?
- Network connectivity. The inside of a Kubernetes cluster, depending on the container network interface (CNI) you use, makes extensive use of pretty standard networking principles and protocols to give each component its own private IP and DNS name. The cluster only links up with “real networks” through a select few ingress and egress points, which could (if that’s how you configure it) have public IPs or be reachable via public IPs through some load balancer or ingress controller. The whole DNS namespace (and potentially the IP addressing scheme) inside the cluster could be the same in all your clusters because their scope is that isolated. There are packages out there such as Submariner/Lighthouse that seem to take it upon themselves to provide secure (mutual TLS) networking between clusters, but Submariner, for example, is in early development and therefore not yet production ready, even if that is indeed the way to go.
- It’s also possible to simply use what ships with Phoenix and expose a TLS-secured API, routed from each cluster’s public (probably load-balanced) IPs, through which clusters ask and answer each other’s questions. One could probably implement some JWT or token exchange to manually achieve something akin to mutual TLS (see the Phoenix.Token sketch after this list).
- There are also those who state that after fooling around with the available options they ended up going with RabbitMQ for their purposes. Now, I know that as the Erlang clan we’ve got a built-in soft spot for RabbitMQ, but it’s such a generalised solution that I’m genuinely worried I’d have to adjust my intended solution to whatever queue types and topologies RMQ already offers, as it might be too hard to work my special distributed database and processing logic into RMQ.
- Apart from (or on top of) the gaggle of people who’d suggest using a common database as a communication method between parts of your (distributed) application, there are of course also those who suggest using Redis for that purpose. I’m old enough to have practically invented that approach myself, back when a common database was what I had to work with. For this application I don’t have a common database and don’t intend to deploy one for starters. Each region runs its own database and I’m building the layer that turns those into some sort of federated database at the application layer. That’s what the regions are “talking to each other” about.
- It’s also conceivable to rig up some MPLS or VPN-style solution, completely independent of both Kubernetes and Elixir/Erlang, which sets up secure connections between the networks each of the clusters considers its “external network”, so that they can connect at layer 2 without having to worry about security over public networks.
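To make the plain-Phoenix-API option above a bit more concrete, the signing and verifying could be as small as the sketch below. Note this is symmetric (a shared secret per region pair, so not true mutual TLS), and the module name, salt and max_age are all just illustrative:

```elixir
defmodule MyApp.RegionAuth do
  @moduledoc """
  Sketch of signing/verifying inter-region request payloads with Phoenix.Token.
  The salt and max_age are illustrative; the shared secret would come from each
  region's runtime config or secret store, not be hard-coded.
  """
  @salt "inter-region"
  @max_age 60

  # `secret` is a secret_key_base-style string shared by both regions
  def sign(secret, payload) when byte_size(secret) >= 20 do
    Phoenix.Token.sign(secret, @salt, payload)
  end

  def verify(secret, token) do
    Phoenix.Token.verify(secret, @salt, token, max_age: @max_age)
  end
end
```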
So, folks, please, talk to me. Help me make sense of my many half-options so I can settle on something that is production-ready now, won’t absorb all my cognitive capacity to set up and run, and slots easily into the Phoenix and LiveView idiom.
In my ideal world I end up with a simple yet safe way to
a) look up stored contact details for my application in a different region,
b) compile a request for the data I want,
c) sign and send the request off,
d) get a signed result which I’d
e.1) use and discard it (if I did a “fetch” call) or
e.2) cache and reuse it (if I made a “subscribe” call) in which case I could
f) expect a callback when the data changes, until
g) unsubscribe from changes when the data ages out of the cache.
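In code, I picture the requesting side looking vaguely like this. Every module it calls (RegionDirectory, RegionTransport, RegionCache, plus the RegionAuth sketch above) is a hypothetical placeholder for exactly the plumbing I’m asking about:

```elixir
defmodule MyApp.RegionClient do
  @moduledoc "Sketch of the fetch/subscribe flow; every module it calls is hypothetical."

  # a) look up the remote region's contact details, b)–d) compile, sign, send, verify
  def fetch(region_id, query, secret) do
    url = MyApp.RegionDirectory.url(region_id)              # a) stored contact details
    token = MyApp.RegionAuth.sign(secret, %{query: query})  # b) + c) compile and sign

    with {:ok, reply_token} <- MyApp.RegionTransport.post(url, token),  # c) send it off
         {:ok, data} <- MyApp.RegionAuth.verify(secret, reply_token) do # d) signed result
      {:ok, data}                                            # e.1) use and discard
    end
  end

  # e.2)–g) same round trip, but remember the result so later updates can find it
  def subscribe(region_id, query, secret) do
    with {:ok, data} <- fetch(region_id, Map.put(query, :subscribe, true), secret) do
      MyApp.RegionCache.put(region_id, query, data)          # e.2) cache and reuse
      {:ok, data}
    end
  end
end
```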
Similarly, the same application code on the other end of the call would:
a) Get a signed request for data from another node
b) Fetch the data and
c.1) Sign and send the results to the requestor if it’s a fetch request, or
c.2) Register the requestor’s subscription to that data and then sign and send the data
d) Monitor the database layer for changes to data that has active subscriptions and
e) Compile, sign and send data updates to all subscription holders.
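For d) and e) on the responding side, within a region I’d expect to lean on Phoenix.PubSub to fan changes out to whatever process tracks the remote subscriptions, something like the sketch below (the topic naming and MyApp.RegionPusher are invented):

```elixir
defmodule MyApp.SubscriptionFanout do
  @moduledoc """
  Sketch: one process per subscribed record listens for local change broadcasts
  and relays them to the remote subscription holders. Topic names and
  MyApp.RegionPusher are invented for illustration.
  """
  use GenServer

  def start_link(state), do: GenServer.start_link(__MODULE__, state)

  @impl true
  def init(%{record_id: record_id} = state) do
    # d) watch for changes to data that has active subscriptions
    :ok = Phoenix.PubSub.subscribe(MyApp.PubSub, "record:#{record_id}")
    {:ok, state}
  end

  @impl true
  def handle_info({:changed, new_data}, %{subscribers: subscribers} = state) do
    # e) compile, sign and send the update to every subscription holder
    Enum.each(subscribers, &MyApp.RegionPusher.push(&1, new_data))
    {:noreply, state}
  end
end
```

The database layer would then broadcast `{:changed, new_data}` on the record’s topic via Phoenix.PubSub.broadcast/3 whenever it writes, and the actual push to the remote region would travel over the same signed channel as the fetches.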
The plan would be to build a little magic into the way requests aggregate and updates propagate through what could end up being a rather big network of clusters, to reduce the point load on busy regions and build an effective self-healing/self-configuring multi-path solution, but that’s not code I’d dare implement until I have sufficient live regions to warrant the test rig it would require.
What is (in my view at least) far more trivial than what most of the packaged solutions try to cater for is that it’s one single application (with as many components as I want) running everywhere, all identical except for which region id they consider their own. I don’t have “foreign” applications, services, endpoints, processes or APIs to contend with. The only actual complications are network latency, capacity, and needing to use routable (IPv4) IPs as sparingly as possible.
P.S. For my networks I manage my own DNS zones.