In my current PR, I’d like to allow for better control over the runner pod manifest. The current approach offers two ways of controlling it.
In the simpler case you just define env vars and resource requests/limits for the runner pods. The FLAME backend then creates the runner pod with these values set.
If you need more advanced features like pod affinity (e.g. running on GPU nodes), volumes, etc., you can implement a callback in which you build the runner pod manifest in your application and return it to the FLAME backend. The backend then adds some required env variables, sets/overwrites a few values like the pod name and container image, and finally applies the manifest to the cluster to create the runner pod.
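Roughly, the two styles could look like this. Note that the option names (:env, :resources, :runner_pod_manifest_fn) are illustrative only, not necessarily the library’s actual API:

# Hypothetical pool configuration; the backend option names shown here
# are illustrative, not necessarily the library's actual API.
children = [
  {FLAME.Pool,
   name: MyApp.Runner,
   min: 0,
   max: 5,
   backend:
     {FLAMEK8sBackend,
      # Simple case: only env vars and resource requests/limits.
      env: %{"LOG_LEVEL" => "debug"},
      resources: %{
        requests: %{cpu: "500m", memory: "512Mi"},
        limits: %{cpu: "1", memory: "1Gi"}
      },
      # Advanced case: build the full runner pod manifest yourself; the
      # backend then injects required env vars and overwrites fields
      # like the pod name and container image before applying it.
      runner_pod_manifest_fn: fn base ->
        put_in(base, ["spec", "nodeSelector"], %{"gpu" => "true"})
      end}}
]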
If the URL in the SA token is an IP address (not a FQDN), hostname verification fails with :verify_peer, as there is no way to verify the hostname against the cert. This is the case e.g. on my local Kind cluster…
So… I’ve created a PR that removes the insecure_skip_tls_verify option in favour of setting server_name_indication to :disable if KUBERNETES_SERVICE_HOST is an IP address (instead of a FQDN).
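The detection itself is straightforward with :inet.parse_address/1; a minimal sketch of the idea (not the PR’s literal code):

# Minimal sketch: disable SNI only when KUBERNETES_SERVICE_HOST is a
# literal IP address. Not the PR's literal code, just the idea.
host = System.fetch_env!("KUBERNETES_SERVICE_HOST")

sni_opts =
  case :inet.parse_address(String.to_charlist(host)) do
    # Literal IP: there is no hostname that SNI or hostname
    # verification could meaningfully use.
    {:ok, _ip} -> [server_name_indication: :disable]
    # FQDN: leave SNI and hostname verification enabled.
    {:error, _} -> []
  end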
However, I’d really like a “security audit” on this. I think this is as safe as it can be. I mean… no SNI, no hostname check. So we might as well disable it automatically, no?
Then again, I was surprised to see even AKS (Azure) setting KUBERNETES_SERVICE_HOST to a FQDN if and only if you add an annotation to your pod!
Maybe I should do something like Erlang does for verify_none: keep the option in place, but if it is not set and I’m defaulting server_name_indication to :disable, print a warning.
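Sketched out, that warning could look something like this (the option handling is illustrative; insecure_skip_tls_verify is the existing option name):

require Logger

# Illustrative sketch: keep insecure_skip_tls_verify around, but warn
# when we silently fall back to disabling SNI without the user having
# made an explicit choice.
if is_nil(opts[:insecure_skip_tls_verify]) do
  Logger.warning(
    "KUBERNETES_SERVICE_HOST is an IP address; disabling SNI and " <>
      "hostname verification for the API server connection"
  )
end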
Setting server_name_indication: :disable not only drops the SNI extension from the ClientHello message sent to the server, it also disables hostname verification altogether. So while the client still checks that the server presents a certificate issued by a trusted CA, it does not check whether we have reached the server we intended to reach. That’s arguably better than verify: :verify_none, but I think we can do better still?
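In :ssl option terms the difference looks like this (the CA file path is the standard in-pod service account location):

# verify_none: neither the certificate chain nor the hostname is
# checked - any certificate is accepted.
[verify: :verify_none]

# verify_peer with SNI disabled: the chain must still be signed by the
# trusted CA, but the client no longer checks which server the
# certificate was actually issued for.
[
  verify: :verify_peer,
  cacertfile: ~c"/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
  server_name_indication: :disable
]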
What identities does Kubernetes put in the certificate that the server presents, in the Common Name field of the Subject and in the subjectAltName extension? If the IP address appears anywhere and you connect with an IP address in the URL, then the default behavior of :ssl (without the :server_name_indication option) should be to try and match that IP.
One way to see which identities are being compared would be to pass the following :ssl option: customize_hostname_check: [match_fun: fn a, b -> IO.inspect({a, b}); :default end]
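Spelled out as a full option list, the debug hook looks like this:

# Debug sketch: print every {reference, presented} identity pair the
# hostname check sees, then fall back to the default matching rules.
ssl_opts = [
  verify: :verify_peer,
  cacertfile: ~c"/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
  customize_hostname_check: [
    match_fun: fn reference, presented ->
      IO.inspect({reference, presented}, label: "hostname check")
      :default
    end
  ]
]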
Now the IP Address seems to be a binary. But looking at the Erlang code, I think it’s expecting a charlist, no? length() and list_to_tuple() are list operations, no?
This has been bugging me for so long now (I’m also maintaining the k8s library). If this could be fixed, it would be awesome. WDYT @voltone? I can also open an Erlang issue for this.
That’s the correct encoding of an IPv4 address according to the X.509 spec. It gets decoded to a 4-tuple elsewhere during hostname verification.
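For example, 10.0.0.1 is stored in the certificate’s iPAddress field as its four raw bytes and only becomes a tuple during verification:

# As stored in the certificate: the raw bytes of the address.
raw = <<10, 0, 0, 1>>

# As used during hostname verification: decoded to a 4-tuple.
{10, 0, 0, 1} = raw |> :binary.bin_to_list() |> List.to_tuple()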
So it seems :ssl treats a string/charlist value in the first argument of :ssl.connect/3 as a hostname and tries to match it against the hostnames in the certificate. So unless the IP address also appears as dNSName: ~c"10.0.0.1" it is not going to match. If you call :ssl.connect/3 with a tuple as the first argument (e.g. {10, 0, 0, 1}) everything works as expected.
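Concretely (with a placeholder CA file, assuming a server certificate that only carries an iPAddress entry):

# Placeholder CA file; assumes the server cert only has an iPAddress entry.
opts = [verify: :verify_peer, cacertfile: ~c"/path/to/ca.crt"]

# Charlist host: treated as a hostname and matched against dNSName
# entries, so verification fails against an IP-only certificate.
{:error, _reason} = :ssl.connect(~c"10.0.0.1", 443, opts)

# Tuple host: matched against the iPAddress entries, as intended.
{:ok, _socket} = :ssl.connect({10, 0, 0, 1}, 443, opts)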
Now, you are not calling :ssl.connect/3 directly, your HTTP client library parses the URL and handles the connection establishment, so you can’t pass a tuple. Unless you want to propose upstream changes to the way the TLS connection is established when a URL has an IP address instead of a hostname, you could add the mapping to the hostname verification:
def custom_hostname_check({:dns_id, hostname}, {:iPAddress, ip}) do
case :inet.parse_address(hostname) do
{:ok, ^ip} -> true
_ -> :default
end
end
def custom_hostname_check(_, _), do: :default
And then select this function by passing customize_hostname_check: [match_fun: &custom_hostname_check/2].
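With the two-clause function above, the complete option list becomes (the CA path is again the in-pod service account location):

# Complete option list using the two-clause check from above. With
# Mint, for example, this could be passed through :transport_opts.
ssl_opts = [
  verify: :verify_peer,
  cacertfile: ~c"/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",
  customize_hostname_check: [match_fun: &custom_hostname_check/2]
]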
If you call :ssl.connect/3 with a tuple as the first argument (e.g. {10, 0, 0, 1}) everything works as expected.
True. That works.
you could add the mapping to the hostname verification
I see what you mean. Although this function won’t work, as :inet.parse_address(hostname) will return {10, 0, 0, 1} while ^ip is [10, 0, 0, 1]. But that’s solvable, e.g. like this:
def custom_hostname_check(ref_id, pres_id) do
  with {{:dns_id, hostname}, {:iPAddress, ip}} <- {ref_id, pres_id},
       {:ok, ip_tuple} <- :inet.parse_address(hostname),
       ^ip <- Tuple.to_list(ip_tuple) do
    true
  else
    _ -> :default
  end
end
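A quick IEx check confirms this version behaves as intended (identity values as observed above):

# Reference hostname that parses to the presented IP: match.
true = custom_hostname_check({:dns_id, ~c"10.0.0.1"}, {:iPAddress, [10, 0, 0, 1]})

# Anything else falls through to the default matching rules.
:default = custom_hostname_check({:dns_id, ~c"example.com"}, {:iPAddress, [10, 0, 0, 1]})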
However, I still think this should be “fixed” at some lower abstraction level. Maybe in Mint? Or even lower? I mean… OTP does accept IP addresses as charlists after all, no?
You could try and open an issue against OTP, arguing that ssl:connect/3 should recognise a binary/string representation of an IP address and handle it the same way as a tuple.
If that gets rejected, you could try Mint instead. In that case I will probably be asked to review the issue.
I have the change working on a local branch. If the OTP PR gets rejected or… ignored… I will push that. In any case, I’ll link this discussion and vice versa.
I have implemented the workaround in flame_k8s_backend for now until this issue is fixed in one of the lower layers. GH issues are open on OTP and Mint. Thanks a lot @voltone for your help and let’s take this discussion to GitHub now.
In version 0.4.1, released yesterday, I removed Req, the last dependency besides FLAME itself. Now FLAME can safely be used in Livebooks running on Kubernetes.
I have a question related to the startup times of new pods. I was reading the original FLAME article, and starting up and connecting a new machine takes about 3 seconds on Fly.io infrastructure.
Has anybody tried FLAME with the K8s backend on an AWS EKS cluster? How fast is the process of starting and connecting a new node?
I don’t think there are hard guarantees on what kind of hardware/orchestration they use behind the scenes, so it might be unpredictable depending on the region, hardware, and software versions they use.