Websocket not connecting and page in infinite reload on AWS

I’m working on a proof of concept application to introduce Elixir and Phoenix at my company, and I’m running into a problem with deploying it to our AWS environment.

I’m building the application into a Docker image that deploys to AWS Fargate. There are two instances running on Fargate, with an Application Load Balancer and Web Application Firewall (WAF) in front. The application is configured to use HTTPS all the way through. The image builds and runs as expected on my local machine.

When I deploy to AWS, a static page with no websockets works as expected. When I visit a page that does have websockets, the page cycles through the same series of events in an infinite loop.

  • initial page loads with a 200
  • CSS/JS/images/fonts load with 200s
  • client calls the websocket endpoint with the phx_join message
  • server responds to the websocket call with phx_reply, and the response is {response: {reason: "stale"}, status: "error"}
  • page reloads and cycle begins again…

This is my current endpoint config in runtime.exs:

config :application_name, ApplicationNameWeb.Endpoint,
  server: true,
  url: [host: host, port: port, scheme: scheme],
  http: [
    # Enable IPv6 and bind on all interfaces.
    # Set it to {0, 0, 0, 0, 0, 0, 0, 1} for local network only access.
    # See the documentation on https://hexdocs.pm/plug_cowboy/Plug.Cowboy.html
    # for details about using IPv6 vs IPv4 and loopback vs public addresses.
    ip: {0, 0, 0, 0, 0, 0, 0, 0}
  ],
  secret_key_base: secret_key_base,
  check_origin: [
    "//asi-app-name-dev-alb-719194575.us-east-1.elb.amazonaws.com",
    "//fargate.asi-dev.cld.company.com/context-path/",
    "//fargate.asi-dev.cld.company.com/"
  ]

And in prod.exs:

config :application_name, ApplicationNameWeb.Endpoint,
  cache_static_manifest: "priv/static/cache_manifest.json",
  https: [
    port: 443,
    otp_app: :application_name,
    cipher_suite: :strong,
    keyfile: "priv/ssl/private/selfsigned.key",
    certfile: "priv/ssl/certs/selfsigned.crt",
    # Allow self-signed certificates
    verify_fun: {&CertUtil.verify_fun_selfsigned_cert/3, []}
  ],
  static_url: [path: "/context-path"],
  force_ssl: [hsts: true, host: nil]

Things I’ve tried so far:

  • Dropping from two instances to one - no change
  • Checking WAF logs - doesn’t look like any requests are getting caught there
  • Talked to our DevOps team about using a Network Load Balancer instead of the Application Load Balancer - both support web sockets
  • Tweaked the values in check_origin and double- and triple-checked them against the ALB URL and the deployed URL - everything seems right
  • Adding a function to allow self-signed certificates, since we use one in the Docker image itself, following the instructions here (a rough sketch of that helper is below). I think this resolved an earlier handshake error that was being logged on the server, although I was trying so many things that night that I’m not sure anymore. The error message was “TLS :server: In state :hello at tls_record.erl:558 generated SERVER ALERT: Fatal - Unexpected Message”, and it hasn’t shown up in the logs in the last week.
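
For reference, the helper wired into verify_fun above is roughly shaped like this - a sketch based on the common self-signed-cert workaround, not necessarily the exact module in our image:

defmodule CertUtil do
  # Standard :ssl verify_fun shape: accept a self-signed peer certificate,
  # fail on any other bad cert, and pass extensions and valid certs through.
  def verify_fun_selfsigned_cert(_cert, {:bad_cert, :selfsigned_peer}, state), do: {:valid, state}
  def verify_fun_selfsigned_cert(_cert, {:bad_cert, _} = event, _state), do: {:fail, event}
  def verify_fun_selfsigned_cert(_cert, {:extension, _}, state), do: {:unknown, state}
  def verify_fun_selfsigned_cert(_cert, :valid, state), do: {:valid, state}
  def verify_fun_selfsigned_cert(_cert, :valid_peer, state), do: {:valid, state}
end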

The other strange thing in the logs is that I’m seeing this message over and over again:

TLS :server: In state :certify received CLIENT ALERT: Fatal - Unknown CA

I’m seeing it spamming the logs even after I’ve navigated away from the page, which makes me think it’s not related to this issue, but tossing it out there just in case.

I feel like I must be missing something small, but I’m not sure where to look next. I’ve really enjoyed studying Elixir/Phoenix and appreciated this forum as I’m learning. Any ideas would be welcome, as this deployment is a critical step for bringing Elixir into my company, and I will be so excited if that happens. :slight_smile:

Thanks

Hope this helps

  • AWS side: Offload TLS to the ALB in front of Fargate (use ACM to issue a certificate), and use plain HTTP to talk to the app
  • Elixir side: Drop the TLS bits in favour of TLS termination offloaded to the ALB
  • Elixir side: In check_origin, keep only the public URL’s host - no paths, and drop the entry for your internal ALB (see the sketch below)
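
Concretely, something along these lines - the host is a guess based on what you posted, so adjust it to whatever the public hostname actually is:

config :application_name, ApplicationNameWeb.Endpoint,
  # Scheme-relative public host only: no paths, no internal ALB hostname.
  check_origin: [
    "//fargate.asi-dev.cld.company.com"
  ]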

No insight into why you had TLS termination done on the Elixir side at all, but this can be worked around to achieve dev/prod parity by using an Nginx sidecar outside of prod, which handles termination.

Hi, thanks for the ideas. My company’s non-functional requirements specify that we have to use SSL from end to end, so I don’t think I can implement those suggestions in this environment.

But I’m not familiar with nginx sidecar, so I can look into that and see if it’s an option.

In that case it would be prudent to deploy with the actual certs locally and test with openssl against your app on port 443, to see whether the cert is actually picked up, and then make sure everything in AWS is configured to be dumb and doesn’t attempt to do things such as terminating TLS for you, etc. Good luck!
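
If it’s easier than juggling openssl flags, roughly the same check can be done from iex - a sketch that assumes the app is listening on localhost:443 inside the container:

# Connect to the HTTPS endpoint and inspect the certificate it actually serves.
# verify: :verify_none is deliberate: we only want to see the cert, not validate it.
{:ok, socket} = :ssl.connect(~c"localhost", 443, [verify: :verify_none], 5_000)
{:ok, der_cert} = :ssl.peercert(socket)
# Decode the DER certificate to eyeball the subject/issuer and confirm it is
# the cert you expect the endpoint to be serving.
der_cert |> :public_key.pkix_decode_cert(:otp) |> IO.inspect(label: "served cert")
:ssl.close(socket)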

FWIW you might have to use an NLB to keep the traffic TCP-only until it hits the target.

This scenario is very similar to your current setup: AWS Network Load Balancer SSL passthrough - Stack Overflow

I am way out over my skis here. Thanks for helping me think this through. :slightly_smiling_face:

There are two SSL certs involved. When the Docker image is created, it uses OpenSSL to generate a self-signed certificate that I think gets used from the ALB back to the Phoenix app. When the Fargate cluster is created, an AWS certificate is generated and is used between the client and the ALB.

I re-checked the purely static page (no LiveView) and confirmed that it’s using the https endpoint. If I try to reach it on the http endpoint, I get a 307 Internal Redirect to the https endpoint, and then the 200. Based on this, I believe everything is OK with the certificates, and the TLS connection is being carried all the way through.

I did talk to someone from DevOps about using NLB yesterday. He steered me back towards ALB because the AWS documentation says it supports web sockets, and we can’t put a WAF in front of an NLB. But he also said he’s never configured an ALB with web sockets before, so I might need to open a support ticket with AWS to have them look at it.

I tried one more deployment with check_origin: false just to rule out any issue with the URLs I was passing in, and the behavior was exactly the same.
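
For completeness, that temporary diagnostic was just the following - it disables origin checking entirely, so it’s only useful for ruling things out, not something to leave in place:

config :application_name, ApplicationNameWeb.Endpoint,
  # Diagnostic only: accepts websocket connections from any origin.
  check_origin: false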

It sounds like all the bases are covered from an Elixir/Phoenix/LiveView perspective. I’ll open a ticket with AWS and see if there’s something more we can do in the ALB config. I’ll post an update if I learn anything helpful.

Thanks again for your help!

On AWS, ALBs do not guarantee persistent connections, and you can/will see connections breaking and reconnecting as the ALB scales in and out underneath you based on user traffic. You should use an NLB instead; an NLB supports TLS even though it’s L4, and it keeps connections persistent end to end. This might not be your current problem, but once you put real traffic through you would end up hitting it, so it’s best to start with an NLB and avoid having to figure out what’s breaking connections later.

Thanks. I’ll make sure I talk to the AWS rep about that also.

Usually end-to-end encryption means from the client/user to the SSL-terminating endpoint, which in this case looks like the ALB. For most applications, routing the internal traffic over a VPC will suffice for security reviews if you have a separate security team.

If your company is nuts about security and zero-trust, you may have to set up SSL or mutual SSL with a private CA, which is an entirely different ball of wax.

However, most of these things are a matter of negotiation and VPC + ALB termination should be fine for most applications.

I’m still working through this, but I’ve learned a couple of things that are good to know:

  1. With an Application Load Balancer, web socket connections are sticky by default, so you don’t have to tweak the stickiness settings. AWS Docs for reference

  2. Per the AWS support rep, scaling of the ALB shouldn’t cause issues with dropped connections:

When ELB decides to scale out, it will only remove nodes when there are no more active connections on that node. When new nodes are created, the old node will no longer receive any new connections. It will wait for all current ongoing connections to complete before the node is terminated.

I think I’ve seen this in action while deploying updates to my code. Sometimes it takes a long time for the previous nodes to be terminated when the new ones are coming up. This would make sense if those nodes still have active connections.

  3. I learned that the LiveView JavaScript triggers a page reload when it gets an error message from the server with the reason stale. source code

And I found this note explaining the behavior in the LiveView Channel source code:

  defp load_live_view(view) do
    # Make sure the view is loaded. Otherwise if the first request
    # ever is a LiveView connection, the view won't be loaded and
    # the mount/handle_params callbacks won't be invoked as they
    # are optional, leading to errors.
    {:ok, view.__live__()}
  rescue
    # If it fails, then the only possible answer is that the live
    # view has been renamed. So we force the client to reconnect.
    _ -> {:error, :stale}
  end

There are a few places in the server-side LiveView code that can return the stale error. Still digging through to figure out what is being added, dropped, or changed between the client and server to trigger the error.

I found the solution. It turns out that CloudFront doesn’t forward cookies by default. The CDK construct I was using should have turned that behavior on, but it was using a deprecated property that wasn’t being executed correctly.

I went into the AWS Console and manually updated the CloudFront origin behavior to forward cookies, and the web socket began to work.
