HTTPS on ECS with Docker and Let's Encrypt, with no reverse proxy

Hi,

I’ve been desperately trying to make my Phoenix app work with HTTPS, and despite the huge amount of resources available about this on this forum and other websites, I haven’t yet been able to do so.

The website works just fine over HTTP on port 80, but I keep getting an ERR_CONNECTION_RESET when trying to access it over HTTPS. I am not using any reverse proxy, just Cowboy’s HTTP server.

Here is what I get when running curl https://www.allremotedevjobs.com/ --verbose:

*   Trying 3.104.92.194...
* TCP_NODELAY set
* Connected to www.allremotedevjobs.com (3.104.92.194) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to www.allremotedevjobs.com:443 
* stopped the pause stream!
* Closing connection 0
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to www.allremotedevjobs.com:443

I have no idea what this error means, and it is quite hard to find anything meaningful about it, but it is clearly linked to the ERR_CONNECTION_RESET in my browser. From what I can tell, SSL_ERROR_SYSCALL during SSL_connect means the server dropped the TCP connection in the middle of the TLS handshake instead of answering with a proper TLS alert.
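The same handshake can also be attempted from an IEx session with Erlang’s :ssl module, which takes curl out of the picture entirely (a quick probe, not a fix; certificate verification is deliberately disabled because only the handshake matters here):

:ssl.start()
:ssl.connect('www.allremotedevjobs.com', 443, [verify: :verify_none], 5_000)
# A working endpoint returns {:ok, sslsocket}; a server that drops the
# connection mid-handshake shows up as {:error, :closed} (or a timeout).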

I have a Phoenix 1.5.1 app running on ECS/EC2 with Docker, released with Distillery. The setup is pretty simple: my release is packaged in a Docker container and pushed to AWS. My release docker-compose file maps the following ports:

ports:
   - '80:4000'
   - '443:5000'

To have the simplest possible example, I have decided not to use any environment variables, so my HTTP and HTTPS config is in prod.exs instead of releases.exs. Here is the part that I think is relevant to my problem:

  http: [
    port: 4000
  ],
  https: [
    port: 5000,
    keyfile: "/etc/letsencrypt/live/www.allremotedevjobs.com/privkey.pem",
    certfile: "/etc/letsencrypt/live/www.allremotedevjobs.com/cert.pem",
    cacertfile: "/etc/letsencrypt/live/www.allremotedevjobs.com/chain.pem"
  ]

In AWS, my security group allows access to ports 80 and 443.

I have generated my certificate with certbot on the EC2 instance where the Docker container runs, using certbot certonly --standalone. The certificates are generated successfully, and it seems that my Phoenix app can read them, as it doesn’t complain when deploying that any of the files is missing or unreadable. I am using the full paths shown in the config to reference the cert.pem, privkey.pem and chain.pem files. I thought that I could just use fullchain.pem for the certfile and omit the cacertfile, but as all the examples I have seen use the 3 files, I’ve decided to do like everybody else. These 3 files are mounted at the same paths inside my container using volumes in docker-compose:

volumes:
  - /etc/letsencrypt/live/www.allremotedevjobs.com/privkey.pem:/etc/letsencrypt/live/www.allremotedevjobs.com/privkey.pem
  - /etc/letsencrypt/live/www.allremotedevjobs.com/cert.pem:/etc/letsencrypt/live/www.allremotedevjobs.com/cert.pem
  - /etc/letsencrypt/live/www.allremotedevjobs.com/chain.pem:/etc/letsencrypt/live/www.allremotedevjobs.com/chain.pem
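I haven’t gone further than the deploy logs yet; a more direct check of the “can read them” claim would be to run something like this in a remote IEx session attached to the release (with Distillery that’s bin/<app> remote_console; the path below is the one from my config):

# {:ok, _pem} means the user running the release can read the file;
# {:error, :eacces} would point to a permission problem instead.
File.read("/etc/letsencrypt/live/www.allremotedevjobs.com/privkey.pem")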

I have tried a lot of different things in the config and regenerated my certificates a few times, and nothing seems to work. Worse, I have absolutely no idea what I could possibly do to make it work. I do not want to use nginx or Apache, as it is supposed to work without them.

If anyone has any idea on what I could try, that would be really amazing!

Thank you for your help.

Have you considered using ACM with a load balancer? If you’re using ECS, it sounds like you plan to run multiple web servers in a cluster and have them load balanced, and if you use Amazon’s load balancer you can get SSL certificates managed for you for free with their ACM service.

I know it’s not directly answering your question, but maybe it’s a viable alternative, especially since you’re generating certificates directly on your EC2 instances in an environment where you’re running a container orchestrator. What would happen if you ran 2 replicas of your web server on the cluster? You would have to generate SSL certs on each server, and you may find yourself rate limited quite easily by Let’s Encrypt (it only allows 5 duplicate certificates per week for the same set of domains).

One thing that you can check is whether your ECS service is already set up with an ELB/ALB. If you used the AWS console’s wizard UI to create the service, it’s possible you already have one without realizing it, since I think it creates them automatically. If you have a load balancer, this could be the issue: the load balancer terminates the SSL connection for you already, so the traffic would never reach your Phoenix app as HTTPS.

You should also check the port mappings in your task definition. It could be that you didn’t map the ports correctly there; you’d need both 80 and 443 mapped through to the container ports your app listens on.

Another thing that could be helpful is to set the logging level to :info, momentarily turn on CloudWatch logging in ECS, and see if you get any useful output. It sounds to me like the HTTPS connection never hits your Phoenix app at all, so that should help you narrow the debugging down to an infrastructure misconfiguration.
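For the logging level, a minimal sketch of what that looks like in prod.exs (assuming the stock Logger setup):

# prod.exs: raise the level so info-level output reaches CloudWatch
config :logger, level: :info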

Also, really, if you’re doing HTTPS on ECS, I highly recommend using an ALB with ACM, since it does all of this hard work for you without your having to manually manage any certs. If it’s helpful, I gave a talk on this exact setup two years ago, and there’s a full code repo with Terraform config scripts you can use to bootstrap an infrastructure like this: https://m.youtube.com/watch?v=JtUy68PeEWE&t=1020s

Thanks for taking the time to reply. I actually used a Network Load Balancer with an ACM-generated certificate before trying to get away from that solution. HTTPS worked fine except for HSTS, which is probably related to the x-forwarded-proto header that I need to manage.

The reason I wanted to avoid using a load balancer is that I wanted to try to make it work with a simpler, minimal configuration and avoid getting more locked in with AWS. I wanted to understand better what I was doing and have a bit more control over it.

I understand the appeal of using a load balancer and the limitations that come with Let’s Encrypt, but I don’t think I need to worry yet about what would happen with more replicas, as for the moment my only visitors are a couple of friends and maybe some readers of this post who tried to access the website out of curiosity. :smile:

So I believe you are right, it is a perfectly viable alternative, and I would eventually switch back to it if I can’t make it work without one. But I see no reason why it shouldn’t work without a load balancer. It is precisely this gap in my understanding that I am trying to fill here.

Also, the only reason I am using ECS is convenience, not really scalability. You build your container (which I already had), push it, and then it gets deployed on EC2.

Thank you for your reply. I am definitely going to watch your talk; that’s precisely the kind of topic I would like to understand better.

I haven’t got any load balancer at the moment. I had one before, as described in my answer to @cnck1387, and am now trying to make it work without one. I wanted at first to use an Application Load Balancer, but as my registrar does not provide ALIAS records and it is not possible to point an Elastic IP at an ALB, I had to use a Network Load Balancer instead. From what I read on the AWS blog, Network Load Balancers have offered TLS termination since January 2019. I am a bit unfamiliar with all these concepts: does this mean that everything behind this load balancer doesn’t need to run on HTTPS? This contradicts some other posts I have read saying that everything behind a Network Load Balancer should be on HTTPS, but those posts might have been written before TLS termination was possible with this type of load balancer.

Regarding the port mapping, I see in my task definition that port 80 is mapped to port 4000 and port 443 to port 5000, which is what I expect. The volumes are mapped correctly too. Everything seems good there. These values in the task definition are generated automatically from my docker-compose file with ecs-cli compose in my deployment script, so I do not expect big surprises in there.

I will try to turn the logging on to see if I can get more info. Thanks for the suggestion.

For the load balancer, I really want to try to make it work without it first, or at least understand why it is impossible if that is the case.

SSL termination basically means the SSL connection ends there. So if the NLB handles SSL termination, it is what does the encryption work, and it then hands off a plain, non-SSL connection to the server to do the rest of the work.

Whether you can terminate at the load-balancer level really depends on your needs. For certain workloads (e.g. HIPAA) you may need to encrypt all the way through the stack, so what you’d do is terminate the public SSL (i.e., the certificate for your domain) at the load balancer and then use a different certificate for the traffic within your system (i.e., between the NLB and EC2/ECS). For most contexts, however, terminating at the load balancer and securing your infrastructure internally by other means is “secure enough”, and the headache of end-to-end encryption (administrative overhead, system resource allocation, etc.) is just not worth it.

I totally understand you wanting to make it work without an LB. My best guess is a configuration issue: Phoenix should be handling the termination for you, and the fact that it’s failing points to configuration.

One possible thing to check is that your Docker image has the OpenSSL bindings. If you’re using an optimized Docker image, maybe they were left out?

Here’s an easy way to check:

The Erlang/OTP runtime, with OpenSSL bindings; run :crypto.info_lib() in an IEx session to verify

From here, which is probably a good resource for debugging, since internally Phoenix just hands this work off to Plug.SSL.
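In a remote IEx session attached to the release, that check looks like:

:crypto.info_lib()
# => a list like [{'OpenSSL', _version, 'OpenSSL 1.1.1g  21 Apr 2020'}]
# (exact values depend on the build). If the crypto NIFs were left out of
# the image, the call fails instead of returning this list.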

That documentation also seems to imply the cipher_suite option is required, so maybe try setting that? There could be a silent failure there.
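For example, merged into the https config from your first post, that might look like this (a sketch only; :my_app and MyAppWeb.Endpoint are placeholders for your actual app and endpoint names):

# prod.exs, placeholders for app and endpoint names
config :my_app, MyAppWeb.Endpoint,
  https: [
    port: 5000,
    cipher_suite: :strong,  # Plug.SSL preset; :compatible is the more permissive one
    keyfile: "/etc/letsencrypt/live/www.allremotedevjobs.com/privkey.pem",
    certfile: "/etc/letsencrypt/live/www.allremotedevjobs.com/cert.pem",
    cacertfile: "/etc/letsencrypt/live/www.allremotedevjobs.com/chain.pem"
  ]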

But of course more logging should help as well :smile: Happy bug hunting :bug:

I’ve managed to solve the issue. It was simply a permission issue on the files generated by Let’s Encrypt. The only error I got was the one that I described in my first post. It is quite frustrating that there is no clear error message saying that the certificates could not be read. :frowning:

I have found a post describing a similar issue, but as its author solved the problem by switching to nginx, which runs as root and is therefore allowed to read the certificates generated by Let’s Encrypt, I initially missed the second reply there, which would have fixed my issue.

Anyway, thanks for the help!


It was simply a permission issue on the files generated by Let’s Encrypt.

Seeing this line was critical for solving my issue. The /etc/letsencrypt/live/somedomain directory contains soft links into the ../../archive/somedomain/ directory, and the file permissions in the archive directory are what matter. I initially didn’t investigate past the perms of the soft links in the live directory. The privkey1.pem in the archive directory is initially created by certbot as readable only by the owner, and it needs to be chmod 644’d so it’s readable by the Elixir server process, which does not run as root. Definitely frustrating that there is no clear error message.
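If it helps anyone else hitting this, the effective mode can be checked from a remote IEx session without leaving Elixir (a sketch; the domain is a placeholder):

# File.stat follows the symlink in live/, so the mode reported is that of
# the real file in archive/.
{:ok, stat} = File.stat("/etc/letsencrypt/live/somedomain/privkey.pem")
Integer.to_string(rem(stat.mode, 0o1000), 8)
# => "600" before the fix (owner-only), "644" once group/other can read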

Anyway, thanks!


Wow!! Thank you, sir. This problem had me stumped and the error was so uninformative.