Elixir REST API returning 504 Gateway Time-out

otp
phoenix
rest_api

#1

During high load, the REST API, or even the welcome page at localhost:4000, returns a 504 Gateway Time-out error. I am using OTP 20. Any clues on how to improve performance?

Elixir 1.7.3 (compiled with Erlang/OTP 20)


#2

Would need more information. I’ve sieged my server with over 30k requests per second on 2 cores without an incorrect result (beyond that it just slows down a lot more, but still with no incorrect, dropped, or invalid responses).

If it’s a 504 Gateway Time-out that sounds like there is a front-loader/gateway in front of Elixir and it is what’s getting overwhelmed (perhaps it has only a single connection open to the backend server?).

What is the gateway program, what defines ‘high load’, what is the gateway’s configuration for the reverse proxy, what are the hardware specs, etc… etc… :slight_smile:


#3

All requests go through Cloudflare. If I don’t go through Cloudflare and hit the server by IP directly, even the welcome page shows a 504 intermittently. This is on a c4xlarge instance on EC2. It has 16 vCPUs. How do I find the number of connections defined for the backend (do you mean the database)?

In the config I see that the adapter is Ecto.Adapters.Postgres with a pool size of 50. At peak there might be 3k requests per second updating the db and at most 500 requests per second reading the info from the REST API.

I am maintaining code someone else has written. I can read the code and understand what each process is doing. Let me know if you need additional information.


#4

When you hit Phoenix directly, it should never ever give you a 504, as it does not act as a gateway…


#5

I just verified the config: the requests are going through an Amazon load balancer with the target instance defined on port 4000. Requests hit Cloudflare, which maps to the Amazon load balancer, which then forwards the request to the target instance.


#6

Do you have any issues, when you hit your application without any MITMs?

Do you have any issues when you hit your NGINX without any MITMs?

Do you have any issues when you use some MITM CDN?

Do you use free or paid plans?

My current experience with CDN providers like CloudFlare is to simply avoid them. They add hard to debug overhead without much benefit. Much more important is to have the servers “near” to the expected clients.


#7

The clients are mobile apps updating data, plus mobile and web clients accessing data from Phoenix. We have paid plans.


#8

It’s still good to test each layer, as that will tell you where the failures are happening. It’s extremely unlikely to be a fault of Elixir: this is the kind of work the BEAM is designed for, it doesn’t fail in these ways, and it doesn’t return 504s unless the program itself explicitly does so. It is very likely one of those front gateways, so you need to test each in turn. :slight_smile:
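One way to test each layer in turn is to fire the same request at every hop and count the status codes that come back. The following is only a minimal sketch: the module name LayerProbe and the layer names and URLs in the usage line are hypothetical placeholders, and it relies on :httpc from OTP's inets application rather than any HTTP client the project may already use.

```elixir
defmodule LayerProbe do
  @moduledoc "Counts HTTP status codes per layer so failures can be localized."

  # layers is a keyword list of {name, url}; fetch is any function that
  # takes a URL and returns a status code (or an error tuple).
  def run(layers, attempts, fetch \\ &http_status/1) do
    for {name, url} <- layers, into: %{} do
      {name, tally(url, attempts, fetch)}
    end
  end

  # Calls fetch `attempts` times and counts how often each result occurs.
  def tally(url, attempts, fetch) do
    1..attempts
    |> Enum.map(fn _ -> fetch.(url) end)
    |> Enum.reduce(%{}, fn status, acc -> Map.update(acc, status, 1, &(&1 + 1)) end)
  end

  # Default fetcher: one GET via :httpc with a 10 second timeout.
  def http_status(url) do
    {:ok, _} = Application.ensure_all_started(:inets)
    {:ok, _} = Application.ensure_all_started(:ssl)

    case :httpc.request(:get, {String.to_charlist(url), []}, [timeout: 10_000], []) do
      {:ok, {{_version, status, _reason}, _headers, _body}} -> status
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Called as, say, LayerProbe.run([phoenix: "http://10.0.0.5:4000/", load_balancer: "https://lb.example.com/"], 100), it returns a map of status counts per layer. If the 504s only show up from the load balancer entry onwards, that points at the gateway rather than at Phoenix. The requests here are sequential, so this checks where errors appear, not how the stack behaves under load.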


#9

I am currently checking with Cloudflare support to see what is happening at their end. When I check top on the VM, I see beam is consuming 590.7 %cpu. Is this normal on a 16 vCPU instance?


#10

Using 6vcpu’s worth on a 16vcpu is definitely not saturated so the BEAM would be responding just fine then.

The overall usage is dictated primarily by the application code though, so that could be normal or high depending on the code. But considering it’s not hitting all 16 vCPUs, it’s definitely not hitting its limits yet.

However, for a usual webapp that’s pretty high CPU usage, so either it’s doing a lot internally or it’s really getting hit by potentially hundreds of thousands of connections.

To see usage ‘inside’ the VM, the built-in :observer module via :observer.start() is great if you can run X over a GUI connection (or connect to it remotely if it is exposed or can be exposed over, say, ssh). In a pure text shell you can get a reduced set of information by opening a shell into the VM and running :etop.start() to show the top actors/processes.

However again, getting a 504 is entirely not normal from a standard Phoenix-built app, that still sounds like one of the gateways/load-balancers is failing and it would be good to bypass them at least for a time as a test.


#11

I entered this after starting the shell, but do not see any info. The clients post their locations every 5 seconds and there are other clients accessing this location information. The POST API saves the lat, lon in the Postgres db, and the other mobile and web clients request the lat, lon info using GET APIs.


#12

Hmm, you should immediately see a lot of info, like here is what I get:

iex(devserver2@127.0.0.1)2> :etop.start()

========================================================================================
 'devserver2@127.0.0.1'                                                    21:11:05
 Load:  cpu         0               Memory:  total       62468    binary       1427
        procs     384                        processes   24416    code        22299
        runq        0                        atom         1017    ets          2741

Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function
----------------------------------------------------------------------------------------
<0.1027.0>     'Elixir.FileSystem.B     '-'43147764  689568       0 gen_server:loop/7
<0.286.0>      'Elixir.Phoenix.Code     '-'30204171 5692948       0 gen_server:loop/7
<0.9.0>        erl_prim_loader          '-'18242316  689524       0 erl_prim_loader:loop
<0.1040.0>     'Elixir.DBConnection     '-'17875451   25164       0 gen_server:loop/7
<0.1026.0>     phoenix_live_reload_     '-'11739249   42436       0 gen_server:loop/7
<0.277.0>      cowboy_clock             '-' 9445910    8932       0 gen_server:loop/7
<0.1.0>        erts_code_purger         '-' 7799639   44776       0 erts_code_purger:wai
<0.2.0>        erts_literal_area_co     '-' 6650619    2688       0 erts_literal_area_co
<0.1048.0>     'Elixir.DBConnection     '-' 4554875    4176       0 erlang:hibernate/3
<0.1044.0>     'Elixir.DBConnection     '-' 4494410    4176       0 erlang:hibernate/3
========================================================================================

Which keeps updating every few seconds (my dev server here is not under heavy use at all, or really any use). You have to kill your shell to stop it though so be careful not to kill your server instead, hence why remote shells are best. :slight_smile:

Just as an efficiency check, the clients are keeping an active TCP connection to the server to do that, yes? Setting up and tearing down tons of HTTPS connections on gateways does have a tendency to kill them pretty badly, even though Elixir handles it well.


#13

I type erl and I get the below shell

Eshell V9.0 (abort with ^G)
1>
here I type :etop.start()

not sure if I am in the correct shell


#14

Oh that’s erl! You’ll want to run etop:start() instead there, that’s the old Erlang shell. :slight_smile:

Use iex for the elixir shell in comparison.

And don’t forget to connect a remote shell to the other node unless you are spooling up the full server.

If you are using distillery it has a built-in easy command for it though, I highly recommend it, something like `my_server rpc ':etop.start()'`. :slight_smile:


#15

I will try it out, I use the below command to build and execute

mix deps.get
MIX_ENV=final mix compile
MIX_ENV=final mix release


#16

I will have to investigate this further. I tried it on dev, and if I use iex and then try to kill the shell, it shuts down the whole BEAM process :slight_smile:


#17

Correct, it takes over the shell that it is run on so you want to use a remote shell or RPC or so. :slight_smile:
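If taking over the shell is a concern, a one-off snapshot that returns immediately can be pasted into a remote shell instead. This is a rough sketch using only Process.list/0 and Process.info/2; the module name ProcSnapshot is made up for illustration, and unlike etop it does not refresh on its own.

```elixir
defmodule ProcSnapshot do
  @keys [:registered_name, :reductions, :message_queue_len, :memory, :current_function]

  # Returns the top n live processes as {pid, info} pairs, sorted
  # descending by sort_key (:reductions, :message_queue_len or :memory).
  def top(n \\ 10, sort_key \\ :reductions) do
    Process.list()
    |> Enum.map(fn pid -> {pid, Process.info(pid, @keys)} end)
    # Process.info/2 returns nil for a process that exited in the meantime.
    |> Enum.reject(fn {_pid, info} -> is_nil(info) end)
    |> Enum.sort_by(fn {_pid, info} -> -Keyword.fetch!(info, sort_key) end)
    |> Enum.take(n)
  end
end
```

ProcSnapshot.top(10, :message_queue_len) is usually the most telling call: a process whose message queue keeps growing between snapshots is a likely bottleneck. Note that the reductions reported by Process.info/2 are cumulative since the process started, so compare two snapshots rather than reading one in isolation.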


#18

Have you checked if the backend database is getting overwhelmed?


#19

Even if it were then it wouldn’t be giving back a 504, that’s a response when there isn’t an accessible server at all on the reverse proxy. Although if the reverse proxy had the timeout set too short and the database was overloaded then that could cause it, hmm, worth checking!
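For a back-of-envelope feel for whether the database pool could be the slow part, the numbers from post #3 (pool size 50, up to 3k writes plus 500 reads per second) can be turned into a time budget per request. This is a simplification that assumes every request holds exactly one pooled connection for all of its database work; the module name PoolBudget is hypothetical.

```elixir
defmodule PoolBudget do
  # Average time in milliseconds each request may hold a connection before
  # a pool of pool_size connections can no longer keep up with
  # requests_per_second incoming requests.
  def max_hold_ms(pool_size, requests_per_second) do
    pool_size / requests_per_second * 1000
  end
end

# Pool of 50, roughly 3_500 requests per second at peak.
IO.inspect(PoolBudget.max_hold_ms(50, 3_500))
```

Under those assumptions the budget is roughly 14 ms per request. If queries regularly take longer than that at peak, requests start queueing for a connection, response times climb, and a short timeout on the load balancer could then surface as a 504.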


#20

Yeah that was my thinking exactly.