During high load, the REST API and even the welcome page at localhost:4000 return a 504 Gateway Time-out error. I am using OTP 20. Any clues on how to improve performance?
Elixir 1.7.3 (compiled with Erlang/OTP 20)
Would need more information. I've sieged my server with over 30k requests per second on 2 cores without an incorrect result (and after that it just slows down a lot more, but still no incorrect, dropped, or invalid responses).
If it's a 504 Gateway Time-out, that sounds like there is a front-loader/gateway in front of Elixir and it is what's getting overwhelmed (perhaps it has only a single connection open to the backend server?).
What is the gateway program, what defines "high load", what is the gateway's configuration for the reverse proxy, what are the hardware specs, etc.?
All requests go through Cloudflare. If I don't go through Cloudflare and hit the server by IP directly, even the welcome page shows a 504 intermittently. This is on a c4xlarge instance on EC2. It has 16 vCPUs. How do I find the number of connections defined for the backend (do you mean the database)?
In the config I see that the adapter is Ecto.Adapters.Postgres with a pool size of 50. The peak load might be 3k requests per second updating the DB and at most 500 requests per second reading the info through the REST API.
I am maintaining code someone else has written. I can read the code and understand what each process is doing. Let me know if you need additional information.
When you hit Phoenix directly, it should never ever give you a 504, as it does not act as a gateway…
I just verified the config: the requests go through an Amazon load balancer with the target instance defined on port 4000. Requests hit Cloudflare, which maps to the Amazon load balancer, which then forwards the request to the target instance.
Do you have any issues, when you hit your application without any MITMs?
Do you have any issues when you hit your NGINX without any MITMs?
Do you have any issues when you use some MITM CDN?
Do you use free or paid plans?
My current experience with CDN providers like CloudFlare is to simply avoid them. They add hard-to-debug overhead without much benefit. It is much more important to have the servers "near" the expected clients.
The clients are mobile apps updating data, plus mobile and web clients reading data from Phoenix. We have paid plans.
It's still good to test each layer, as that will tell you where the failures are happening. It's extremely unlikely to be a fault of Elixir, as this is the kind of work the BEAM is designed for. It doesn't fault in these ways, and it doesn't return 504s unless the program itself explicitly does so. It is very likely to be one of those front gateways, so you need to test each in turn.
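A minimal sketch of what "test each in turn" could look like from a shell, assuming `curl` and `awk` are installed. The function names `probe` and `tally_status`, and the `ALB_DNS_NAME` and `PUBLIC_HOSTNAME` placeholders, are made up for illustration; substitute your own endpoints. It sends requests one after another, so it only shows which layer produces the 504s while the real traffic is running; it does not generate load itself.

```shell
#!/bin/sh
# Tally HTTP status codes read from stdin (one code per line).
tally_status() {
  awk '{ count[$1]++ } END { for (code in count) print code, count[code] }' | sort
}

# Send N sequential requests to URL and print one status code per line.
# curl reports 000 when no HTTP response arrived (refused, timed out, ...).
probe() {
  url=$1
  n=${2:-100}
  i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' --max-time 10 "$url"
    i=$((i + 1))
  done
}

# Usage, innermost layer first (hostnames are placeholders):
#   probe http://localhost:4000/ 200 | tally_status     # on the instance itself
#   probe http://ALB_DNS_NAME/ 200 | tally_status       # load balancer only
#   probe https://PUBLIC_HOSTNAME/ 200 | tally_status   # through the CDN
```

If the first probe never shows a 504 while the outer ones do, the time-outs are being produced by a layer in front of the application.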
I am currently checking with Cloudflare support to see what they find at their end. When I check top on the VM, I see beam consuming 590.7% CPU. Is this normal on a 16-vCPU instance?
Using 6 vCPUs' worth on a 16-vCPU machine is definitely not saturated, so the BEAM would be responding just fine then.
The overall usage is dictated primarily by the application code though, so that could be normal or high, all depending on the code. But considering it's not hitting all 16 vCPUs, it's definitely not hitting its limits yet.
However, for a usual webapp that's pretty high CPU usage, so either it's doing a lot internally or it's really getting hit by potentially hundreds of thousands of connections.
For getting usage "inside" the VM, the built-in :observer module via :observer.start() is great if you can forward X over a GUI connection (or connect to it remotely if it's exposed or can be exposed over, say, ssh). In a pure text shell you can get a reduced set of information by just opening a shell into the VM and running :etop.start() to show the top actors/processes.
However again, getting a 504 is entirely not normal from a standard Phoenix-built app, that still sounds like one of the gateways/load-balancers is failing and it would be good to bypass them at least for a time as a test.
I entered this after starting the shell, but do not see any info. The clients post their locations every 5 seconds, and there are other clients accessing this location information. The POST API saves the lat, lon in the Postgres DB, and the other mobile and web clients request the lat, lon info using GET APIs.
Hmm, you should immediately see a lot of info, like here is what I get:
iex(devserver2@127.0.0.1)2> :etop.start()
========================================================================================
 'devserver2@127.0.0.1'                                                    21:11:05
 Load:  cpu         0               Memory:  total       62468    binary       1427
        procs     384                        processes   24416    code        22299
        runq        0                        atom         1017    ets          2741
Pid            Name or Initial Func    Time    Reds  Memory    MsgQ Current Function
----------------------------------------------------------------------------------------
<0.1027.0>     'Elixir.FileSystem.B     '-'43147764  689568       0 gen_server:loop/7
<0.286.0>      'Elixir.Phoenix.Code     '-'30204171 5692948       0 gen_server:loop/7
<0.9.0>        erl_prim_loader          '-'18242316  689524       0 erl_prim_loader:loop
<0.1040.0>     'Elixir.DBConnection     '-'17875451   25164       0 gen_server:loop/7
<0.1026.0>     phoenix_live_reload_     '-'11739249   42436       0 gen_server:loop/7
<0.277.0>      cowboy_clock             '-' 9445910    8932       0 gen_server:loop/7
<0.1.0>        erts_code_purger         '-' 7799639   44776       0 erts_code_purger:wai
<0.2.0>        erts_literal_area_co     '-' 6650619    2688       0 erts_literal_area_co
<0.1048.0>     'Elixir.DBConnection     '-' 4554875    4176       0 erlang:hibernate/3
<0.1044.0>     'Elixir.DBConnection     '-' 4494410    4176       0 erlang:hibernate/3
========================================================================================
This keeps updating every few seconds (my dev server here is not under heavy use at all, or really any use). You have to kill your shell to stop it though, so be careful not to kill your server instead; that is why remote shells are best.
Just as an efficiency check: are the clients keeping an active TCP connection open to the server to do that? Setting up and tearing down tons of HTTPS connections on gateways does have a tendency to kill them pretty badly, even though Elixir handles it well.
I type erl and I get the below shell
Eshell V9.0 (abort with ^G)
1>
Here I type :etop.start()
I am not sure if I am in the correct shell.
Oh, that's erl! You'll want to run etop:start(). instead there (Erlang syntax, with the trailing period); that's the old Erlang shell.
Use iex for the Elixir shell in comparison.
And don't forget to connect a remote shell to the other node unless you are spooling up the full server.
If you are using distillery, it has a built-in easy command for it though. I highly recommend it, something like `my_server rpc ":etop.start()"`.
I will try it out. I use the commands below to build and run:
mix deps.get
MIX_ENV=final mix compile
MIX_ENV=final mix release
I will have to investigate this further. I tried it on dev, and if I use iex and then try to kill the shell, it shuts down the whole BEAM process.
Correct, it takes over the shell that it is run on, so you want to use a remote shell or RPC or so.
Have you checked whether the backend database is getting overwhelmed?
Even if it were, it wouldn't be giving back a 504; that's the response when the reverse proxy cannot reach a server at all. Although if the reverse proxy had its timeout set too short and the database was overloaded, then that could cause it. Hmm, worth checking!
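One way to keep etop away from the server's own shell is to run it in a separate, throwaway Erlang node that attaches to the running one, so that stopping the viewer only stops the viewer. The sketch below assumes the target node was started with a long name and a cookie, that you run it on the same host, and that `runtime_tools` is available on the target (as far as I know etop needs it there). The function name `etop_remote`, the viewer node name and the cookie value are placeholders; `devserver2@127.0.0.1` is just the node name from the output above.

```shell
#!/bin/sh
# Start etop in its own hidden node and point it at TARGET.
# Set DRY_RUN=1 to print the command instead of running it.
etop_remote() {
  target=$1
  cookie=$2
  set -- erl -name etop_viewer@127.0.0.1 -hidden -s etop -s erlang halt \
    -output text -node "$target" -setcookie "$cookie" -tracing off
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Usage (placeholder cookie):
#   etop_remote devserver2@127.0.0.1 my_cookie
```

Stopping this viewer with Ctrl-C only ends the `etop_viewer` node; the application node keeps running.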
Yeah that was my thinking exactly.
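For a rough feel of whether the database pool could be the slow part, here is a back-of-the-envelope calculation using the numbers given earlier in the thread (pool size 50, roughly 3k writes plus 500 reads per second). It assumes every request checks out exactly one connection for its whole database work, which may not match the real code; `pool_budget_ms` is a made-up helper name.

```shell
#!/bin/sh
# Mean time (ms) a request may hold a pooled connection before the pool
# saturates: 1000 ms * pool_size / requests_per_second.
pool_budget_ms() {
  awk -v rps="$1" -v pool="$2" 'BEGIN { printf "%.1f\n", 1000 * pool / rps }'
}

pool_budget_ms 3500 50   # prints 14.3
```

So if the average query (including the round trip to Postgres) takes longer than about 14 ms at peak, requests start queueing for a connection, response times climb, and a proxy with a short timeout in front could give up first.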