Performance issues (slowness) after Elixir upgrade

Hi Guys,

I’m having a performance issue after an Elixir upgrade:

before:
elixir 1.6.4
erlang 20.3
phoenix 1.3.2

now:
elixir 1.9.0
erlang 22.0.4
phoenix 1.4.10

2000 requests/sec, 100 DB Connections (Postgres)

Before, the 2000 requests executed in about 4-5 minutes, and the average time for each request was around 6s (I know this number is high and I need to cache some results to lower the request execution time, but that is not the problem I’m facing).

After the upgrade the total time is lower, about 2-3 minutes, but the average time for each request is 40s. It was a surprise that the new Elixir with the same settings as before was performing worse. On top of that, I’m losing some requests: HAProxy (working as a reverse proxy) shows 2000 while my Elixir backend app shows 1600. I even had to tweak some configuration values (to be able to finish most of the requests) that I didn’t have to tweak before.

When I look at the logs locally, I see that the DBConnection pool is switching between requests too much and executing roughly one query per checkout. That means I have close to 2000 requests making progress little by little (I think because of the new CoDel-based queueing algorithm). I would prefer a setting (if there is one) to assign a DBConnection to a request until the request finishes executing. I also looked into DBConnection.Ownership to play with :ownership_timeout, but I read that DBConnection.Ownership is mainly meant for tests (I might have misinterpreted this).
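
For reference, these are the queueing knobs I can see in the DBConnection docs (a sketch showing the documented defaults, not the values I’m running):

# Sketch: DBConnection 2.x queueing options with their documented defaults.
# They control when the pool starts shedding load under pressure; neither
# of them pins a connection to a request for its whole lifetime.
config :myapp, Beacon.Repo,
  queue_target: 50,      # target time (ms) a checkout may wait in the queue
  queue_interval: 1_000  # window (ms) over which the target is evaluated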

My current settings:

config :myapp, BeaconWeb.Endpoint,
  http: [
    ...
    protocol_options: [
      idle_timeout: 1_200_000,
      inactivity_timeout: 1_200_000
    ]
  ]

config :myapp, Beacon.Repo,
  ...
  pool_size: 100,
  timeout: 100_000_000,
  queue_target: 100_000_000

As additional information, I’m getting this message frequently on localhost:
With DBConnection.ConnectionPool:

[info] [] Postgrex.Protocol (#PID<0.2384.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.4135.0> exited

With DBConnection.Ownership:

[error] [] Postgrex.Protocol (#PID<0.2384.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.4135.0> exited

Documentation I have checked:
https://hexdocs.pm/db_connection/DBConnection.html
https://hexdocs.pm/db_connection/DBConnection.Ownership.html
https://hexdocs.pm/plug_cowboy/Plug.Cowboy.html
https://ninenines.eu/docs/en/cowboy/2.5/manual/cowboy_http/
https://ninenines.eu/docs/en/ranch/1.6/manual/ranch/

Any help will be appreciated. Thanks in advance.

You can do so explicitly via:

MyApp.Repo.checkout(fn ->
  # all calls made inside here will use the same connection.
end)

@benwilson512 Thanks for your rapid reply. I would have to do that across the whole application, since the Repo is used in multiple modules; I was looking more for a global setting. I just upgraded the application and am facing this problem, and I was expecting it to perform better out of the box. But I will keep that suggestion in mind in case I cannot find anything at a global level. Thank you.

@agacode Usually if you wanted to do that “app wide” you’d basically just do it in the controller, or in a plug’s call function, or wherever fits. Doing it app wide is a tricky concept with all the different processes running around. Most processes never even touch the database, so there has to be an explicit point at which you say “this process is checking out a database connection”.
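
As a rough sketch of the controller-level version (the module and action names below are made up, and Beacon.Repo is assumed from your config):

defmodule BeaconWeb.ReportController do
  use BeaconWeb, :controller

  # Override Phoenix's action/2 so every action in this controller runs
  # inside a single checked-out connection for its whole duration.
  def action(conn, _opts) do
    Beacon.Repo.checkout(fn ->
      apply(__MODULE__, action_name(conn), [conn, conn.params])
    end)
  end

  def show(conn, _params) do
    # All Repo calls made in here reuse that one connection instead of
    # checking a connection out of the pool per query.
    {:ok, %{rows: [[value]]}} = Beacon.Repo.query("SELECT 1")
    json(conn, %{result: value})
  end
end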

@benwilson512 I’ll try that and see how it goes. Which pool? The regular ConnectionPool, Ownership, or another one?

Regular. If your responses are taking 40 seconds though I’d seriously audit the database requests being made. Could you also supply the actual mechanism you’re using to benchmark?

If your responses are taking 40 seconds though I’d seriously audit the database requests being made

There is almost no time for refactors, due to a constant business need for new features and other constraints that don’t allow the developers to directly access/audit the DB.
But my comparison/baseline is the following:
old version: 6s vs. new version: 40s (same old code, project generated from scratch and adjusted)

Could you also supply the actual mechanism you’re using to benchmark?

We had Benchee for the old version, although we haven’t run it in a very long time.

By benchmark I mean: what are you using to generate the 2000 req/sec?

Rereading your original question, I’m curious how you’re measuring this data. If after the upgrade everything completed nearly twice as fast, it seems impossible that the average request time increased by nearly a factor of 7.

Oh sorry, I misunderstood your question. We use JMeter to do the load tests. We also have production API clients that send a big batch of requests and later tell us how many they sent, and HAProxy is registering all the connections/requests.

Each request individually takes longer (I assume because of the number of queries we execute within each request and how much the pool is multiplexing), while the whole batch takes a bit less time than before. Maybe there is no logical explanation for that, but those are the facts.

I will test with Repo.checkout at the request level and see how it goes. Thank you so much!

I would consider renaming the question to be more about the general performance issue you’re experiencing after upgrading, by the way. Doing an explicit database checkout may help; it also might not. The issue you’re experiencing is the slowdown, and there may be other avenues to explore for improving that.

Done :wink:

First of all, I’d be curious to see the SQL processing times in the debug logs, to see which part is actually the slow one.
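
If the default debug log lines don’t give you enough, here is a sketch of a telemetry handler that logs queue time and query time separately (assuming Ecto 3.1+; the [:beacon, :repo, :query] event name is the default derived from Beacon.Repo and will differ if you set :telemetry_prefix):

defmodule Beacon.QueryTimeLogger do
  require Logger

  def attach do
    :telemetry.attach(
      "beacon-query-time-logger",
      [:beacon, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    # queue_time is how long the caller waited for a pooled connection;
    # query_time is the time spent in Postgres itself.
    to_ms = fn
      nil -> nil
      native -> System.convert_time_unit(native, :native, :millisecond)
    end

    Logger.info(fn ->
      "query=#{metadata[:query]} " <>
        "queue=#{inspect(to_ms.(measurements[:queue_time]))}ms " <>
        "db=#{inspect(to_ms.(measurements[:query_time]))}ms " <>
        "total=#{inspect(to_ms.(measurements[:total_time]))}ms"
    end)
  end
end

You would call Beacon.QueryTimeLogger.attach() once at application start.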

This suggestion might be going out on a limb… but have a look at this recent post:

Disable prepared statements for an individual query

I didn’t bother to research when prepared statements were introduced in Ecto etc., but since this was such a large upgrade jump, maybe you’re being affected in a similar way and it’s adding up fast due to the sheer volume of queries.

It’s a long shot, but the read and eventual outcome might prove educational either way :slight_smile:
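
If you do want to experiment with that, the knob (assuming Postgrex) is the repo’s :prepare option; a minimal sketch:

# Sketch: disable named prepared statements repo-wide as an experiment.
# This is Postgrex's :prepare option; unnamed statements are usually slower
# for repeated queries, so measure before and after.
config :myapp, Beacon.Repo,
  prepare: :unnamed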

I put Repo.checkout in the controller action on my localhost, with a pool size of 10.
It is processing only 10 requests at the same time, but it is getting stuck/hung.
I ran some commands that I found on a closed issue for db_connection, and the output is the following:

iex(7)> pid = spawn(fn -> Beacon.Repo.query("SELECT 1") end)
#PID<0.2153.0>
iex(8)> Process.info(pid)
[
  current_function: {DBConnection.Holder, :checkout_call, 5},
  initial_call: {:erlang, :apply, 2},
  status: :waiting,
  message_queue_len: 0,
  links: [],
  dictionary: [],
  trap_exit: false,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.65.0>,
  total_heap_size: 233,
  heap_size: 233,
  stack_size: 38,
  reductions: 188,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 0
  ],
  suspending: []
]
iex(9)> Process.info(pid, :current_stacktrace)
{:current_stacktrace,
 [
   {DBConnection.Holder, :checkout_call, 5,
    [file: 'lib/db_connection/holder.ex', line: 251]},
   {DBConnection.Holder, :checkout, 2,
    [file: 'lib/db_connection/holder.ex', line: 60]},
   {DBConnection, :checkout, 3, [file: 'lib/db_connection.ex', line: 1030]},
   {DBConnection, :run, 6, [file: 'lib/db_connection.ex', line: 1340]},
   {DBConnection, :parsed_prepare_execute, 5,
    [file: 'lib/db_connection.ex', line: 540]},
   {DBConnection, :prepare_execute, 4,
    [file: 'lib/db_connection.ex', line: 533]},
   {Postgrex, :query_prepare_execute, 4, [file: 'lib/postgrex.ex', line: 219]}
 ]}

Any idea why it might be getting hung? Thank you.

What is getting hung, the 10 requests or the iex command? Are you running that spawn command while the 10 requests are happening? If you have a pool size of 10 and there are 10 requests happening, each with a connection, your pid in the console MUST hang until a connection becomes available.
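
In other words, a minimal sketch of that situation (assuming pool_size: 10):

# Each of these 10 processes holds one pooled connection for 30 seconds...
for _ <- 1..10 do
  spawn(fn ->
    Beacon.Repo.checkout(fn -> Process.sleep(30_000) end)
  end)
end

# ...so this call has to wait in the pool queue until a connection is
# returned (or until the checkout/queue timeout fires).
Beacon.Repo.query("SELECT 1")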

The 10 requests are getting hung. Actually, they all executed to the end and I got a 200 (OK) response from each of them in JMeter. But it won’t execute any other requests after that: no more logs in the console, no activity, nothing.

I ran those commands because I saw them in this post from @josevalim

I then tried replacing Repo.checkout with Repo.transaction (the queries can execute in a transaction) because I saw it in this article from @josevalim, but that did not solve the problem either: it doesn’t get hung with Repo.transaction, but the behavior is the same as before, very slow.
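
Roughly what I tried (a sketch based on the same kind of controller-level wrapper as above, not the exact code):

# Same action/2 override, but wrapping the action in a transaction instead
# of a plain checkout. Repo.transaction returns {:ok, result} on success
# (this sketch ignores the rollback/error case).
def action(conn, _opts) do
  {:ok, conn} =
    Beacon.Repo.transaction(fn ->
      apply(__MODULE__, action_name(conn), [conn, conn.params])
    end)

  conn
end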

I also tried the master branch for db_connection but I got the same behavior.

Thanks again for looking into this.

@agacode :point_up: This. Give us some more details, there are practically zero at this point.
