DBConnection.ConnectionError PID(XXX) exited for some sql queries

Ayliane · December 10, 2019, 8:57am

Hi there,

I’m currently struggling with this bug, because i don’t have much logs and infos provided by my server. I have an endpoint called /stop (code sample provided below) in which i execute a lot of code before returning :ok . All this code is wrap in multiple functions and through multiple files and services.

Not far from the end of those functions, i’m writing to the database. I have 3 different inserts. But since a few days, i have this error : Postgrex.Protocol (#PID<0.375.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.543.0> exited showing up randomly in one of those 3 inserts. (It could be the first one, it could be in the middle of the second one, or the last one. And sometimes (rarely) it doesn’t even fail). The only thing i’ve been able to isolate as a “trigger” is when the inserted data is a little bigger. It has always worked before, and it also worked nicely when i’m trying to insert those same datas directly via iex.

On the Postgres side, i only have this log : LOG: unexpected EOF on client connection with an open transaction .

The controller code looks like this :

  def stop(conn, params) do
        case @influx_caller.end_connexion_performable?(params) do
          true ->
            handle_end_of_connexion(params)
            conn |> send_resp(:ok, [])

          false ->
            if @influx_caller.connexion_data_from(params["connexion_key"]) == %{} do
              update_connexion_with(:influx_empty, params, true)
            end

            conn |> send_resp(:forbidden, [])
        end
      end
    end
  end

The failure happens in the happy path (the true option of the case) in one of the files called by the handle_end_of_connexion function. For example in this query :

Ecto.Changeset.change(connexion, %{
        closed_at: Calculation.get_close_time(events),
      })
      |> Ecto.Changeset.put_assoc(:infos, MyApp.InfosGenerator.perform(obj, obj_type, twi, twl, events))
      |> MyApp.Repo.update

That exact same file with that exact same data can be called from another controller in my app (sort of admin panel), and it works just fine. So i think my problem is coming from the /stop endpoint.

Things i’ve tested so far :

Checking my OTP version. It was 21.3 and i found a lot of reports talking about similar errors coming from it, so i updated it to 22.1.7
Activating the show_sensitive_data_on_connection_error hoping it would add some logs to investigate, but no luck.
Updating phoenix, plug_cowboy, postgrex, and ecto to the latest version.
Adding http: [protocol_options: [idle_timeout: :infinity]], in my Mix.Config, since i’ve also read about a timeout from cowboy since it’s 2.0 version.

None of this works. (Of course, i also tried to log things everywhere, but everything is working just fine in local mode)

Any help or at least lead would be much appreciated. Thanks in advance !

[EDIT] This bug only happens in production and staging environment, but not locally. We figured that if we gave more memory to our containers, the error still show up, but less. We’re suspecting that the process gets killed by phoenix, but we can’t figure out why and when

[EDIT 2] Thanks to observer, i managed to collect those additional information :

When this errors happens : Postgrex.Protocol (#PID<0.422.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.31212.0> exited, i have those lines popping in observer =>

11:06:52:783159 (<0.31212.0>) out {lists,reverse,1}
11:06:52:783212 (<0.31212.0>) in {lists,reverse,1}
11:06:52:783218 (<0.31212.0>) << {inet_reply,#Port<0.71>,ok}
11:06:52:783234 (<0.31212.0>) out {prim_inet,send_recv_reply,2}
11:06:52:783243 (<0.31212.0>) in {prim_inet,send_recv_reply,2}
11:06:52:783247 (<0.31212.0>) out {prim_inet,recv0,3}
11:06:52:797698 (<0.31209.0>) in {cowboy_http,loop,1}
11:06:52:797725 (<0.31209.0>) << {tcp_closed,#Port<0.140>}
11:06:52:797750 (<0.31209.0>) getting_unlinked #Port<0.140>
11:06:52:797786 (<0.31209.0>) out {cowboy_children,terminate_loop,2}
11:06:52:797794 (<0.31212.0>) in {prim_inet,recv0,3}
11:06:52:797800 (<0.31212.0>) exit shutdown
11:06:52:797820 (<0.31212.0>) out_exited 0
11:06:52:797905 (<0.31209.0>) in {cowboy_children,terminate_loop,2}
11:06:52:797911 (<0.31209.0>) getting_unlinked <0.31212.0>
11:06:52:797915 (<0.31209.0>) << {'EXIT',<0.31212.0>,shutdown}
11:06:52:797943 (<0.31209.0>) exit {shutdown,{socket_error,closed,'The socket has been closed.'}}
11:06:52:797951 (<0.31209.0>) out_exited 0

The process was spawn here :

11:06:52:304313 (<0.31209.0>) spawn <0.31212.0> as proc_lib:init_p(<0.31209.0>,[<0.447.0>,<0.446.0>,'Elixir.Tracking.Endpoint',<0.414.0>,<0.413.0>],cowboy_stream_h,request_process,[#{body_length => 136,cert => undefined,has_body => true, [etc...]

I really don’t understand what and why this process is closed and the link is lost

kokolegorille · December 10, 2019, 9:19am

Hello et bienvenue,

Did You try to increase the timeout of the Repo?

put_assoc will do everything in a transaction, if it takes too much time, it will fail.

You can pass a timeout…

MyApp.Repo.update(timeout: 60_000)

Ayliane · December 10, 2019, 9:48am

Thanks for your answer

I did try that (and tried it again just now just to be sure) and it doesn’t work. The transaction is not that long Actually everything is running pretty fast until it blows up with this exit.

I’m trying to follow the incriminated processes with :observer but it gets killed before i can see any stacktrace.

jola · December 10, 2019, 2:12pm

Because this error comes up now and then, and it’s a confusing one, I’ll just clarify what it’s saying. Obviously the error is related to the database query, but it’s not caused by the database query. DBConnection is reporting that the client exited. The client, in this scenario, is the Phoenix request process that initiated the request.

So the Phoenix request process exited and therefore DBConnection errored. A common cause for the Phoenix request to exit is a timeout. Timeouts are not handled by Phoenix at all, rather by Cowboy, which is why Phoenix didn’t log an error. Cowboy will time out a request by default in 60s of inactivity, but there are other options at play as well.

You can find the options here Nine Nines: cowboy_http(3)

You did add http: [protocol_options: [idle_timeout: :infinity]], which was the most likely fix, are you sure it’s in the right config file?

I’m also curious what you see on the browser side (I’m assuming you’re initiating this from a browser). Are you getting any response at all?

Ayliane · December 10, 2019, 2:23pm

Hi Jola, and thanks for your reply

Yeah, i’m also suspecting Cowboy and I’m rather sure I added the idle_timeout in the right config file (config.exs, and just to be sure, also in my env specific files).

I do instantiate the request from a browser but it launches when the page is closed, so i’m not waiting for a response.

I have found a temporary fix : I’ve moved the handle_end_of_connexion(params) function in a worker, and everything works fine again but i’m really not satisfied with this workaround since I’m under the impression that i’m really missing an important point here ><

The point is that all the work done in this handle_end_of_connexion, including the database inserts are NOT taking 60secs. Everything is done in less than 30secs so my feeling is that it’s not the cowboy timeout.

joaoevangelista · December 10, 2019, 3:25pm

Hi

Checking out your code seems that everything is sync with the connection, since the page is closed, the connection waiting for the response, is then closed, cowboy receives this and kills all children of the request process, including your executing connection. Maybe you were wanting is to execute in the background in the first place, and just got things a little confused, but your “workaround” is the best to execute on background and return immediately. Remember that for all requests should be a response.

I might be wrong, on the cowboy killing the connection because it was closed. And everything else

acrolink · December 10, 2019, 3:32pm

Not sure if it is related by try:

Check and increae open file limit.
Increase SWAP memory size.

Ayliane · December 10, 2019, 3:59pm

Hey thanks, but I don’t think it’s that since we have this behavior on production since 2 years now and it just started to react this way But it could explain why the “workaround” works. I’m just lost with this ><

jola · December 10, 2019, 4:41pm

+1 on @joaoevangelista’s explanation, it seems like a race condition where sometimes the query executes before the request is killed. But if you close a page while it’s making a request, the request will get canceled (and all the Phoenix request process and all linked processes will get killed). The reason you started seeing the problem could be anything, maybe the browser cancels the request faster, or the database grew and the query slowed down, etc.

And like João mentioned, if you want something to keep running after the request process exits you need to run it in a different process.

Ayliane · December 16, 2019, 3:30pm

Thanks a lot for all the explanations, it was very usefull ! Did i do well using a worker (We’re currently using Toniq / redis to handle stuff like that) and coming from Rails it didn’t seem “appropriate” to me. Do i have another option to start another process inside a controller ?

dimitarvp · December 16, 2019, 4:13pm

You can do it yourself very easily but what is it that you are trying to do? Start a background job, or something else?

You either have to increase timeouts or use something like Oban (it has a thread here in this forum).

joaoevangelista · December 17, 2019, 1:22am

If you want your task to be durable, for instance your database is down and you need to persist this data, then your approach is good, because AFAIK the failed job will return to the redis queue. If you don’t need the guarantee that it will persist then you can use a Task.Supervisor with the function async_nolink/3.

tanish · September 8, 2023, 9:47am

setup do
# Explicitly get a connection before each test
:ok = Ecto.Adapters.SQL.Sandbox.checkout(DocumentFetcher.Repo)
end

just add this it should workd