Cowboy handlers proably exit, not being logged or reported anywhere

hubertlepicki · April 19, 2021, 11:10am

So I kind of knew of this behavior and I stumbled upon it before but actually I am not sure what is the best way to solve / debug the issue.

So on production I am observing on occasion requests that are being sent to our system, and I can see the initial Logger line for the request as being handled by the app:

12:59:19.614 request_id=Fnc9RXabbAmFCZ0AAAdG [info] GET /

but for these requests I am missing the log entry that a response was sent, i.e. no line like:

12:59:19.624 request_id=Fnc9RXabbAmFCZ0AAAdG  Sent 200 in 10ms

As far as I understand, this means the Cowboy handler process exitted before it’s registered “before_send” callback is executed, which is being installed here by Plug.Logger:

github.com

elixir-plug/plug/blob/v1.11.1/lib/plug/logger.ex#L35


end

@impl true
def call(conn, level) do
  Logger.log(level, fn ->
    [conn.method, ?\s, conn.request_path]
  end)

  start = System.monotonic_time()

  Conn.register_before_send(conn, fn conn ->
    Logger.log(level, fn ->
      stop = System.monotonic_time()
      diff = System.convert_time_unit(stop - start, :native, :microsecond)
      status = Integer.to_string(conn.status)

      [connection_type(conn), ?\s, status, " in ", formatted_diff(diff)]
    end)

    conn
  end)

So, I have stumbled upon the situation that my requests fail, process exits, and I have no entry in logs, nor in AppSignal (we use Appsignal.Plug) as it’s just handling throws and exceptions, but not exits.

Since Cowboy is built not on OTP primitives, we don’t see the usual crash reports for these processes either. They seem to silently fail.

Now, I need to debug and fix the issue but also monitor for this not happening in the future. So I have several questions:

Am I missing something that on Cowboy / Plug / Phoenix level would detect, log and report these mysterious exits? I think the answer is “no” and that situation is simply not handled by the above.
The implementation should be as per this blog post (of mine) from 2 years ago, i.e. start a monitoring process for each request handler in a plug and log / report exits or is there some ready to use piece of infrastructure / library I should be using instead? When web requests fail in Elixir and Phoenix | AmberBit Sp. z o. o.

al2o3cr · April 19, 2021, 2:20pm

What’s the log level set to in production? IIRC the “unexpected exit” reports are written at info.

axelson · April 19, 2021, 6:12pm

Hmm, seems like a potentially similar issue to this recent (unresolved) post: Phoenix didn't handle some http requests

josevalim · April 19, 2021, 7:23pm

Most likely the request process crashed due to a link. Cowboy should report those, unless the exit reason is shutdown, which doesn’t log anything throughout OTP.

Given you know the request path, see if that path is starting or communicating with any process that might exit.

hubertlepicki · April 20, 2021, 7:45am

So what is happening is when a linked process crashes (as in throws an exception) I am getting crash reports properly:

  def index(conn, _params) do                                                                                                                                            
    Agent.start_link fn ->                                                                                                                                               
      raise "wat"                                                                                                                                                        
    end                                                                                                                                                                  
                                                                                                                                                                         
    render(conn, "index.html")                                                                                                                                           
  end

09:31:10.288 [error] CRASH REPORT Process <0.1088.0> with 1 neighbours crashed with reason: #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"wat">>} in 'Elixir.UI.DashboardController':'-index/2-fun-0-'/0 line 8
09:31:10.289 [error] Cowboy stream 8 with ranch listener 'Elixir.UI.Endpoint.HTTP' and connection process <0.995.0> had its request process exit with reason: #{'__exception__' => true,'__struct__' => 'Elixir.RuntimeError',message => <<"wat">>} in 'Elixir.UI.DashboardController':'-index/2-fun-0-'/0 line 8

When a linked process exits, however, like this:

  def index(conn, _params) do                                                                                                                                            
    Agent.start_link fn ->                                                                                                                                               
      Process.exit(self(), :kill)                                                                                                                                        
    end                                                                                                                                                                  
                                                                                                                                                                         
    render(conn, "index.html")                                                                                                                                           
  end

I only get the initla Plug.Logger line and then nothing, and the browser keeps spinning and loads this error:

Note: the exit reason can be anything, not just :normal, and Cowboy won’t report anything for me.

Similarly to zhangzhen 's problem, this is happening on prod when I handle webhooks (although from different system) and we did have issues that for example the JSON they were sending us didn’t comply with standard. I suspect there’s a similar issue here, that the machine-generated payload somehow is messed up and some linked process (or the process of handler) performs exit() ? rather than a crash.

My plan is, as I know the path that this is happening on, is to install a monitor from within a custom plug, and collect the exit reasons from requests on that path and maybe this will give me some idea what’s going on behind the scenes.

It’s a separate question if Cowboy / Plug / Phoenix should handle the situation more gracefully. In Cowboy’s documentation there are some hints that you want to monitor handler processes, like this one from here https://ninenines.eu/docs/en/cowboy/2.6/guide/handlers/:

This callback is optional because it is rarely necessary. Cleanup should be done in separate processes directly (by monitoring the handler process to detect when it exits).

but I understand that monitoring all handlers would have some performance punishment.

josevalim · April 20, 2021, 5:06pm

Can you try with something other than kill? That’s the “strongest” exit signal someone can submit and therefore we can’t generalize it.

Note this is not related to handlers though. Who is reporting this from Cowboy’s side is the connection, not the handler. And given Cowboy is already monitoring the handler process, I don’t see why it wouldn’t report other reasons too.

josevalim · April 20, 2021, 5:25pm

I have investigated this a bit, the issue is here:

github.com

ninenines/cowboy/blob/master/src/cowboy_stream_h.erl#L127-L141


info(StreamID, Info={'EXIT', Pid, normal}, State=#state{pid=Pid}) ->
	do_info(StreamID, Info, [stop], State);
info(StreamID, Info={'EXIT', Pid, {{request_error, Reason, _HumanReadable}, _}},
		State=#state{pid=Pid}) ->
	Status = case Reason of
		timeout -> 408;
		payload_too_large -> 413;
		_ -> 400
	end,
	%% @todo Headers? Details in body? Log the crash? More stuff in debug only?
	do_info(StreamID, Info, [
		{error_response, Status, #{<<"content-length">> => <<"0">>}, <<>>},
		stop
	], State);
info(StreamID, Exit={'EXIT', Pid, {Reason, Stacktrace}}, State=#state{ref=Ref, pid=Pid}) ->

Cowboy is not handling a {'EXIT', Pid, Whatever}, which is the format of exit reasons from linked processes. I would open up an issue on Cowboy and ask if they would consider adding a catch clause.

hubertlepicki · April 20, 2021, 7:04pm

Oh, I did assume they don’t intend to handle it but looking at the code it is more likely an unintended bug. I will report to them.

hubertlepicki · April 21, 2021, 8:32am

Actually I looked at the code more in details in Cowboy and it looks like they have the try/catch clause and this is the only way they are catching these exits.

So I think it is meant to catch exits coming from the process itself (i.,e. using exit("SOMETHING") in Elixir), but if some other process sends it an exit signal (i.e. if using Process.exit(pid, "SOMETHING") thisi s not being caught unless process is trapping exits, but even then the :kill wouldn’t be captured I believe.

So it’s not as simple as adding clause here as it won’t be caught by try/catch clause. The only way to detect this situation I think is to monitor the process that is being sent an exit signal.

I will open an issue on Cowboy but I doubt this is something they want to handle. We’ll see.

josevalim · April 22, 2021, 1:30pm

This is not correct. cowboy_http is handling exits and the code I linked above is the result of receiving a {'EXIT', PID, Reason):

github.com

ninenines/cowboy/blob/master/src/cowboy_http.erl#L258-L260


      
          		{'$gen_call', From, Call} ->
          			cowboy_children:handle_supervisor_call(Call, From, Children, ?MODULE),
          			loop(State);

hubertlepicki · April 22, 2021, 1:33pm

Yes, verified that, you are correct. It only won’t catch :kill I think.

zhangzhen · April 23, 2021, 8:54am

@hubertlepicki do you figure out what process sent the exit signal with :kill reason to the handler request?

Zhen

hubertlepicki · April 23, 2021, 10:05am

Sort of. Not fully.

So I have installed monitor on the cowboy processes that handle requests so I learn about their exit reasons. I caught this thing:

%Plug.Parsers.RequestTooLargeError{..}

Which doesn’t appear anywhere else in my logs or caught exceptions. It looks like some requests coming in are exceeding default 8MB of payload sizez, which triggers this error.

Now this is not really a kill reason I at all but this exception. I can replicate this reliabily against my app with:

curl --data "@/path/to/large/file" http://localhost:8080/webhooks/outbound/test

if my file size exceeds 8MB it crashes with no log line saying that response was sent.

Now, with log level set to :debug I see the actual error:

iex(a@127.0.0.1)8> 11:56:03.286 http_method=POST http_path=/webhooks/outbound/asdfasdf request_id=Fnh0I9UHbqEMJXkAAEjC remote_ip=127.0.0.1 [info] POST /webhooks/outbound/asdfasdf
11:56:03.288 http_method=POST http_path=/webhooks/outbound/asdfasdf request_id=Fnh0I9UHbqEMJXkAAEjC remote_ip=127.0.0.1 [info] Elixir.Infra.ExceptionCatcher.catch_exits/1 starts monitoring a pid: #PID<0.4378.0>
11:56:03.548 http_method=POST http_path=/webhooks/outbound/asdfasdf request_id=Fnh0I9UHbqEMJXkAAEjC remote_ip=127.0.0.1 [debug] ** (Plug.Parsers.RequestTooLargeError) the request is too large. If you are willing to process larger requests, please give a :length to Plug.Parsers
    (plug 1.11.1) lib/plug/parsers.ex:348: Plug.Parsers.reduce/8
    (webhooks 0.1.0) lib/webhooks/endpoint.ex:1: Webhooks.Endpoint.plug_builder_call/2
    (webhooks 0.1.0) lib/webhooks/endpoint.ex:4: anonymous fn/3 in Webhooks.Endpoint."call (overridable 3)"/2
    (appsignal 2.1.3) lib/appsignal/instrumentation.ex:9: Appsignal.Instrumentation.instrument/1
    (webhooks 0.1.0) lib/plug/debugger.ex:136: Webhooks.Endpoint."call (overridable 4)"/2
    (webhooks 0.1.0) lib/webhooks/endpoint.ex:1: Webhooks.Endpoint."call (overridable 5)"/2
    (webhooks 0.1.0) lib/plug/error_handler.ex:65: Webhooks.Endpoint.call/2
    (phoenix 1.5.8) lib/phoenix/endpoint/cowboy2_handler.ex:65: Phoenix.Endpoint.Cowboy2Handler.init/4
    ...
    (stdlib 3.13.2) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

11:56:03.549 [error] Elixir.Infra.ExceptionCatcher.catch_exits/1 detected a process that crashed: #PID<0.4378.0>

Note the last line is added by me and I cut off some project-specific lines from stacktrace.

I don’t quite understand why this exception is not handled similar way to others, other than it’s happening early in the Endpoint life cycle, during parsing the request body.

josevalim · April 23, 2021, 2:57pm

Plug.Cowboy only logs exceptions with status >= 500 but we literally merged a PR yesterday to make that configurable.

hubertlepicki · April 23, 2021, 3:17pm

Satisfying timing.

zhangzhen · April 25, 2021, 10:15am

@hubertlepicki:
I used the CrashesMonitor you wrote to print out the reason why the request handler process is brought down. Unfortunately, no related information was printed out in the console when requests were discarded. Should the monitor_crashes plug be placed in the endpoint.ex instead of in the controller? Because I suspect the request handling stops before going into the router plug. I investigate the requests that were silently discarded and found out that each request was sent twice in very short time interval by the nextflow runner. More that one week passed, this issue has not yet resolved, which frustrates me.

hubertlepicki · April 25, 2021, 10:47am

Hey… Which code are you using exactly? Because I don’t remember posting this code…

zhangzhen · April 25, 2021, 1:20pm

in the article above you put the code.

zhangzhen · April 26, 2021, 6:31am

the exit reason is :shutdown. Could anyone explain how cowboy works under the hood?

derek-zhou · April 26, 2021, 2:20pm

It will be much easier if you can trim out all proprietary information and make a minimal case so other people can reproduce locally. Since you said the controller does not matter, you can just remove it or make a simple stub. The external webhook can be replaced with curl invocations.