Cowboy handlers proably exit, not being logged or reported anywhere

zhangzhen · April 27, 2021, 3:25am

Unfortunately, my unit tests totally fail to cover this problem. Putting the code to the public is not an issue, but reproducing that some requests is silently discarded really is. The reason is

Nextflow pipeline needs to be run and the dna sequencing data needs to be prepared for the pipeline run.
The webhook url is at the private internet address.
Chinese dna data is prohibited to go outside of China according to our law.

I added the debug info to my code and did some investigation to identify the cause of this issue. I report what I find till today here and have still doubts about it. First I show some code snippets:
br-tower/lib/br-tower.ex

  def dispatch(event) do
    Logger.info("dispatch/1 received the event: #{inspect(event)}")
    case event do
      %PipelineStarted{basic_info: basic_info, batch_id: batch_id} ->
        basic_info.run_name
        |> Bucket.put(batch_id)
        BatchServer.transition(batch_id, event)
      %PipelineCompleted{basic_info: basic_info, batch_id: batch_id} ->
        batch = BatchServer.transition(batch_id, event)
        basic_info.run_name
        |> Bucket.delete
        batch
      %{basic_info: basic_info} ->
        batch_id = basic_info.run_name 
                   |> Bucket.get
        case batch_id do
          nil ->
            {:error, "batch_id not found"}
          _ -> 
            BatchServer.transition(batch_id, event)
        end
    end
  end

dispatch/1, one of the most important functions in the API, does some state transformation and saves the new state to ETS, and returns the new state. There is really no complex logic in it. BatchServer.transition/2 is the client function that invokes GenServer.call indeed. As to my understanding, GenServer.call will block the client process until the reply comes back. It has 5 second timeout by default, but the timeout won’t stop the processing in the server process. The issue occurs only when a few webhook post requests arrive the endpoint almost at the same time. Each request was sent again immediately after it was sent perhaps because the nextflow runner didn’t receive any response.

br-tower/lib/boundary/batch_server.ex:

  def handle_call({:transition, event}, _from, batch) do
    batch = handle_event(batch, event)
    {:reply, batch, Queries.update(batch), @default_timeout}
  end

  defp call(via, action) do
    pid =
      case GenServer.whereis(via) do
        nil ->
          {:ok, pid} = BatchSupervisor.start_batch_server(via)
          pid
        pid ->
          pid
      end
    Logger.info("before GenServer.call/2: #{inspect(pid)}, #{inspect(via)}, #{inspect(action)}")
    GenServer.call(pid, action)
  end

br-tower/lib/boundary/queries.ex

  def update(%Batch{} = batch) do
    true = :ets.insert(:batches, {batch.id, batch})
    batch
  end
  def update(_arg), do: :invalid_batch_struct

The debug info indicates that the GenServer.call hangs (as you said). A few subsequent webhook requests were silently discarded by Cowboy and were not handed over to the router, and then things went back to normal again. Cowboy creates a process for each request. Why were also these handler processes brought down? The GenServer process handles messages in its mailbox one by one, realizing the synchronization. Why were the new state not kept in the GenServer and also in the ETS table. I suspect update/1 may trigger the race condition.

@derek-zhou Thank you for your helpful feedback.

derek-zhou · April 27, 2021, 1:24pm

It is most likely several requests are handled at the same time and all got stuck at the hanged GenServer call. 5 seconds timeout happened and they all error out.

I find it very peculiar in the way you use the GenServer and ETS. you seems to be dynamically creating GenServer based on an external (webhook) originated batch_id? This is very vulnerable for malicious traffic.
And all the GenServer store some data in a shared ETS table using the batch_id as the key? (is is a set, bag or something else?) Why do you need the ETS, you can just keep the batch in the GenServer,

zhangzhen · April 27, 2021, 11:43pm

The explanation sounds reasonable and I personally agree with it, but no debug info supports it.

yes. My implementation is based upon the talk Lance Halvorsen @lance gave at ElixirConf 2018: ElixirConf 2018 - Elixir at a Walking Pace - Lance Halvorsen.

It is a set.

The reason is performance and scalability issues the GenServer approach might cause, which is described in Section 10.3 of the book Sasa Juric @sasajuric wrote: Elixir in Action, Second Edition.

I’m planning to use RabbitMQ to solve this issue. Once a Webhook event is received at Endpoint, it is put into a queue and 201 response is sent back to Nextflow runner immediately. The BatchServer takes the events from the queue and can process them at its own pace.

derek-zhou · April 28, 2021, 12:06am

If I were you, I will pre-create all my GenServers with legal batch_id. Then I will just reject illegal batch_id from the web hook. This is both simpler and safer.

Since you are accessing the data through a GenServer anyway, it is much faster to skip ETS altogether. I will not consider heavy weight tools until I am 100% sure my need cannot be served anyway else. it is the KISS doctrine.

zhangzhen · May 10, 2021, 6:51am

@hubertlepicki
The GET /api/batches/CT20_DS_downsample_1000X/items/RD2103883FFP request was sent to my web endpoint. The endpoint did nothing and no log information was printed out (the log level is set to :debug). I have no idea of what I do to address this problem.

prodis · August 24, 2021, 2:57pm

I had a similar problem with long requests (more than 60 seconds to respond) in a Phoenix API. Basically the process was gone and no more information or logs about it.

At the end I figured out that the reason was Cowboy exiting from the current process and closing the HTTP connection. From Cowboy 2.0, there is an idle timeout configuration option, that defaults to 60.000 milliseconds: Nine Nines: cowboy_http(3)

You can setup this timeout via Plug.Cowboy protocol_options if you are using Phoenix.

For example:

config :api, MyAPIWeb.Endpoint,
  http: [
    port: 4000,
    protocol_options: [idle_timeout: :infinity]
  ],
  ...

For a more detailed explanation and examples, this blog post helped me.