Unfortunately, my unit tests completely fail to cover this problem. Publishing the code is not an issue, but reproducing the silently discarded requests really is. The reasons are:
- A Nextflow pipeline needs to be run, and the DNA sequencing data needs to be prepared for the pipeline run.
- The webhook URL is at a private network address.
- Chinese DNA data is prohibited from leaving China under our law.
I added debug logging to my code and did some investigation to identify the cause of this issue. Here I report what I have found so far, although I still have doubts about it. First, some code snippets:
br-tower/lib/br-tower.ex:
def dispatch(event) do
  Logger.info("dispatch/1 received the event: #{inspect(event)}")
  case event do
    %PipelineStarted{basic_info: basic_info, batch_id: batch_id} ->
      basic_info.run_name
      |> Bucket.put(batch_id)
      BatchServer.transition(batch_id, event)
    %PipelineCompleted{basic_info: basic_info, batch_id: batch_id} ->
      batch = BatchServer.transition(batch_id, event)
      basic_info.run_name
      |> Bucket.delete
      batch
    %{basic_info: basic_info} ->
      batch_id = basic_info.run_name
                 |> Bucket.get
      case batch_id do
        nil ->
          {:error, "batch_id not found"}
        _ ->
          BatchServer.transition(batch_id, event)
      end
  end
end
dispatch/1, one of the most important functions in the API, performs a state transition, saves the new state to ETS, and returns the new state. There is really no complex logic in it. BatchServer.transition/2 is the client function that actually invokes GenServer.call. As I understand it, GenServer.call blocks the client process until the reply comes back. It has a 5-second timeout by default, but the timeout does not stop the processing in the server process. The issue occurs only when several webhook POST requests arrive at the endpoint at almost the same time. Each request was sent again immediately after it was first sent, perhaps because the Nextflow runner did not receive any response.
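To make the timeout behaviour described above concrete, here is a minimal sketch. SlowServer and its functions are hypothetical names, not part of br-tower, and the 200 ms sleep merely stands in for slow work. It shows that when GenServer.call/3 times out, it is the calling process that exits, while the server finishes the request and keeps the new state.

```elixir
defmodule SlowServer do
  use GenServer

  # Hypothetical server whose handler takes longer than the caller will wait.
  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, 0)

  def slow_increment(pid, timeout), do: GenServer.call(pid, :slow_increment, timeout)
  def value(pid), do: GenServer.call(pid, :value)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:slow_increment, _from, count) do
    # Stand-in for slow work inside the server process.
    Process.sleep(200)
    {:reply, count + 1, count + 1}
  end

  def handle_call(:value, _from, count), do: {:reply, count, count}
end

{:ok, pid} = SlowServer.start_link()

# The caller gives up after 50 ms; the timeout surfaces as an exit in the caller.
result =
  try do
    SlowServer.slow_increment(pid, 50)
  catch
    :exit, {:timeout, _} -> :caller_timed_out
  end

# The server was not interrupted: this call queues behind the slow one
# and observes the updated state.
final = SlowServer.value(pid)
IO.inspect({result, final})
```

If such an exit is not caught inside a request handler process, that process terminates. Whether this is what happens to the Cowboy handlers in my case is only a guess at this point.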
br-tower/lib/boundary/batch_server.ex:
def handle_call({:transition, event}, _from, batch) do
  batch = handle_event(batch, event)
  {:reply, batch, Queries.update(batch), @default_timeout}
end
defp call(via, action) do
  pid =
    case GenServer.whereis(via) do
      nil ->
        {:ok, pid} = BatchSupervisor.start_batch_server(via)
        pid
      pid ->
        pid
    end
  Logger.info("before GenServer.call/2: #{inspect(pid)}, #{inspect(via)}, #{inspect(action)}")
  GenServer.call(pid, action)
end
br-tower/lib/boundary/queries.ex:
def update(%Batch{} = batch) do
  true = :ets.insert(:batches, {batch.id, batch})
  batch
end
def update(_arg), do: :invalid_batch_struct
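As a sanity check on how :ets.insert/2 behaves under concurrent writers, here is a small standalone sketch. The table name :batches_sketch, the :public access option, and the map rows are assumptions made for illustration; I have not shown how the real :batches table is created. Each single-object insert is atomic, and in a :set table re-inserting an existing key replaces the row.

```elixir
# Hypothetical table: :public lets any process write, :set keeps one row per key.
:ets.new(:batches_sketch, [:set, :public, :named_table])

# Fifty processes insert rows with distinct keys at the same time.
1..50
|> Enum.map(fn id ->
  Task.async(fn -> :ets.insert(:batches_sketch, {id, %{id: id, status: :started}}) end)
end)
|> Task.await_many()

count = :ets.info(:batches_sketch, :size)

# Re-inserting an existing key overwrites the previous row.
:ets.insert(:batches_sketch, {1, %{id: 1, status: :completed}})
[{1, row}] = :ets.lookup(:batches_sketch, 1)
IO.inspect({count, row.status})
```

Note that with the default :protected access, only the process that owns the table may write to it, and an insert from any other process raises ArgumentError. That would be worth ruling out for the real table.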
The debug output indicates that GenServer.call hangs (as you said). A few subsequent webhook requests were silently discarded by Cowboy and never handed over to the router, and then things went back to normal. Cowboy creates a process for each request. Why were these handler processes also brought down? The GenServer process handles the messages in its mailbox one by one, which provides the synchronization. Why was the new state not kept in the GenServer or in the ETS table? I suspect update/1 may trigger a race condition.
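The one-by-one mailbox handling mentioned above can be illustrated with a minimal sketch. SerialCounter is a hypothetical module, not the real BatchServer. One hundred concurrent callers each receive a distinct reply, which shows that the read-modify-write inside handle_call/3 is never interleaved.

```elixir
defmodule SerialCounter do
  use GenServer

  # Hypothetical counter used only to illustrate serialized message handling.
  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, 0)
  def increment(pid), do: GenServer.call(pid, :increment)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:increment, _from, count) do
    # Safe without locks: only this process reads and writes the state.
    {:reply, count + 1, count + 1}
  end
end

{:ok, pid} = SerialCounter.start_link()

# Fire 100 calls from 100 separate processes at roughly the same time.
replies =
  1..100
  |> Enum.map(fn _ -> Task.async(fn -> SerialCounter.increment(pid) end) end)
  |> Task.await_many()

# Every caller saw a different counter value, so no update was lost.
all_distinct = Enum.sort(replies) == Enum.to_list(1..100)
IO.inspect(all_distinct)
```

If the same holds for the BatchServer process, a lost state update would more likely come from the process being restarted or from a different process handling the event than from interleaved handle_call/3 executions, though I have not confirmed this.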
@derek-zhou Thank you for your helpful feedback.