Keep phoenix/cowboy requests alive after drop

Fake511 · September 15, 2020, 5:02pm

I’m having some trouble with a phoenix app. Essentially, the issue is that I’ve got mnesia in front of a database. I’m doing updates by having a db transaction inside an mnesia transaction. It seems that sometimes a request can get dropped after the db transaction is done but before the mnesia transaction is done. This leaves my app in a broken state - db and mnesia are out of sync.

Is there any way I can have the request processing continue after a request is dropped, so that I can manage proper cleanup if needed?

outlog · September 15, 2020, 8:18pm

Hi Peter, welcome!

Not sure I completely understand the use case… I assume mnesia is used as a cache? eg. you gotta maintain single source of truth?

that said I would look at using Ecto.Multi https://hexdocs.pm/ecto/Ecto.Multi.html

if you want to do the optimistic path (which might “lie”) do the mnesia first in the “multi” then the db stuff - if db fails you can cleanup whatever made it into mnesia.

if you want to always be “truthful” do the db stuff first then the mnesia in the “multi”…

this ensures correct data at all times… so definitely take a look at implementing that…

though the above should be implemented for data guarantees, it’s also relevant to look at the root cause - a dropped request (client/network) killing the underlying processes. I’ve seen this happen, so I know it can happen…

how long does the request take and the data handling?

most likely look at protocol_options https://hexdocs.pm/plug_cowboy/Plug.Cowboy.html that leads to https://ninenines.eu/docs/en/cowboy/2.5/manual/cowboy_http/ - see this one for example on config’ing idle_timeout https://github.com/phoenixframework/phoenix/issues/3190

my best guess would be the shutdown_timeout - eg your data processing takes 6secs or more (due to queues or what not) and a dropped connection kills it after 5 secs
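A minimal sketch of where those options would go, assuming a standard Phoenix endpoint config. The app name `:my_app`, the endpoint module `MyAppWeb.Endpoint`, the port, and the timeout values are all hypothetical. `idle_timeout` and `shutdown_timeout` are `cowboy_http` options passed through `protocol_options`; whether raising `shutdown_timeout` actually helps here is a guess, not something confirmed in this thread.

```elixir
# config/prod.exs
# :my_app and MyAppWeb.Endpoint are placeholder names, not from the thread.
import Config

config :my_app, MyAppWeb.Endpoint,
  http: [
    port: 4000,
    protocol_options: [
      # ms a connection may sit idle before Cowboy closes it
      idle_timeout: 60_000,
      # ms Cowboy waits for child processes to shut down before killing them
      shutdown_timeout: 10_000
    ]
  ]
```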

if you want to replicate locally I think something like this: (untested)

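  def create_data(conn, _params) do
    task_time = 20_000
    # Linked task: it should die together with the request process.
    task = Task.async(fn ->
      Process.sleep(task_time)
      IO.inspect("task done")
      "done"
    end)
    # Task.await/2 defaults to a 5s timeout, so allow for the full task time.
    data = Task.await(task, task_time + 5_000)
    IO.inspect("got data")
    json(conn, %{data: data})
  end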

if you hit that controller and then kill browser window the linked task.async should be killed after 5secs if I understand correctly, and you will not see “task done” in iex… might be totally wrong though - wouldn’t be the first time…
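The linking part can be checked without Phoenix at all. Below is a rough sketch that uses a plain spawned process as a stand-in for the request process and kills it by hand; it only shows that a `Task.async` task dies with its owner, not when or whether Cowboy kills the request process. The variable names and timings are made up for illustration.

```elixir
parent = self()

# Stand-in for the request process (hypothetical, not a real Cowboy process).
request =
  spawn(fn ->
    # Task.async links the task to the calling process.
    Task.async(fn ->
      Process.sleep(200)
      send(parent, :task_done)
    end)

    Process.sleep(:infinity)
  end)

# Simulate the request process being killed while the task is still running.
Process.sleep(50)
Process.exit(request, :kill)

outcome =
  receive do
    :task_done -> :task_survived
  after
    500 -> :task_killed
  end

IO.inspect(outcome)
```

Since the task is linked and does not trap exits, it goes down with the stand-in request process and the message never arrives.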

Edit: actually cowboy might even kill a “dirty” (eg. side effect: mnesia + db transaction) multi, so maybe skip that part and jump straight to replicating locally and looking at cowboy options…

Fake511 · September 15, 2020, 10:27pm

Thanks

Yeah, using mnesia as a cache in front of a MySQL db. I would have considered just mnesia but that leaves me with problems if netsplits happen.

Anyway, will look into Ecto.Multi though I don’t think it will be a solution to this specific problem - but it might simplify some logic. The cowboy protocol options look more promising, maybe I can find something there. My plan B at the moment is simply spawning processes that aren’t linked to the controller process. It’s messier but might be more manageable.

chulkilee · September 16, 2020, 2:59am

Sounds like you need to decouple web request process (from phoenix) and your actual work process - so that when the former dies, you want to keep the latter finishing the job.

However, I don’t think phoenix would kill the web request process when a request is dropped (e.g. connection is closed or dropped) - or does it? Could you reproduce that?

Fake511 · September 16, 2020, 4:54am

Yeah, I’m thinking this might be the answer as well. Was thinking maybe there’s an option for it, but will definitely look in this direction if I don’t find something.

Running a test where the controller sleeps for a period and does io output before and after, it will only do the output before the sleep, not after, if the connection is dropped. Need to confirm with side effects more in line with my app but I think it’s the case

Fake511 · September 17, 2020, 6:25pm

The solution was indeed to run the code as async tasks to have it in separate processes. Running it using Task.Supervisor.async_nolink worked well and fixed the issues.
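For anyone finding this later, here is a rough sketch of the idea, not the actual app code. A plain spawned process stands in for the controller process, and `Demo.TaskSupervisor` is a made-up name; in a real app the supervisor would normally sit in the application's supervision tree as `{Task.Supervisor, name: ...}` instead of being started inline.

```elixir
# Hypothetical supervisor name; started inline only to keep the sketch standalone.
{:ok, _sup} = Task.Supervisor.start_link(name: Demo.TaskSupervisor)

parent = self()

# Stand-in for the controller/request process.
request =
  spawn(fn ->
    # Not linked to the caller: the task is owned by the supervisor.
    Task.Supervisor.async_nolink(Demo.TaskSupervisor, fn ->
      # Placeholder for the db + mnesia transaction.
      Process.sleep(200)
      send(parent, :work_finished)
    end)

    Process.sleep(:infinity)
  end)

# Simulate the request process dying mid-work.
Process.sleep(50)
Process.exit(request, :kill)

result =
  receive do
    :work_finished -> :work_finished
  after
    1_000 -> :work_lost
  end

IO.inspect(result)
```

In a controller you would typically still await or yield on the returned task to build the response; the point is only that the work itself keeps running if the caller goes away.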

outlog · September 17, 2020, 8:05pm

did you explore any of the cowboy options? to no avail or?

also how long running were the queries? short or multiple secs?

Fake511 · September 18, 2020, 7:20am

Looked at them, but as far as I can tell there already is a default timeout for shutdown set to 5 secs. Requests were dropped after 5 secs which caused the issue - changing to a 10 second timeout on the calling end fixed it. So we definitely did not have 5+5 secs before. But no, haven’t tried increasing that timeout, will give it a test.

And 99% of the requests run in less than 100ms, we just have a few longer running ones due to mnesia restarting transactions.