"No response body" on particular Stripe webhook events

Hi, all!

I’m using Stripe on my project and I set up a webhook endpoint following this guide: Stripe Webhooks in Phoenix with Elixir Pattern Matching | Conner Fritz

It uses stripity-stripe; it’s pretty basic and works well, except it sometimes returns “No response body” on some particular events and I’m not sure why.

At first I had this issue consistently, and I realized it was because I wasn’t returning a 200 immediately after the event was received, so I worked around it by triggering a Task to do what I needed done and returning a 200 right after.

So the way my webhook handlers look right now is:

defp handle_webhook(%{type: "some.stripe.event.type"}) do
  IO.puts("Something happened")

  Task.start(fn ->
    # Do something potentially slow
  end)

  :ok
end

So when the webhook handler returns that :ok, the function that receives all webhooks calls a function that sends a 200 as the response:

  defp handle_success(conn) do
    conn
    |> put_resp_content_type("text/plain")
    |> send_resp(200, "ok")
  end
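
For reference, the surrounding controller action looks roughly like this (simplified sketch; create/2, handle_error/1 and the :stripe_event assign are just how I’ve named things, not necessarily what the guide uses):

def create(%{assigns: %{stripe_event: event}} = conn, _params) do
  # assuming a plug earlier in the pipeline verified the signature and assigned the event
  case handle_webhook(event) do
    :ok -> handle_success(conn)
    _ -> handle_error(conn)
  end
end

defp handle_error(conn) do
  conn
  |> put_resp_content_type("text/plain")
  |> send_resp(400, "error")
end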

So… am I missing something here? Is there something in this pattern that could potentially slow down returning a 200 to Stripe?

Thank you :slight_smile:

From the code you’ve shown, it looks OK, although you should probably follow the docs and run your tasks under a supervisor, or better yet use something like Oban so as not to lose anything.

1 Like

Thank you for your answer!

I’m still quite new to Elixir’s mechanisms for concurrency; what would be the difference in running the tasks under a supervisor?
You mentioned the possibility of losing something… how would something like that happen?

When you start a task under a supervisor, you get better observability (via the observer) and proper graceful shutdown - when you’re shutting down your application (e.g. because you’re deploying a new version), the application will wait until the task completes. From the docs for Task.start:

If the current node is shutdown, the node will terminate even if the task was not completed. For this reason, we recommend to use Task.Supervisor.start_child/2 instead, which allows you to control the shutdown time via the :shutdown option.

With plain Task.start, a VM shutdown will just kill the task immediately, which means you might lose the webhook (Stripe won’t retry, as you’ve confirmed it’s been accepted).

Moving to a Task.Supervisor is pretty straightforward:

  • Add {Task.Supervisor, name: MyApp.TaskSupervisor} to your children in MyApp.Application (see the sketch after the start_child example below),
  • Start tasks under the supervisor with Task.Supervisor.start_child:
Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  IO.puts "I am running in a task"
end)
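
For the first bullet, a minimal sketch of what the children list might look like in a typical MyApp.Application (your existing children will differ):

# lib/my_app/application.ex
def start(_type, _args) do
  children = [
    MyApp.Repo,
    MyAppWeb.Endpoint,
    # the supervisor that will own the webhook tasks
    {Task.Supervisor, name: MyApp.TaskSupervisor}
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end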

start_child seems to be what you need in your use case, but see the docs if you need to await the result of the task.
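
For completeness, the awaiting variant would look something like this (not what you want in a webhook handler, where you want to reply to Stripe right away):

task =
  Task.Supervisor.async(MyApp.TaskSupervisor, fn ->
    # do something potentially slow and return a result
    :done
  end)

# blocks the caller until the task finishes (or the 30s timeout is hit)
result = Task.await(task, 30_000)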

The rule of thumb is that no task should run unsupervised (though if the task is guaranteed to complete before the code that invokes it, you should be fine using them directly).

But all of this won’t protect you from the VM being insta-killed from the outside (e.g. the OOM killer or a deployment process that doesn’t do proper app termination) - you’re still susceptible to losing a webhook/task in the middle of processing. The answer to that would be to use persistent background jobs for processing. Oban is an amazing tool for that - you would enqueue a job in the controller (which is really fast), the processing happens in the background, and you get all the features (retries, backoff, concurrency control, monitoring, etc.) as a bonus.

5 Likes

Thank you, this is actually something I haven’t thought of!

My application is on Cloud Run, which I imagine means that when it changes revisions it could flush all the data in memory while swapping containers in the middle of a payment, so having it saved in the database is probably a very good idea.

I’m reading the Oban docs now as per your suggestion, and my question is: in the webhook handler, should I call Oban.insert() within a Task.Supervisor? It seems like too many layers, but since Oban is writing to the database, wouldn’t that mean there would be a little delay before the handler could send a response to the webhook?

Enqueueing an Oban job is really fast since it’s just a single insert into your DB, so it can be - and in this case should be - done synchronously. Yes, it introduces a small delay, but you gain consistency, plus I’m sure you’ll still have plenty of time to do some more stuff in the handler if you need to. So it would be something like:

defp handle_webhook(%{type: "some.stripe.event.type"} = event) do
  event
  |> HandleStripeWebhookJob.new()
  # or Oban.insert() if you want explicit error handling, but this is unlikely to fail
  |> Oban.insert!()
end

Oban spawns a pool of workers that will process this job asynchronously in the background.
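
The worker itself would be something along these lines (the queue name, max_attempts and the body of perform/1 are just illustrative):

defmodule MyApp.HandleStripeWebhookJob do
  use Oban.Worker, queue: :webhooks, max_attempts: 5

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"type" => type} = _args}) do
    # args are serialized to JSON, so map keys come back as strings.
    # Do the potentially slow work here. Returning :ok marks the job as
    # completed; returning {:error, reason} or raising makes Oban retry
    # it with backoff, up to max_attempts times.
    IO.puts("Processing #{type}")
    :ok
  end
end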

One of the coolest features of Oban is that it uses your DB and your Ecto repo, so you can insert a job from within a transaction for even more consistency. For example, you can store the event separately for better inspection:

defp handle_webhook(%{type: "some.stripe.event.type"} = event_data) do
  # returns {:ok, event} or {:error, reason}
  Repo.transaction(fn ->
    with {:ok, event} <- store_event(event_data),
         {:ok, _job} <- enqueue_job(event) do
      # event was stored and the job was enqueued
      # Repo.transaction will wrap "event" in an {:ok, event} tuple
      event
    else
      # something went wrong; discard the event and the job
      # Repo.transaction will wrap the "reason" in {:error, reason} tuple 
      {:error, reason} -> Repo.rollback(reason)
    end
  end)
end

defp store_event(event_data) do
  %{event_data: event_data}
  # assuming we have an Ecto schema named "Event"
  |> Event.create_changeset()
  |> Repo.insert()
end

defp enqueue_job(%Event{} = event) do
  %{event_id: event.id}
  |> HandleStripeWebhookJob.new()
  # Oban.insert is Repo.insert with extra features
  |> Oban.insert()
end
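
With this variant the worker’s perform/1 would look the stored event up again instead of getting the whole payload in its args (a sketch; do_the_slow_work/1 is a placeholder for the actual processing):

@impl Oban.Worker
def perform(%Oban.Job{args: %{"event_id" => event_id}}) do
  event = Repo.get!(Event, event_id)
  # placeholder for whatever you need to do with event.event_data
  do_the_slow_work(event)
  :ok
end
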
5 Likes

This is a great answer, thank you so much for this!

I’ve been meaning to look into Oban for a while, but I promised myself not to add unnecessary complexity until I actually needed it, so I’m glad to see that this complexity is not all that complex, at least for the very basic use case :slight_smile:

2 Likes

I ran into a similar problem following the same article you mentioned. I think I found the fix. In Conner’s code, I believe the offending lines are:

{:ok, body, _} = Plug.Conn.read_body(conn)
conn

The conn returned is ‘stale’ in a sense. So I changed this to

{:ok, body, conn} = Plug.Conn.read_body(conn)
conn

and the problem seems to have gone away. I haven’t done any further digging into exactly why this is the case (the exact difference between the conn passed into Plug.Conn.read_body and the one that comes out, and how this affects the pipeline downstream). If anyone has insight, I’m curious to know.
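
For anyone landing here later, the fix in context looks roughly like this (a sketch, assuming stripity-stripe’s Stripe.Webhook.construct_event/3 for signature verification; the module name and the config lookup for the secret are illustrative):

defmodule MyAppWeb.Plugs.StripeWebhook do
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    signature =
      conn
      |> get_req_header("stripe-signature")
      |> List.first()

    # keep using the conn returned by read_body/1 from here on
    {:ok, body, conn} = read_body(conn)

    secret = Application.fetch_env!(:my_app, :stripe_webhook_secret)

    case Stripe.Webhook.construct_event(body, signature, secret) do
      {:ok, event} ->
        assign(conn, :stripe_event, event)

      {:error, _reason} ->
        conn
        |> send_resp(400, "invalid signature")
        |> halt()
    end
  end
end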