How to correctly use Spandex for tracing Ecto calls on async processes?

srpablo · July 20, 2022, 5:00am

Hi friends!

My company is using Datadog APM to profile traces for our Phoenix app, and we’re using a combo of spandex, spandex_phoenix, spandex_datadog, and spandex_ecto to send traces. They look great!

The trouble is that we lose spans on our traces when we use Task.async and Task.await for multiple parallel DB calls. In code, it initially looked something like this:

def function_making_db_calls() do
  synchronous_function_calling_db()
  task1 = Task.async(fn -> my_function_calling_the_db() end)
  task2 = Task.async(fn -> other_function_calling_the_db() end)

  # other logic

  %MyStruct{
    data1: Task.await(task1),
    data2: Task.await(task2)
  }
end

Here, the first DB call (outside the Task.async/await) traces correctly, and we can see how long it takes in the context of the toplevel request. But the ones in the other processes don’t trace correctly.

Now! I saw this tip from the Spandex README:

Asynchronous Processes

The current trace_id and span_id can be retrieved and later used (for example, from another process) as follows:
trace_id = Tracer.current_trace_id()
span_id = Tracer.current_span_id()
span_context = %SpanContext{trace_id: trace_id, parent_id: span_id}
Tracer.continue_trace("new_trace", span_context)
New spans can then be logged from there and sent in a separate batch.

So I did something like this, but to no avail:

defp trace_aware_async(span_label, func) do
  trace_id = Tracer.current_trace_id()
  span_id = Tracer.current_span_id()
  span_context = %SpanContext{trace_id: trace_id, parent_id: span_id}
  Task.async(fn ->
    Tracer.continue_trace(span_label, span_context)
    func.()
  end)
end

def function_making_db_calls() do
  synchronous_function_calling_db()
  task1 = trace_aware_async("task1_span", fn -> my_function_calling_the_db() end)
  task2 = trace_aware_async("task2_span", fn -> other_function_calling_the_db() end)

  # other logic

  %MyStruct{
    data1: Task.await(task1),
    data2: Task.await(task2)
  }
end

But still no dice

I started digging into the sources and may continue to do this, and this person seems to have printed logs to find/correct discrepancies in the trace_id, but I’d rather not trouble my teammates with multiple PR reviews and monitor so many deploys. Has anyone else had experience with this, or sees what I’m doing wrong?

Thanks so much!

srpablo · August 1, 2022, 6:46pm

Finishing this off, I was forgetting to put Tracer.finish_trace at the end. Over on this GitHub Issue, Santiago Ferreira posted an example that was helpful if anyone else finds this organically

defp trace_aware_async(span_label, func) do
  current_context = Tracer.current_context()

  Task.async(fn ->
    case current_context do
      {:ok, context} ->
        Tracer.continue_trace(span_label, context)

        try do
          func.()
        after
          Tracer.finish_trace()
        end
      {:error, _reason} ->
        func.()
    end
  end)
end