Tracking down slow queries in Ecto

Sometimes, Ecto (actually, DBConnection) spills an error like this:

DBConnection.ConnectionError: ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 11963ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:

  1. Ensuring your database is available and that you can connect to it
  2. Tracking down slow queries and making sure they are running fast enough
  3. Increasing the pool_size (although this increases resource consumption)

Item 2 in the list above suggests "tracking down slow queries". How do you locate the code that is sending a slow query to the database, in production?

I’m curious how this can be done at scale because, for example, the codebase I am working on is vast, spanning over 1000 modules. There are tons of places in the code that may call the database, and finding the right one is not that easy.

I recently noticed that the telemetry produced by Ecto includes stacktrace info, so maybe this could help track down the code that produces slow queries. Has anyone tried that? If so, how do you log/collect/store such stacktraces?

2 Likes

I would first investigate whether this is actually related to slow queries, because that error is network related; there is a different error for query timeouts.

1 Like

One of the more recent versions of Ecto added a stacktrace: true option that you can set on the repo. This provides a stacktrace for the query in the Ecto telemetry handlers.

From there it’s mostly a matter of figuring out how you want to consume that information. The quick and dirty way is to just log any queries that took longer than a chosen threshold.
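If it helps, here’s a minimal sketch of that quick-and-dirty approach, assuming an OTP app named :my_app, a repo called MyApp.Repo, and a 500ms threshold (all placeholders):

# config/config.exs: turn on stacktrace collection for the repo (Ecto 3.8+)
config :my_app, MyApp.Repo,
  stacktrace: true

defmodule MyApp.SlowQueryLogger do
  require Logger

  @threshold_ms 500

  # Call this once, e.g. from MyApp.Application.start/2
  def attach do
    :telemetry.attach("slow-query-logger", [:my_app, :repo, :query], &__MODULE__.handle_event/4, nil)
  end

  def handle_event(_event, measurements, metadata, _config) do
    total_ms = System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if total_ms > @threshold_ms do
      Logger.warning("""
      Slow query (#{total_ms}ms): #{metadata.query}
      #{if metadata.stacktrace, do: Exception.format_stacktrace(metadata.stacktrace), else: "(no stacktrace captured)"}
      """)
    end
  end
end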

5 Likes

Assuming this is not the fault of the network as @D4no0 said, my second item on the list would be to inspect the telemetry traces; they should include duration.

Though I’m not sure they’re sent at all in case of timeouts. :thinking:

You could temporarily override your Repo and have it make a root telemetry span if you can’t find them in your monitoring system.
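One way to read that suggestion, as a sketch: wrap the suspect Repo calls in an explicit :telemetry.span/3 so there is always an enclosing start/stop (or exception) event for your monitoring tool to pick up, even when the query inside times out. MyApp.Repo and the event prefix below are placeholders:

def traced_all(queryable, opts \\ []) do
  :telemetry.span([:my_app, :traced_repo, :all], %{queryable: inspect(queryable)}, fn ->
    # :telemetry.span/3 expects {result, stop_metadata}; it also emits an
    # [... :exception] event if the call inside raises (e.g. on a timeout).
    {MyApp.Repo.all(queryable, opts), %{}}
  end)
end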

You do indeed get them! Here is an example of the metadata argument in such a case:

%{
  cast_params: nil,
  options: [],
  params: [],
  query: "select pg_sleep(5)",
  repo: Sensetra.Repo,
  result:
    {:error,
     %DBConnection.ConnectionError{
       message: "tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)",
       severity: :error,
       reason: :error
     }},
  source: nil,
  stacktrace: nil,
  type: :ecto_sql_query
}

The stacktrace is nil in this case because I was running this from IEx.

3 Likes

Ha, cute. How do you cause a timeout like this btw?

1 Like

APM in Datadog helps a lot

2 Likes

Repo.query!("select pg_sleep(5)", [], timeout: 1_000)

I’ve got this handy function for returning telemetry spans on a particular function:

def trace_ecto(fun) when is_function(fun, 0) do
  this_process = self()

  ref = make_ref()

  # Here we're attaching a handler to the query event. When the query is performed in the
  # same process that called this function, we want to basically "export" those values out
  # to a list for investigation. Handlers are global though, so we need to only `send` when
  # we are in the calling process.
  :telemetry.attach(
    "__help__",
    [:sensetra, :repo, :query],
    fn _, measurements, metadata, _config ->
      if self() == this_process do
        send(this_process, {ref, %{measurements: measurements, metadata: metadata}})
      end
    end,
    %{}
  )

  Repo.transaction(fun)

  :telemetry.detach("__help__")

  do_get_trace_messages(ref)
end

defp do_get_trace_messages(ref) do
  receive do
    {^ref, message} ->
      [message | do_get_trace_messages(ref)]
  after
    0 -> []
  end
end

Basically you call trace_ecto(fn -> your_function_here() end) and it returns all of the Ecto telemetry emitted by that function!

EDIT: Oh, on second thought, I really need to set this up to rescue or something, to catch those timeouts as the actual return value. You still get them from the send, though; they just end up in your mailbox and you have to flush() them out. Will post an updated version shortly.
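A rough sketch of one way that rescue-based version could look (reusing the do_get_trace_messages/1 helper from above; not the author’s final code):

def trace_ecto(fun) when is_function(fun, 0) do
  this_process = self()
  ref = make_ref()

  :telemetry.attach(
    "__help__",
    [:sensetra, :repo, :query],
    fn _, measurements, metadata, _config ->
      if self() == this_process do
        send(this_process, {ref, %{measurements: measurements, metadata: metadata}})
      end
    end,
    %{}
  )

  result =
    try do
      Repo.transaction(fun)
    rescue
      # DBConnection raises DBConnection.ConnectionError on query timeouts
      exception -> {:error, exception}
    after
      :telemetry.detach("__help__")
    end

  # Return the transaction result alongside every telemetry event we collected
  {result, do_get_trace_messages(ref)}
end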

7 Likes

Thanks for mentioning this! We’re actually using AppSignal, so I opened an issue to see whether support for stacktraces can be implemented there: Collect Ecto stacktraces · Issue #885 · appsignal/appsignal-elixir · GitHub

1 Like

Just keep in mind that capturing a stacktrace in production comes with a performance cost.
Even without stacktraces, when an exception occurs it should close the span/trace with the error, so you should be able to identify the offending code by looking at the trace.

2 Likes

It does, but we haven’t found it to be an issue in practice. We do something on the order of 1,000 Ecto queries per second sustained, and the stacktrace overhead isn’t noticeable.

2 Likes

Out of curiosity, how are you storing the slow queries and stacktraces?

For a long time we just logged it, but lately we integrated OpenTelemetry with Datadog traces. This was more complicated to set up than I’d like, but it’s now pretty cool since we can get traces from all our different services and even correlate traces between services.

EDIT: Although I should add we use sampling on our major data ingestion pipelines to avoid generating too many traces.

2 Likes

AppSignal has a standard “Slow queries” feature. It seems to work pretty well. (You do need to attach their Ecto handler in your app’s supervision tree, but I’m pretty sure their docs cover that.)
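If I remember their docs correctly, the attachment is roughly a one-liner in your application’s start/2, something along these lines (the event prefix depends on your app name; double-check the exact module and event against the current AppSignal docs):

# In MyApp.Application.start/2 (sketch; verify against the AppSignal docs)
:telemetry.attach(
  "appsignal-ecto",
  [:my_app, :repo, :query],
  &Appsignal.Ecto.handle_event/4,
  nil
)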

1 Like

I just looked at the issue you opened about adding support for stacktraces – I get what you’re missing now! That has been mildly annoying.

1 Like

You should consider integrating OpenTelemetry with OpenTelemetryEcto, which will capture spans for your Ecto queries, and then shipping them to a good observability tool like https://www.honeycomb.io/ or https://www.servicenow.com/products/observability.html
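Wiring that up is mostly one call in your application’s start callback; a sketch, where :my_app and MyApp.Repo are placeholders and the package is opentelemetry_ecto on Hex:

# In MyApp.Application
def start(_type, _args) do
  # Subscribe OpentelemetryEcto to the repo's telemetry events so every
  # Ecto query becomes an OpenTelemetry span.
  OpentelemetryEcto.setup([:my_app, :repo])

  children = [
    MyApp.Repo
    # ...the rest of your supervision tree
  ]

  Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
end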

I wrote about this around a year ago, and there’s another good post that goes into more practical detail.

6 Likes

Thank you for these links! Any recommended resources for using DataDog with OpenTelemetry?

Everything but the export will be exactly the same. The question is whether you want to use the OTel Collector or not. Once you’re ready to configure exporting your data, you can check the Datadog OTel docs.
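The exporter side is mostly configuration; a sketch assuming you run a local OTel Collector (or a Datadog Agent with OTLP ingestion enabled) listening on the default OTLP/HTTP port:

# config/runtime.exs (sketch)
config :opentelemetry,
  span_processor: :batch,
  traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "http://localhost:4318"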

2 Likes

After looking further, it appears that in its current state the OpenTelemetry spec does not provide a means to send a generic stacktrace in a span :thinking: It does support sending stacktraces for exceptions, though.

I found this conversation by @tristan: Structured Stacktrace in Span Attributes/Logs · Issue #2839 · open-telemetry/opentelemetry-specification · GitHub. It seems relevant, but appears to be far from complete / accepted into the spec. On top of not being part of the standard, it would have to be adopted by opentelemetry-ecto, and I believe by vendors too.

In opentelemetry-ecto there seems to be an escape hatch, though: the ability to provide a list of additional_attributes: https://github.com/open-telemetry/opentelemetry-erlang-contrib/blob/596fcbc2cbc53d6b31d1126e11a13c1a9ee36c0b/instrumentation/opentelemetry_ecto/lib/opentelemetry_ecto.ex#L122-L137. But it’s not clear to me at the moment how to provide a stacktrace (produced by Ecto) as an item in additional_attributes.

We’re using Google Cloud, which offers the “Cloud Trace” product; it recommends using OpenTelemetry and supports a stackTrace field in a span (see this).

So I think that, with some patching of opentelemetry-ecto and more reading of the Google Cloud docs, sending spans containing Ecto stacktraces might work :thinking: I’ll need to look into this more.

Why do you need a stack trace? If you’re trying to get a stack trace for every span, that is going to be expensive, not to mention a bit unnecessary if you’re adding your own spans to your application.

Let’s say you do want to create your own spans for some service/context functions in your application; you can add additional attributes like so:

defmodule MyContext do
  import Ecto.Query
  require OpenTelemetry.Tracer
  alias OpenTelemetry.Tracer

  def list_products(%User{} = user) do
    attributes = %{user_id: user.id, user_name: user.name}

    Tracer.with_span "my-context:list_products", %{attributes: attributes} do
      MyApp.Repo.all(from p in Products, where: p.user_id == ^user.id)
    end
  end
end

1 Like