Sometimes, Ecto (actually, DBConnection) spills an error like this:
DBConnection.ConnectionError: ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 11963ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
1. Ensuring your database is available and that you can connect to it
2. Tracking down slow queries and making sure they are running fast enough
3. Increasing the pool_size (although this increases resource consumption)
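For context, the pool parameters this error is talking about live in the repo config. A sketch, assuming a hypothetical `MyApp.Repo` (the option names are real DBConnection/Ecto options, the values are illustrative):

```elixir
# config/runtime.exs -- illustrative app/repo names; tune values for your load
config :my_app, MyApp.Repo,
  pool_size: 20,          # item 3: more connections, at the cost of more resources
  queue_target: 50,       # ms a checkout may wait before the pool is considered busy
  queue_interval: 1_000   # ms window over which queue_target is evaluated
```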
Item 2 in the list above suggests tracking down slow queries. How do you locate the code that is sending a slow query to the database, in production?
I’m curious how this can be done at scale because, for example, the codebase I am working on is vast, spanning over 1000 modules. There are tons of places in the code that may call the database, and finding the right one is not easy.
I recently noticed that the telemetry produced by Ecto includes stacktrace info, so maybe this could help track down the code that produces slow queries. Has anyone tried that? If so, how do you log/collect/store such stacktraces?
I would first investigate whether this is actually related to slow queries, because that error is network/pool related; there is a different error for query timeouts.
One of the more recent versions of Ecto added a stacktrace: true option that you can set on the repo. This provides a stacktrace for the query in the ecto telemetry handlers.
From there it’s mostly a matter of figuring out how you want to consume that information. The quick and dirty way is to just log any queries that took longer than a chosen threshold.
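A minimal sketch of that quick-and-dirty approach, assuming a hypothetical `MyApp.Repo` with `stacktrace: true` set in its config (so `metadata.stacktrace` is populated) and the default `[:my_app, :repo, :query]` telemetry event; the module name and threshold are made up:

```elixir
# Assumes: config :my_app, MyApp.Repo, stacktrace: true
defmodule SlowQueryLogger do
  require Logger

  @threshold_ms 500

  def attach do
    :telemetry.attach(
      "slow-query-logger",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    # total_time is reported in native time units; convert before comparing
    duration_ms =
      System.convert_time_unit(measurements.total_time, :native, :millisecond)

    if duration_ms > @threshold_ms do
      Logger.warning(
        "slow query (#{duration_ms}ms): #{metadata.query}\n" <>
          Exception.format_stacktrace(metadata.stacktrace || [])
      )
    end
  end
end
```

Call `SlowQueryLogger.attach()` once at application start; anything over the threshold gets logged together with the stacktrace of the caller.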
Assuming this is not the fault of the network as @D4no0 said, my second item on the list would be to inspect the telemetry traces; they should include duration.
Though I’m not sure they’re sent at all in case of timeouts.
You could temporarily override your Repo and have it make a root telemetry span if you can’t find them in your monitoring system.
You do indeed! Here is an example of the metadata arg value in such a case:
%{
  cast_params: nil,
  options: [],
  params: [],
  query: "select pg_sleep(5)",
  repo: Sensetra.Repo,
  result:
    {:error,
     %DBConnection.ConnectionError{
       message: "tcp recv: closed (the connection was closed by the pool, possibly due to a timeout or because the pool has been terminated)",
       severity: :error,
       reason: :error
     }},
  source: nil,
  stacktrace: nil,
  type: :ecto_sql_query
}
Stacktrace is nil in this case because I was running this from iex.
I’ve got this handy function for returning telemetry spans on a particular function:
def trace_ecto(fun) when is_function(fun, 0) do
  this_process = self()
  ref = make_ref()

  # Attach a handler to the query event. When the query runs in the same
  # process that called this function, we "export" those values out to a
  # list for investigation. Handlers are global, though, so we must only
  # `send` when we are in the calling process.
  :telemetry.attach(
    "__help__",
    [:sensetra, :repo, :query],
    fn _, measurements, metadata, _config ->
      if self() == this_process do
        send(this_process, {ref, %{measurements: measurements, metadata: metadata}})
      end
    end,
    %{}
  )

  Repo.transaction(fun)
  :telemetry.detach("__help__")
  do_get_trace_messages(ref)
end

defp do_get_trace_messages(ref) do
  receive do
    {^ref, message} ->
      [message | do_get_trace_messages(ref)]
  after
    0 -> []
  end
end
Basically you call trace_ecto(fn -> your_function_here() end) and it returns all of the Ecto telemetry emitted by that function!
EDIT: Oh, on second thought, I really need to set it up to rescue or something so those timeouts are caught as the actual return value. You still get them from the send, though; they just end up in your mailbox and you have to flush() them out. Will post an updated version shortly.
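One way the body of `trace_ecto/1` could be revised along those lines (a sketch of the idea, not the poster's actual update): rescue so a raise becomes part of the return value, and detach in an `after` so the global handler never leaks:

```elixir
# Replaces the tail of trace_ecto/1 after :telemetry.attach/4;
# Repo and the event name are from the original post.
result =
  try do
    Repo.transaction(fun)
  rescue
    # a query timeout raises; capture it as a value instead
    e -> {:error, e}
  after
    # always detach, even when the transaction raises,
    # so the globally-registered handler doesn't linger
    :telemetry.detach("__help__")
  end

{result, do_get_trace_messages(ref)}
```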
Just keep in mind that collecting a stacktrace in production comes with a performance cost.
Even without a stacktrace, when an exception occurs it should close the span/trace with the error, so you should be able to identify the place by looking at the trace.
It does but we haven’t found it to be an issue in practice. We do something on the order of 1000 ecto queries per second sustained and the stacktrace overhead isn’t noticeable.
For a long time we just logged it but lately we integrated OpenTelemetry with Datadog traces. This was more complicated to set up than I’d like but it’s now pretty cool since we can get traces from all our different services and even correlate traces between services.
EDIT: Although I should add we use sampling on our major data ingestion pipelines to avoid generating too many traces.
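For reference, with the Erlang/Elixir OpenTelemetry SDK, ratio-based sampling can be configured roughly like this (the 10% ratio is just an example value, not what the poster uses):

```elixir
# config/runtime.exs -- sample ~10% of root traces,
# while honoring the sampling decision of a parent span when one exists
config :opentelemetry,
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, 0.10}}}
```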
AppSignal has a standard “Slow queries” feature. It seems to work pretty well. (You do need to attach their Ecto handler in your app’s supervision tree, but I’m very sure their docs cover that.)
Everything but the export will be exactly the same. The question is whether you want to use the OTel Collector or not. Once you’re ready to configure exporting your data, you can check the Datadog OTel docs.
After looking further, it appears that in its current state the OpenTelemetry spec does not provide a means to send a generic stacktrace in a span. It does support sending stacktraces for exceptions, though.
We’re using Google Cloud, which offers the “Cloud Trace” product; it recommends using OpenTelemetry and supports a stackTrace field in a span (see this).
So I think that, potentially, with some patching of opentelemetry-ecto and more reading of the Google Cloud docs, sending spans containing Ecto stacktraces might work. Will need to look more into this.
Why do you need a stack trace? If you’re trying to get a stack trace for every span, that is going to be expensive, not to mention a bit unnecessary if you’re adding your own spans to your application.
Let’s say you do want to create your own spans for some service/context functions in your application; you can add additional attributes like so:
defmodule MyContext do
  import Ecto.Query

  require OpenTelemetry.Tracer
  alias OpenTelemetry.Tracer

  def list_products(%User{} = user) do
    attributes = %{user_id: user.id, user_name: user.name}

    Tracer.with_span "my-context:list_products", %{attributes: attributes} do
      MyApp.Repo.all(from p in Products, where: p.user_id == ^user.id)
    end
  end
end