Using Telemetry.Metrics counter/2 and :telemetry in general

I’m trying to get started exporting custom app metrics to Prometheus on Fly. This is my first time using any Telemetry packages in Elixir, so any tips or suggestions on how to improve this are welcome.

I’m using the Telemetry.Metrics package (default with Phoenix, I think?) and telemetry_metrics_prometheus to export the metrics. I think the latter was necessary to have an endpoint that Prometheus could scrape.

My application has a custom cache and I am trying to measure the hits vs misses. So here’s how I have MyApplicationWeb.Telemetry.metrics/0:

def metrics() do
  [
    # a ton of default `summary`s that came out of the box,
    last_value("vm.memory.total", unit: {:byte, :kilobyte}), # added for testing
    counter("my_application.some_mod.cache.hit"),
    counter("my_application.some_mod.cache.miss")
  ]
end

My application.ex has my Telemetry module and TelemetryMetricsPrometheus in the children list for its Supervisor:

children = [
  MyApplicationWeb.Telemetry,
  {TelemetryMetricsPrometheus, [metrics: MyApplicationWeb.Telemetry.metrics()]},
  # Rest of children
]
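(For reference, and if I'm reading the TelemetryMetricsPrometheus docs right, the reporter serves the scrape endpoint itself, and the port is configurable; the explicit option below just restates the documented default:)

```elixir
# Assumption from the docs: TelemetryMetricsPrometheus starts its own HTTP
# server, defaulting to port 9568 and serving GET /metrics.
{TelemetryMetricsPrometheus,
 [metrics: MyApplicationWeb.Telemetry.metrics(), port: 9568]}
```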

And the code in the caching module:

def contains?(word) do
  with term when term != nil <- Repo.get_by(Term, word: word) do
    Logger.debug("cache hit")
    :telemetry.execute([:my_application, :some_mod, :cache], %{hit: 1}, %{})
    term.is_valid
  else
    nil ->
      Logger.debug("cache miss")
      is_valid? = Inclusive.contains?(word)
      :telemetry.execute([:my_application, :some_mod, :cache], %{miss: 1}, %{})
      add_to_cache(word, is_valid?)
      is_valid?
  end
end

I have tested this. The metrics show up in the Phoenix LiveDashboard as expected, but when I hit the /metrics endpoint provided by TelemetryMetricsPrometheus, the hit and miss counters always increment in lockstep. I can read the logs and verify there have been no cache hits, but hit always goes up at the same time as miss.

Have I configured these metrics correctly? Am I using :telemetry.execute properly? I didn’t find the docs for these packages to be very clear, sadly.

Here are some examples of the graphs of my metrics in the dashboard (LiveDashboard screenshots omitted).

At the same time, if I scrape the /metrics endpoint, it looks like the counters are being summed:

my_application_some_mod_cache_miss 8
my_application_some_mod_cache_hit 8
vm_memory_total 79445.464

I found the issue. I’ll document it in case anyone else runs into this. It was mostly due to me failing to read and understand the docs for telemetry_metrics.

From the docs:

The first argument to all metric functions is the metric name. Metric name can be provided as a string (e.g. "http.request.stop.duration") or a list of atoms ([:http, :request, :stop, :duration]). The metric name is automatically used to infer the telemetry event and measurement. For example, in the "http.request.stop.duration" example, the source event name is [:http, :request, :stop] and metric values are drawn from the :duration measurement. Like this:

[:http, :request, :stop]       :duration
<----- event name ------>  <-- measurement -->

Source.
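You can see this mapping directly in the struct a metric function returns (a quick sketch; the :event_name and :measurement fields are part of the Telemetry.Metrics metric structs):

```elixir
# The parsed metric definition carries the source event name and measurement.
metric = Telemetry.Metrics.counter("http.request.stop.duration")

metric.event_name  # [:http, :request, :stop]
metric.measurement # :duration
```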

By configuring my counters as:

counter("my_application.some_mod.cache.hit"),
counter("my_application.some_mod.cache.miss")

I had mistakenly configured two counters, both of which were counting the total number of my_application.some_mod.cache events. The end result: every time I did a :telemetry.execute([:my_application, :some_mod, :cache], %{miss: 1}, %{}), I was incrementing both counters. The fix: two distinct event names, each with its own :count measurement:

counter("my_application.some_mod.cache.hit.count"),
counter("my_application.some_mod.cache.miss.count")

which I increment with:

# :miss or :hit below
# the measurement value is meaningless for a counter; it can be anything,
# e.g. %{count: 1} - I think it can even be an empty map.
:telemetry.execute([:my_application, :some_mod, :cache, :miss], %{value: 1}, %{})
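Since each outcome now gets its own event, a small helper could keep the two emit sites consistent (hypothetical, not from my actual code):

```elixir
# Hypothetical helper: emits one event per outcome, so
# counter("my_application.some_mod.cache.hit.count") and
# counter("my_application.some_mod.cache.miss.count") each
# only see their own event.
defp emit_cache_event(outcome) when outcome in [:hit, :miss] do
  :telemetry.execute([:my_application, :some_mod, :cache, outcome], %{count: 1}, %{})
end
```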

I ended up finding this by cloning the prometheus telemetry exporter repo locally and adding some console logs in the telemetry dispatcher. I’m not sure why the incorrect config rendered “correctly” (read: as I expected it to) in the Phoenix LiveDashboard.


This is great. Instant bookmark with tagging for me.
