Telemetry/PromEx metrics in a cluster

Sorry if this is a daft question, but I can’t seem to find any documentation that answers it …

We’re in the process of migrating a Phoenix application from a single node to an OTP cluster. Almost everything is working nicely, apart from some custom metrics we’re using to track usage of various parts of the application.

We have a simple PromEx plugin registered in our PromEx setup, which handles telemetry events that we send when certain actions happen. Here’s a slightly simplified version:

defmodule MyApp.PromEx.StatsMetrics do
  use PromEx.Plugin

  @impl true
  def event_metrics(_opts) do
    Event.build(
      :my_app_stats_event_metrics,
      [
        sum(
          [:my_app, :stats, :page_visits],
          tags: [:tool, :role],
          description: "The count of a tool being visited by someone in a role"
        )
      ]
    )
  end
end
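
The plugin is registered roughly like this (simplified, assuming an otherwise standard PromEx setup):

defmodule MyApp.PromEx do
  use PromEx, otp_app: :my_app

  @impl true
  def plugins do
    [
      # ... the built-in plugins we use ...
      MyApp.PromEx.StatsMetrics
    ]
  end
end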

… with events emitted at appropriate points:

:telemetry.execute([:my_app, :stats], %{page_visits: 1}, %{
  tool: :some_tool,
  role: user.role.name
})

The problem is that we now have one instance of the plugin running on each node in the cluster, so stats are recorded separately for each node, and when Prometheus scrapes the numbers it sees the counts varying wildly as the load balancer routes it to a random node each time.

What we’d like to end up with is either a single instance of PromEx (or just this plugin?) on the cluster (e.g. using Highlander), or some way to guarantee that the events are broadcast (using Phoenix PubSub, maybe?) so that all instances of PromEx show the same values (but then what happens when a node is temporarily taken out of the cluster for an application upgrade?).
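
For the first option, I imagine something like this in the application’s supervision tree (just a sketch – Highlander would be a new dependency, and since :telemetry handlers are node-local we’d presumably still need to forward events to whichever node is currently running PromEx):

# In MyApp.Application – sketch of the "single PromEx instance" idea.
# Highlander registers the child globally, so only one node in the
# cluster runs it at a time; another node takes over if it goes down.
children = [
  MyApp.Repo,
  MyAppWeb.Endpoint,
  # instead of plain `MyApp.PromEx`:
  {Highlander, MyApp.PromEx}
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)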

It feels like there’s probably a simple way of achieving this and I’m missing something obvious – any ideas?

Thanks!


That is a bad idea. What if the node storing all the metrics goes down? You would lose everything you have.


In Supavisor we encountered a similar problem, and our solution was quite different:

  • each node collects its own metrics
  • at export time we traverse the metrics from all nodes and merge them (sketched below)
  • then we export the merged result

Code:

This requires Peep as a storage backend (which is much more performant in my experience).
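
Roughly, the merge step looks like this (a simplified sketch, not the actual Supavisor code; local_counters/0 stands in for however you read the node-local values out of your storage backend):

defmodule MyApp.PromEx.ClusterMetrics do
  # Sketch of the merge-at-export idea: ask every node for its current
  # counters and sum them before rendering the exposition.
  def merged_counters do
    nodes = [Node.self() | Node.list()]

    nodes
    |> :erpc.multicall(__MODULE__, :local_counters, [], 5_000)
    |> Enum.flat_map(fn
      # ignore nodes that are down or time out
      {:ok, counters} -> [counters]
      _error -> []
    end)
    |> Enum.reduce(%{}, fn counters, acc ->
      # sum values for the same metric/tag combination across nodes
      Map.merge(acc, counters, fn _key, a, b -> a + b end)
    end)
  end

  def local_counters do
    # placeholder: return %{{metric_name, tags} => value} from the local store
    %{}
  end
end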

Thanks – I’ll give this approach some thought.

As our use case was fairly simple, I ended up removing PromEx altogether, storing the counts in the database, and generating the metrics page with a simple controller action and Ecto query.
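
For anyone curious, the replacement is roughly this shape (simplified; the page_visits table and its columns here are illustrative):

defmodule MyAppWeb.MetricsController do
  use MyAppWeb, :controller

  import Ecto.Query

  # Hypothetical `page_visits` table with `tool`, `role` and `count` columns.
  def index(conn, _params) do
    rows =
      MyApp.Repo.all(
        from v in "page_visits",
          group_by: [v.tool, v.role],
          select: {v.tool, v.role, sum(v.count)}
      )

    lines =
      Enum.map_join(rows, "\n", fn {tool, role, count} ->
        ~s(my_app_stats_page_visits{tool="#{tool}",role="#{role}"} #{count})
      end)

    text(conn, "# TYPE my_app_stats_page_visits counter\n" <> lines <> "\n")
  end
end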

Can you explain why you did this rather than scraping each node’s individual metrics? What are you getting from merging them in-cluster rather than at the Prometheus/dashboard level?

We expose metrics for individual clients as well, so we would need to gather metrics from all nodes anyway. It also simplifies things because we support self-hosting, which makes operations much easier. And the current implementation is quite a robust solution.
