Help with telemetry config (aggregation)

I know very little about the Elixir telemetry libraries. I've started working on an existing project, and the cardinality of the metrics being scraped (by Prometheus) is way too high, resulting in really long scrape times and timeouts. The current config looks like:

[
  Telemetry.Metrics.counter("observability.http.requests.start.count",
    event_name: [:observability, :http, :start],
    description: "The number of requests initiated against the given service",
    tags: [:route, :method, :remote_ip],
    tag_values: %{
      route: ...,
      method: ...,
      remote_ip: ...
    }
  ),
  Telemetry.Metrics.counter("observability.http.requests.stop.count",
    event_name: [:observability, :http, :stop],
    description: "The number of requests finished against the given service",
    tags: [:route, :method, :remote_ip, :status],
    tag_values: %{
      route: ...,
      method: ...,
      remote_ip: ...,
      status: ...
    }
  ),
  Telemetry.Metrics.distribution("observability.http.requests.duration",
    event_name: [:observability, :http, :stop],
    description: "The distribution of runtime of a request",
    reporter_options: [buckets: Time.buckets(opts)],
    unit: Time.unit(opts),
    tags: [:route, :method, :remote_ip, :status],
    tag_values: %{
      route: ...,
      method: ...,
      remote_ip: ...,
      status: ...
    }
  )
]

For the durations, I don't need timing by the request's source IP, just by path. I suspect whoever wrote this intended IP + path to uniquely identify a request, so that each stop could be matched to its start. Even that is probably not quite right, since a client can send requests concurrently; a request ID would serve that intent better. Anyway, that's the crux of my question: stops have to be correlated to starts, but I don't want a metric for each individual request. So I think there needs to be a re-aggregation at scrape time.
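To make that concrete, this is roughly the duration metric I'd like to end up with. It's an untested sketch with some assumptions: that the stop event already carries a :duration measurement (the metric name suggests it does), that its metadata has :route, :method and :status keys so I can leave out tag_values, and that the reporter accepts buckets through reporter_options as in the current config. The bucket boundaries and unit are placeholders, not the values Time.buckets(opts) and Time.unit(opts) actually return.

```elixir
# Sketch only: needs the telemetry_metrics package.
import Telemetry.Metrics

duration_by_route =
  distribution("observability.http.requests.duration",
    event_name: [:observability, :http, :stop],
    description: "The distribution of runtime of a request",
    # Placeholder buckets; the real config uses Time.buckets(opts).
    reporter_options: [buckets: [10, 50, 100, 500, 1_000]],
    # Placeholder unit; the real config uses Time.unit(opts).
    unit: {:native, :millisecond},
    # :remote_ip left out, so there is one series per route/method/status
    # instead of one per client.
    tags: [:route, :method, :status]
  )
```

What I can't tell is whether simply leaving :remote_ip out of tags like this is enough, or whether that breaks whatever ties a stop back to its start.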

For the start and stop counters, the intent might have been to capture some notion of requests that don't complete, but I think I'd prefer to re-aggregate with the path dropped, so that they give some idea of how much load comes from each client.
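In other words, something like the following for the start counter (and the same idea for stop). Again, this is an untested sketch. It assumes the event metadata already has a :remote_ip key whose value can be rendered as a label (for example a string), which is why tag_values is left out.

```elixir
# Sketch only: needs the telemetry_metrics package.
import Telemetry.Metrics

requests_by_client =
  counter("observability.http.requests.start.count",
    event_name: [:observability, :http, :start],
    description: "The number of requests initiated against the given service",
    # Only the client is kept as a tag; :route and :method are dropped.
    tags: [:remote_ip]
  )
```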

So do I need to create a custom reporter? Or is there an option I’m missing that would take care of this for me? Seems like tracking durations of requests by path would be a pretty common need…
