Peep - Efficient TelemetryMetrics reporter supporting Prometheus and StatsD

Peep is a new TelemetryMetrics reporter that supports both StatsD (including Dogstatsd) and Prometheus.

While load testing a new Websocket-based API gateway written in Elixir, I encountered performance issues with TelemetryMetricsPrometheus.Core and TelemetryMetricsStatsd. This prompted me to write Peep, which makes different choices about storing and sending TelemetryMetrics data.

  1. Instead of sampling or on-demand aggregation, Peep uses histograms (backed by :ets.update_counter/*) to store distributions, copying the approach taken by DDSketch (see the example after this list).
  2. Instead of sending StatsD packets for each telemetry event, StatsD data is periodically sent in a small(er) number of large(r) packets.
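
To make the first point concrete, here’s a minimal, self-contained sketch (not Peep’s actual code) of how a histogram bucket counter can be bumped atomically with :ets.update_counter/4:

table = :ets.new(:my_histogram, [:set, :public, write_concurrency: true])

bump = fn bucket_upper_bound ->
  # Inserts {bucket_upper_bound, 0} on first use, then increments the count atomically.
  :ets.update_counter(table, bucket_upper_bound, {2, 1}, {bucket_upper_bound, 0})
end

bump.(100)
bump.(100)
bump.(500)

:ets.tab2list(table)
#=> [{100, 2}, {500, 1}]  # order may vary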

This library is currently running in production, in a service handling >1 million requests per minute. With a moderate number of metrics defined, the service emits StatsD data at a rate of 4KiB/s, with no observed packet drops (we use Unix Domain Sockets to send Dogstatsd lines to Datadog agents, so it’s possible for :gen_udp to return :eagain when attempting to send packets).

Here’s an image showing a drop in CPU use after replacing TelemetryMetricsPrometheus.Core and TelemetryMetricsStatsd with Peep:

Here’s another dashboard for the same period of time, showing a slight (but not unwelcome!) drop in memory usage:

Feedback and contributions welcome!

23 Likes

Peep v2.0.0 has been released!

This version fixes an issue with exposing data for Prometheus. If you use Peep with Prometheus, you should upgrade to this version.

Changes

  • Fixes an issue with Prometheus exposition where zero-valued bucket time series were not shown
  • Adds support for custom bucket boundaries. As part of this change, the distribution_bucket_variability option was removed.

Custom bucket boundaries

With Peep 2.0.0, the default log-linear bucketing strategy becomes an implementation of the new Peep.Buckets behaviour.

You can use the Peep.Buckets.Custom module to define your own bucket boundaries. This compiles to efficient pattern matching with function heads, which ought to scale better than traversing a list.

Here’s an example of using Peep.Buckets.Custom:

defmodule MyBuckets do
  use Peep.Buckets.Custom, buckets: [
    1, 2, 5,
    10, 20, 50,
    100, 200, 500,
    1_000, 2_000, 5_000,
    10_000, 20_000, 50_000,
    100_000, 200_000, 500_000,
    1_000_000, 2_000_000, 5_000_000
  ]
end

distribution("my.dist.with.custom.buckets", [reporter_options: [peep_bucket_calculator: MyBuckets]])

If you want something more involved, you can implement the callbacks for the Peep.Buckets behaviour. For an example, look at Peep.Buckets.Exponential.

4 Likes

Thank you for that project. It allowed me to give Supavisor a ~30x improvement in latency (measured by pgbench) over using telemetry_metrics_prometheus_core. I have also prepared a PR for prom_ex so that it can use Peep as its metrics store.

6 Likes

That’s awesome!

I’m curious how much impact it would have on my application, but I don’t think I can afford to test it right now, as it would be a pretty big change (we have a lot of Telemetry.Metrics.summary/2 metrics, which Peep doesn’t support).

Did you measure the impact of Telemetry.Metrics beforehand, with something like fprof? If you could share, it would help me a lot 🙂

Peep v3.0.0 has been released!

This version introduces a slight change in how Peep is configured (replacing keyword lists with maps in the global_tags option) that is not backwards compatible. Upgrading from v2.x.y will require making some small changes.

Thanks @hauleth for your contributions!

Changes

  • Added the Apache 2.0 licence text to the repo
  • Use maps for storing metrics tags
  • Precompute :math.log(gamma) for a slight performance boost in Peep.Buckets.Exponential
3 Likes

Peep v3.1.0 has been released!

Thanks to another contribution by @hauleth, it is now possible to override the type of a ‘sum’ or ‘last value’ metric in the Prometheus exposition.

For example, if you want to track socket statistics, which are often pre-summed, you could store the data in Peep with last_value/2, but report it as a counter-type metric in the Prometheus output.
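
A rough sketch of what that could look like (the reporter option key :prometheus_type is my assumption here, so check Peep’s docs for the exact name):

import Telemetry.Metrics

# Store pre-summed socket stats as a last_value, but expose them to
# Prometheus as a counter. The :prometheus_type key is assumed, not verified.
last_value("my.socket.recv_oct",
  reporter_options: [prometheus_type: :counter]
)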

3 Likes

I haven’t made many announcements here in a while, but I’ve published a few new Peep versions. Thank you to @aloukissas and @mjm for your contributions!

Peep v3.2.0

@aloukissas added Peep.Plug, an easy way to expose Peep metrics.

Peep v3.2.1

After encountering an issue with Peep receiving unexpected messages when sending StatsD data via Unix Domain Sockets, @mjm changed Peep processes to ignore unexpected messages and to ignore the shutdown reason when terminating.

Peep v3.3.0

At my employer’s request, I introduced a new storage engine for Peep metrics, :striped, which trades increased memory usage for reduced lock contention. Rather than storing all metrics in a single ETS table, :striped uses one ETS table per scheduler thread.

I don’t recommend switching to this storage method unless you are seeing lock contention, which may happen when handling many thousands of telemetry executions.
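
For reference, opting into the striped engine is a small configuration change. A minimal sketch, assuming Peep is started in your supervision tree with name and metrics options and that the option key is :storage (check the docs for the exact spelling):

children = [
  {Peep,
   name: :my_peep,
   metrics: MyApp.Telemetry.metrics(),
   # Assumed option key; uses one ETS table per scheduler thread.
   storage: :striped}
]

Supervisor.start_link(children, strategy: :one_for_one)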

Here’s some :lcnt output from a bidder service for RTB ads:

Before:

                    lock    id   #tries  #collisions  collisions [%]  time [us]  duration [%]
                   -----   ---  ------- ------------ --------------- ---------- -------------
            db_hash_slot  1856 26338259       668728          2.5390    5685223       55.1090 <- this is Peep!
               run_queue    46 25883442       343027          1.3253    1515654       14.6918
                  db_tab   130 47329791           79          0.0002    1306148       12.6610
           process_table     1  1209254        50187          4.1502    1174152       11.3815
            drv_ev_state   128  3566618        20546          0.5761     266746        2.5857
               proc_msgq 44073 13122428        28383          0.2163      99761        0.9670
          alcu_allocator    10   277563         1998          0.7198      46190        0.4477
               proc_main 44073 12147699        81435          0.6704      21809        0.2114
                pix_lock  1024      669           13          1.9432       7731        0.0749
               port_lock 43275  4875630          255          0.0052       3719        0.0360
   proc_sig_queue_buffer   128   905902         1603          0.1770       3332        0.0323
 erl_db_catree_base_node   122   532594          269          0.0505       2369        0.0230
         port_sched_lock 43277  2569013          364          0.0142        469        0.0045
             proc_status 44073  8947147           52          0.0006        148        0.0014
                proc_btm 44073  1327151           10          0.0008          3        0.0000

After:

                    lock    id   #tries  #collisions  collisions [%]  time [us]  duration [%]
                   -----   ---  ------- ------------ --------------- ---------- -------------
               run_queue    46 24209562       727003          3.0030    1555422       15.3873
                  db_tab   170 39615183           98          0.0002    1052218       10.4093
            drv_ev_state   128  2919400        13453          0.4608     115025        1.1379
               proc_main 15318  9953578        78592          0.7896      36449        0.3606
               port_lock 14753  4085350          592          0.0145       8268        0.0818
          alcu_allocator    10    87449          253          0.2893       4058        0.0401
               proc_msgq 15318 10917596         6365          0.0583       2867        0.0284
            db_hash_slot  4608 22033063         2525          0.0115       2476        0.0245 <- this is Peep!
 erl_db_catree_base_node   219   454524          296          0.0651       1239        0.0123
         port_sched_lock 14755  2166480          472          0.0218        767        0.0076
             proc_status 15318  6785744          133          0.0020        415        0.0041
6 Likes

Sorry if this is the wrong place to put this but I feel like I’m doing something dumb when setting this up.

Whenever I have the plug in my endpoint.ex file for a Phoenix project before my router, only the metrics route matches and all the other routes return 404. If I put it after my router, all my routes work but I get a 404 for /metrics. I think I have the worker and everything else set up fine, but I feel like I’m missing something when setting up the plug.

Hey! Not at all the wrong place to post. You found a bug in Peep 🙂

When adding Peep.Plug to a Phoenix project, I find myself using the following:

forward("/metrics", to: Peep.Plug, worker: my_peep_worker)

Note that, for the time being, you may need to specify the path twice if you want to use a path other than “/metrics”:

forward("/my-metrics", to: Peep.Plug, path: "/my-metrics", worker: my_peep_worker)

That should address your immediate issue.

I’ll improve the documentation in Peep.Plug to reflect this, and I might change some of the code in there, such as not responding with 404 when the URL path does not match the metrics endpoint. While that code is easier to test, and makes sense when serving the metrics endpoint on a different port, the default behavior is confusing.

1 Like

Hey, thanks for the fast reply! I’ll give that a go when I can, but it makes sense to me.

I missed posting a few minor releases of Peep in the past few months, but today, Peep 4.0.0 was released.

Upgrading from 3.x should be straightforward, as the only backwards-incompatible change made is that you can no longer store non-integer values in last-value metrics. You may have somehow gotten away with it in earlier versions of Peep, but it won’t work anymore.

Here’s a changelog of releases from Peep 3.3.1 to 4.0.0:

v3.3.1

  • I added :on_unmatched_path to Peep.Plug, allowing users to decide the behaviour when Peep.Plug is called with an unexpected path (see the sketch after this list). In short, you probably want :continue if serving Peep metrics from the same HTTP server as your application (e.g. in a Phoenix-based service), and :halt when serving Peep metrics from a separate listener (e.g. when you want to serve metrics on a different port).
  • @yordisprieto contributed some documentation
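
As a quick illustration (reusing the forward pattern from earlier in this thread; my_peep_worker is a placeholder):

# Same Phoenix endpoint as the rest of the app: let non-metrics requests
# fall through to the router instead of answering 404.
forward("/metrics", to: Peep.Plug, worker: my_peep_worker, on_unmatched_path: :continue)

# Dedicated listener on its own port: answer 404 for anything else.
# forward("/metrics", to: Peep.Plug, worker: my_peep_worker, on_unmatched_path: :halt)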

v3.4.0

  • @josevalim implemented several optimizations to the Prometheus export code, making it more efficient.
  • @yordisprieto added a compile-time :bucket_calculator option to Peep, making it possible to globally specify a Peep.Buckets implementation for all distribution metrics.
  • @hst337 implemented some optimizations, most impressively a change to Peep.Buckets.Custom that switches from linear-ish function head matching to logarithmic binary search (see the sketch after this list). For small numbers of buckets, the performance is roughly equivalent, but for large numbers of buckets it is much improved. This PR inspired me to contribute a similar optimization to a Cassandra client library that we use in production, for a nice performance improvement!
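
To illustrate the binary search idea (a generic sketch, not Peep’s code): find the index of the first boundary greater than a sample in O(log n) comparisons over a sorted tuple of boundaries.

defmodule BucketSearch do
  @boundaries {1, 2, 5, 10, 20, 50, 100}

  def bucket_index(value), do: search(value, 0, tuple_size(@boundaries))

  # Invariant: the answer lies in lo..hi; hi == tuple_size(@boundaries) means the overflow bucket.
  defp search(_value, lo, hi) when lo >= hi, do: lo

  defp search(value, lo, hi) do
    mid = div(lo + hi, 2)

    if value < elem(@boundaries, mid) do
      search(value, lo, mid)
    else
      search(value, mid + 1, hi)
    end
  end
end

BucketSearch.bucket_index(3)   #=> 2 (first boundary greater than 3 is 5)
BucketSearch.bucket_index(999) #=> 7 (overflow bucket)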

v3.4.1

  • @scudelletti fixed an issue where Plug was required at compile time, making Plug a truly optional dependency for Peep.

v3.4.2

  • @hauleth fixed a bug around quote-escaping labels in Peep’s Prometheus export code.

v3.5.0

  • At @hauleth’s request, I implemented Peep.prune_tags/2, which deletes metrics based on matching tag values. This is useful for metrics with unavoidable high cardinality, but may not be all that useful for more typical users of Peep.

v4.0.0

  • Introduces Peep.Codegen, an internal module that compiles a module on Peep startup with a handle_event/4 function that replaces Peep.EventHandler.handle_event/4. This avoids the overhead of copying data out of :telemetry’s ETS table into a calling process’s heap, since that data can be stored in the compiled module’s literal pool instead. Further, Peep now gives each metric an integer id and uses those ids instead of Telemetry.Metrics structs when looking up data in ETS tables, further reducing unnecessary copying of terms.
  • Peep now drops non-number measurements for last-value metrics. Thanks to @akoutmos for originally pointing out this (mis)behaviour.
  • Peep now automatically adds global tags to metrics, rather than forcing users to repeatedly specify the global tag keys in :tags for every metric. Thanks to @yordisprieto for pointing this out, and pointing out the fix.

Peep v4.0.0 appears to be quite a bit faster than v3.x. Here’s a chart showing p50, p95, and p99 latency before and after deployment of Peep 4.0.0 in an application that uses Peep heavily:

Thanks to those who contributed to Peep these past few months. With your help, Peep is now better than ever!

5 Likes