Peep - Efficient TelemetryMetrics reporter supporting Prometheus and StatsD

Peep is a new TelemetryMetrics reporter that supports both StatsD (including DogStatsD) and Prometheus.

While load testing a new WebSocket-based API gateway written in Elixir, I encountered performance issues with TelemetryMetricsPrometheus.Core and TelemetryMetricsStatsd. This prompted me to write Peep, which makes different choices about storing and sending TelemetryMetrics data.

  1. Instead of sampling or on-demand aggregation, Peep uses histograms (backed by :ets.update_counter/*) to store distributions, copying the approach taken by DDSketch (a minimal illustration follows this list).
  2. Instead of sending StatsD packets for each telemetry event, StatsD data is periodically sent in a small(er) number of large(r) packets.
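
To illustrate the first point, here is a minimal sketch (not Peep's actual implementation; the table layout and the bucketing function are stand-ins) of how a histogram bucket can be incremented atomically with :ets.update_counter/4:

table = :ets.new(:histogram_sketch, [:set, :public, write_concurrency: true])

record = fn value ->
  # Stand-in bucketing: the bucket index grows logarithmically with the value.
  bucket = value |> :math.log2() |> ceil()

  # Atomically bump the count (tuple position 2) for this bucket, inserting
  # {bucket, 0} first if the bucket has never been seen.
  :ets.update_counter(table, bucket, {2, 1}, {bucket, 0})
end

record.(3)
record.(250)
record.(250)

Because :ets.update_counter/4 is atomic, concurrent telemetry handlers can record values without funnelling them through a single process.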

This library is currently running in production, in a service handling >1 million requests per minute. With a moderate number of metrics defined, the service emits StatsD data at a rate of 4KiB/s, with no observed packet drops (we use Unix Domain Sockets to send DogStatsD lines to Datadog agents, so it's possible for :gen_udp to return :eagain when attempting to send packets).

Here's an image showing a drop in CPU use after replacing TelemetryMetricsPrometheus.Core and TelemetryMetricsStatsd with Peep:

Here's another dashboard for the same period of time, showing a slight (but not unwelcome!) drop in memory usage:

Feedback and contributions welcome!

22 Likes

Peep v2.0.0 has been released!

This version fixes an issue with exposing data for Prometheus. If you use Peep with Prometheus, you should upgrade to this version.

Changes

  • Fixed an issue with Prometheus exposition where zero-valued bucket time series were not shown
  • Added support for custom bucket boundaries. As part of this change, the distribution_bucket_variability option was removed.

Custom bucket boundaries

With Peep 2.0.0, the default log-linear bucketing strategy becomes an implementation of the new Peep.Buckets behaviour.

You can use the Peep.Buckets.Custom module to define your own bucket boundaries. This compiles to efficient pattern matching with function heads, which ought to scale better than traversing a list.

Hereā€™s an example of using Peep.Buckets.Custom:

defmodule MyBuckets do
  use Peep.Buckets.Custom, buckets: [
    1, 2, 5,
    10, 20, 50,
    100, 200, 500,
    1_000, 2_000, 5_000,
    10_000, 20_000, 50_000,
    100_000, 200_000, 500_000,
    1_000_000, 2_000_000, 5_000_000
  ]
end

distribution("my.dist.with.custom.buckets", [reporter_options: [peep_bucket_calculator: MyBuckets]])

If you want something more involved, you can implement the callbacks for the Peep.Buckets behaviour. For an example, look at Peep.Buckets.Exponential.

4 Likes

Thank you for this project. It allowed me to give Supavisor a ~30x latency improvement (measured by pgbench) over using telemetry_metrics_prometheus_core. I have also prepared a PR for prom_ex so it can use Peep as a metrics store.

6 Likes

That's awesome!

I'm curious about how much impact it would have on my application, but I don't think I can afford to test it right now, as it would be a pretty big change (we have a lot of Telemetry.Metrics.summary/2 calls, which Peep doesn't support).

Did you measure the impact of Telemetry.Metrics beforehand, with something like fprof? If you could share, it would help me a lot :slight_smile:

Peep v3.0.0 has been released!

This version introduces a slight change in how Peep is configured (replacing keyword lists with maps in the global_tags option) that is not backwards compatible. Upgrading from v2.x.y will require some small changes.
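
If you pass global_tags, the upgrade looks roughly like this (the surrounding child-spec options and tag values are illustrative, not taken from a real config):

# v2.x.y: global_tags as a keyword list
{Peep, name: :my_peep, metrics: my_metrics(), global_tags: [region: "eu-west-1"]}

# v3.0.0: global_tags as a map
{Peep, name: :my_peep, metrics: my_metrics(), global_tags: %{region: "eu-west-1"}}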

Thanks @hauleth for your contributions!

Changes

  • Added the Apache 2.0 licence text to the repo
  • Switched to maps for storing metric tags
  • Precomputed :math.log(gamma) for a slight performance boost in Peep.Buckets.Exponential
3 Likes

Peep v3.1.0 has been released!

Thanks to another contribution by @hauleth, it is now possible to override the type of a 'sum' or 'last value' metric in the Prometheus exposition.

For example, if you want to track socket statistics, which are often pre-summed, you could store the data in Peep with last_value/2 but report it as a counter-type metric in the Prometheus output.
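
In a metric definition, that could look something like the sketch below; note that the :prometheus_type reporter option key is an assumption on my part, so check the v3.1.0 docs for the exact name:

# Stored as a last_value, but exposed to Prometheus as a counter.
# The metric name is made up, and :prometheus_type is an assumed option key.
Telemetry.Metrics.last_value("vm.socket.send_oct",
  reporter_options: [prometheus_type: :counter]
)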

3 Likes

I haven't made many announcements here in a while, but I've published a few new Peep versions. Thank you to @aloukissas and @mjm for your contributions!

Peep v3.2.0

@aloukissas added Peep.Plug, an easy way to expose Peep metrics.

Peep v3.2.1

After running into an issue where Peep received unexpected messages while sending StatsD data via Unix Domain Sockets, @mjm changed Peep processes to ignore unexpected messages and to ignore the shutdown reason when terminating.

Peep v3.3.0

At my employer's request, I introduced a new storage engine for Peep metrics, :striped, which trades increased memory usage for reduced lock contention. Rather than storing all metrics in a single ETS table, :striped uses one ETS table for each scheduler thread.

I don't exactly recommend that users switch to this storage method unless they are noticing lock contention, which may happen when handling very large volumes of telemetry executions.
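
Opting in is a small change to the Peep child spec. The sketch below assumes the engine is selected with a :storage option; double-check the option name against the v3.3.0 docs:

children = [
  # :striped trades extra ETS tables (one per scheduler) for less contention.
  {Peep, name: :my_peep, metrics: my_metrics(), storage: :striped}
]

Supervisor.start_link(children, strategy: :one_for_one)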

Here's some :lcnt output from a bidder service for RTB ads:

Before:

                    lock    id   #tries  #collisions  collisions [%]  time [us]  duration [%]
                   -----   ---  ------- ------------ --------------- ---------- -------------
            db_hash_slot  1856 26338259       668728          2.5390    5685223       55.1090 <- this is Peep!
               run_queue    46 25883442       343027          1.3253    1515654       14.6918
                  db_tab   130 47329791           79          0.0002    1306148       12.6610
           process_table     1  1209254        50187          4.1502    1174152       11.3815
            drv_ev_state   128  3566618        20546          0.5761     266746        2.5857
               proc_msgq 44073 13122428        28383          0.2163      99761        0.9670
          alcu_allocator    10   277563         1998          0.7198      46190        0.4477
               proc_main 44073 12147699        81435          0.6704      21809        0.2114
                pix_lock  1024      669           13          1.9432       7731        0.0749
               port_lock 43275  4875630          255          0.0052       3719        0.0360
   proc_sig_queue_buffer   128   905902         1603          0.1770       3332        0.0323
 erl_db_catree_base_node   122   532594          269          0.0505       2369        0.0230
         port_sched_lock 43277  2569013          364          0.0142        469        0.0045
             proc_status 44073  8947147           52          0.0006        148        0.0014
                proc_btm 44073  1327151           10          0.0008          3        0.0000

After:

                    lock    id   #tries  #collisions  collisions [%]  time [us]  duration [%]
                   -----   ---  ------- ------------ --------------- ---------- -------------
               run_queue    46 24209562       727003          3.0030    1555422       15.3873
                  db_tab   170 39615183           98          0.0002    1052218       10.4093
            drv_ev_state   128  2919400        13453          0.4608     115025        1.1379
               proc_main 15318  9953578        78592          0.7896      36449        0.3606
               port_lock 14753  4085350          592          0.0145       8268        0.0818
          alcu_allocator    10    87449          253          0.2893       4058        0.0401
               proc_msgq 15318 10917596         6365          0.0583       2867        0.0284
            db_hash_slot  4608 22033063         2525          0.0115       2476        0.0245 <- this is Peep!
 erl_db_catree_base_node   219   454524          296          0.0651       1239        0.0123
         port_sched_lock 14755  2166480          472          0.0218        767        0.0076
             proc_status 15318  6785744          133          0.0020        415        0.0041
6 Likes

Sorry if this is the wrong place to put this, but I feel like I'm doing something dumb when setting this up.

Whenever I have the plug in my endpoint.ex file for a Phoenix project before my router, only the metrics route matches and all the other routes return 404. If I put it after my router, all my routes work, but I get a 404 for /metrics. I feel like I have the worker set up fine; it's the plug setup where I'm missing something.

Hey! Not at all the wrong place to post. You found a bug in Peep :slight_smile:

When adding Peep.Plug to a Phoenix project, I find myself using the following:

forward("/metrics", to: Peep.Plug, worker: my_peep_worker)

Note that, for the time being, you may need to specify the path twice if you want to use a path other than "/metrics":

forward("/my-metrics", to: Peep.Plug, path: "/my-metrics", worker: my_peep_worker)

That should address your immediate issue.

I'll improve the documentation in Peep.Plug to reflect this, and I might change some of the code in there, such as no longer responding with 404 when the URL path does not match the metrics endpoint. While the current code is easier to test and makes sense when serving the metrics endpoint on a different port, it is a confusing default.

1 Like

Hey, thanks for the fast reply! I'll give that a go when I can, but it makes sense to me.