Peep is a new TelemetryMetrics reporter that supports both StatsD (including DogStatsD) and Prometheus.
While load testing a new Websocket-based API gateway written in Elixir, I encountered performance issues with TelemetryMetricsPrometheus.Core and TelemetryMetricsStatsd. This prompted me to write Peep, which makes different choices about storing and sending TelemetryMetrics data.
Instead of sampling or on-demand aggregation, Peep uses histograms (backed by :ets.update_counter/*) to store distributions, copying the approach taken by DDSketch.
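To illustrate the idea (a conceptual sketch, not Peep's actual table layout), incrementing a histogram bucket boils down to a single atomic counter update:

```elixir
# Conceptual sketch only -- the table name, key shape, and metric names are made
# up, not Peep's internals. Each {metric, bucket} pair gets its own counter, and
# :ets.update_counter/4 increments it atomically, inserting {key, 0} on first use.
tid = :ets.new(:histogram_sketch, [:set, :public, {:write_concurrency, true}])

record = fn metric, bucket_upper_bound ->
  key = {metric, bucket_upper_bound}
  :ets.update_counter(tid, key, {2, 1}, {key, 0})
end

record.("http.request.duration", 250)
record.("http.request.duration", 250)
# :ets.lookup(tid, {"http.request.duration", 250}) #=> [{{"http.request.duration", 250}, 2}]
```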
Instead of sending a StatsD packet for each telemetry event, Peep periodically sends StatsD data in a small(er) number of large(r) packets.
This library is currently running in production, in a service handling >1 million requests per minute. With a moderate number of metrics defined, the service emits StatsD data at a rate of 4KiB/s, with no observed packet drops (we use Unix Domain Sockets to send DogStatsD lines to Datadog agents, so it's possible for :gen_udp to return :eagain when attempting to send packets).
Here's an image showing a drop in CPU use after replacing TelemetryMetricsPrometheus.Core and TelemetryMetricsStatsd with Peep:
This version fixes an issue with exposing data for Prometheus. If you use Peep with Prometheus, you should upgrade to this version.
Changes
Fixes an issue with Prometheus exposition where zero-valued bucket time series were not shown.
Adds support for custom bucket boundaries. As part of this change, the distribution_bucket_variability option has been removed.
Custom bucket boundaries
With Peep 2.0.0, the default log-linear bucketing strategy is now an implementation of the new Peep.Buckets behaviour.
You can use the Peep.Buckets.Custom module to define your own bucket boundaries. This compiles to efficient pattern matching with function heads, which ought to scale better than traversing a list.
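For example (the bucket values and metric name here are made up, and I'm assuming the reporter option is named peep_bucket_calculator — check the Peep docs for the exact key):

```elixir
defmodule MyApp.RequestBuckets do
  # Explicit bucket upper bounds; Peep.Buckets.Custom compiles these into
  # function heads for fast bucket lookup.
  use Peep.Buckets.Custom,
    buckets: [10, 50, 100, 250, 500, 1_000, 5_000]
end

# Attach the custom buckets to a distribution metric (option key assumed):
Telemetry.Metrics.distribution(
  "http.request.duration",
  unit: {:native, :millisecond},
  reporter_options: [peep_bucket_calculator: MyApp.RequestBuckets]
)
```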
Thank you for that project. It allowed me to give Supavisor a ~30x improvement in latency (measured by pgbench) over using telemetry_metrics_prometheus_core. I have also prepared a PR for prom_ex so it can use Peep as a metrics store.
I'm curious how much impact it would have in my application, but I don't think I can afford to test it right now, as it would be a pretty big change (we have a lot of Telemetry.Metrics.summary/2 metrics, which Peep doesn't support).
Did you measure the impact of Telemetry.Metrics beforehand, with something like fprof? If you could share that, it would help me a lot.
This version introduces a small backwards-incompatible change in how Peep is configured: the global_tags option now takes a map instead of a keyword list. Upgrading from v2.x.y will require making some small changes.
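As a sketch of what upgrading looks like (assuming the new shape is a map; the worker name, metrics function, and tag values are made up):

```elixir
# Peep 2.x: global_tags as a keyword list
{Peep, name: MyApp.Peep, metrics: MyApp.Metrics.metrics(), global_tags: [env: "prod", region: "us-east-1"]}

# Peep 3.0: global_tags as a map
{Peep, name: MyApp.Peep, metrics: MyApp.Metrics.metrics(), global_tags: %{env: "prod", region: "us-east-1"}}
```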
Thanks to another contribution by @hauleth, it is now possible to override the type of a "sum" or "last value" metric in the Prometheus exposition.
For example, if you want to track socket statistics, which are often pre-summed, you could store the data in Peep with last_value/2 but report it as a counter-type metric in the Prometheus output.
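A sketch of what that could look like (I'm assuming the override is passed via :reporter_options under a key like :prometheus_type; check the Peep docs for the exact name):

```elixir
# Socket counters such as send_cnt are already cumulative, so store the latest
# reported value but expose it to Prometheus as a counter.
# The metric/event names are made up; the :prometheus_type key is an assumption.
Telemetry.Metrics.last_value(
  "tcp.socket.send_cnt",
  event_name: [:my_app, :socket, :stats],
  measurement: :send_cnt,
  reporter_options: [prometheus_type: :counter]
)
```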
I haven't made many announcements here in a while, but I've published a few new Peep versions. Thank you to @aloukissas and @mjm for your contributions!
After encountering an issue with Peep receiving unexpected messages when sending StatsD data via Unix Domain Sockets, @mjm changed Peep processes to ignore unexpected messages and to ignore the shutdown reason when terminating.
Peep v3.3.0
At my employer's request, I introduced a new storage engine for Peep metrics, :striped, which trades increased memory usage for reduced lock contention. Rather than storing all metrics in a single ETS table, :striped uses one ETS table per scheduler thread.
I don't particularly recommend that users switch to this storage method unless they are noticing lock contention, which may happen when handling a very high rate of telemetry executions.
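Opting in might look roughly like this (a sketch assuming Peep is started under a supervisor and that the option is named :storage; the worker name and metrics function are made up):

```elixir
# Hypothetical supervision tree entry; MyApp.Metrics.metrics/0 is assumed to
# return a list of Telemetry.Metrics definitions.
children = [
  {Peep,
   name: MyApp.Peep,
   metrics: MyApp.Metrics.metrics(),
   storage: :striped}
]

Supervisor.start_link(children, strategy: :one_for_one)
```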
Here's some :lcnt output from a bidder service for RTB ads:
Sorry if this is the wrong place to put this, but I feel like I'm doing something dumb when setting this up.
Whenever I put the plug before my router in the endpoint.ex file of a Phoenix project, only the metrics route matches and all the other routes return 404. If I put it after my router, all my routes work but I get a 404 for /metrics. I think I have the worker and everything else set up fine, but I feel like I'm missing something when setting up the plug.
Hey! Not at all the wrong place to post. You found a bug in Peep.
When adding Peep.Plug to a Phoenix project, I find myself using the following:
forward("/metrics", to: Peep.Plug, worker: my_peep_worker)
Note that, for the time being, you may need to specify the path twice if you want to use a path other than "/metrics":
forward("/my-metrics", to: Peep.Plug, path: "/my-metrics", worker: my_peep_worker)
That should address your immediate issue.
I'll improve the documentation in Peep.Plug to reflect this, and I might change some of the code in there, such as not responding with 404 when the URL path does not match the metrics endpoint. While that code is easier to test, and makes sense when serving the metrics endpoint on a different port, the default behavior is confusing.
I missed posting a few minor releases of Peep in the past few months, but today, Peep 4.0.0 was released.
Upgrading from 3.x should be straightforward, as the only backwards-incompatible change made is that you can no longer store non-integer values in last-value metrics. You may have somehow gotten away with it in earlier versions of Peep, but it won't work anymore.
Here's a changelog of releases from Peep 3.3.1 to 4.0.0:
v3.3.1
I added :on_unmatched_path to Peep.Plug, allowing users to decide the behaviour when Peep.Plug is called with an unexpected path. In short, you probably want :continue if serving Peep metrics from the same HTTP server as your application (e.g. in a Phoenix-based service), and :halt when serving Peep metrics from a separate listener (e.g. when you want to serve metrics on a different port). There's a rough sketch of both setups after the v3.3.1 entries below.
@josevalim implemented several optimizations in the Prometheus export code, making it more efficient.
@yordisprieto added a compile-time :bucket_calculator option to Peep, making it possible to globally specify a Peep.Buckets implementation for all distribution metrics.
@hst337 implemented some optimizations, most impressively a change to Peep.Buckets.Custom that switches from linear-ish function head matching to logarithmic binary search. For small numbers of buckets, the performance is roughly equivalent. However, for large numbers of buckets, performance is much improved. This PR inspired me to contribute a similar optimization to a Cassandra client library that we use in production for a nice performance improvement!
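Here's the rough sketch mentioned above for the :on_unmatched_path option from v3.3.1 (the exact plumbing is my assumption; see the Peep.Plug docs for how the option is actually passed):

```elixir
# In a Phoenix endpoint that also serves the application: let non-metrics
# requests continue on to the router. Module and worker names are made up,
# and the option placement is an assumption.
defmodule MyAppWeb.Endpoint do
  use Phoenix.Endpoint, otp_app: :my_app

  plug Peep.Plug, worker: MyApp.Peep, path: "/metrics", on_unmatched_path: :continue

  plug MyAppWeb.Router
end

# On a dedicated metrics listener, halting with a 404 for anything else is fine:
# {Bandit, plug: {Peep.Plug, worker: MyApp.Peep, on_unmatched_path: :halt}, port: 9568}
```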
v3.4.1
@scudelletti fixed an issue where Plug was required at compile time, making Plug a truly optional dependency for Peep.
v3.4.2
@hauleth fixed a bug around quote-escaping labels in Peep's Prometheus export code.
v3.5.0
At @hauleth's request, I implemented Peep.prune_tags/2, which deletes metrics based on matching tag values. This is useful for metrics with unavoidably high cardinality, but may not be all that useful for more typical users of Peep.
v4.0.0
Introduces Peep.Codegen, an internal module that compiles a module on Peep startup with a handle_event/4 function that replaces Peep.EventHandler.handle_event/4. This avoids the overhead of copying data out of :telemetry's ETS table into the calling process's heap, since that data can live in the compiled module's literal pool instead. Further, Peep now assigns an integer id to each metric and uses those ids instead of Telemetry.Metrics structs when looking up data in ETS tables, further reducing unnecessary copying of terms. (A conceptual sketch of the module-compilation technique follows these entries.)
Peep now drops non-number measurements for last-value metrics. Thanks to @akoutmos for originally pointing out this (mis)behaviour.
Peep now automatically adds global tags to metrics, rather than forcing users to repeatedly specify the global tag keys in :tags for every metric. Thanks to @yordisprieto for pointing this out and suggesting the fix.
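For anyone curious about the general technique behind Peep.Codegen, here's a conceptual illustration (not Peep's actual code): data compiled into a module becomes part of that module's literal pool, so reading it does not copy terms the way reading from ETS does.

```elixir
# Conceptual illustration only -- module and function names are made up.
defmodule CompileLookup do
  # Compile a map into a freshly created module; lookups then read from the
  # module's literal pool instead of copying the map out of ETS on every call.
  def compile(module_name, mapping) when is_map(mapping) do
    body =
      quote do
        def lookup(key), do: Map.get(unquote(Macro.escape(mapping)), key)
      end

    Module.create(module_name, body, Macro.Env.location(__ENV__))
  end
end

# CompileLookup.compile(MetricIds, %{[:http, :request] => 1, [:db, :query] => 2})
# MetricIds.lookup([:http, :request]) #=> 1
```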
Peep v4.0.0 appears to be quite a bit faster than v3.x. Here's a chart showing p50, p95, and p99 latency before and after deployment of Peep 4.0.0 in an application that uses Peep heavily: