I’m thinking about rewriting the metrics subsystem in my project. I want to do it in a unified manner so it can easily be copied to other projects. I’m looking at telemetry and inspecting its reporters. We use InfluxDB, so I have to write a reporter because there isn’t one.
The reporter concept is probably the most complex part of telemetry. I’m comparing the StatsD and Prometheus reporters. The Prometheus one does intermediate aggregation by itself, whereas the StatsD reporter doesn’t. With InfluxDB, we don’t even have different data types as we do with StatsD; all aggregation is done afterwards. Intermediate aggregation looks very efficient because it happens inside the Erlang VM.
So I want to create a summing counter that flushes data to InfluxDB on a regular basis. What is the right way to do this with telemetry? Should I write a reporter with a built-in aggregation feature, or write an external summing counter, flush its values using Telemetry.Poller, and then write a telemetry reporter that writes data to InfluxDB without aggregation? In the second case, the new reporter would treat all types of telemetry metrics mostly the same. In the first case, I need to specify a time interval when starting the reporter.
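For context, the summing-counter part could be sketched roughly like this. Everything here is hypothetical (module name, keys): in a real setup `add/3` would be called from a `:telemetry` handler and `flush/1` from Telemetry.Poller or a timer; this sketch only shows the ETS-based accumulation itself.

```elixir
# A minimal sketch of a summing counter kept in ETS (all names are
# illustrative, not from any existing library).
defmodule SummingCounter do
  # Create a public named ETS table; write_concurrency helps with hot counters.
  def start(table \\ :summing_counter) do
    :ets.new(table, [:named_table, :public, :set, write_concurrency: true])
    table
  end

  # Atomically add `value` to the running sum for `key`,
  # initializing the row to {key, 0} if it doesn't exist yet.
  def add(table, key, value) do
    :ets.update_counter(table, key, {2, value}, {key, 0})
  end

  # Read all sums and reset the table; a periodic flush would send this
  # map to InfluxDB. (A production version would need to consider the
  # race between reading and clearing.)
  def flush(table) do
    sums = :ets.tab2list(table)
    :ets.delete_all_objects(table)
    Map.new(sums)
  end
end
```

The interval question then reduces to who calls `flush/1`: the reporter’s own timer (variant one) or Telemetry.Poller (variant two).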
I guess the solution for doing it within the telemetry ecosystem would be to have the telemetry reporter do the aggregation/sampling. AFAIK that’s the place meant to handle conversion between the reported metrics and what’s actually sent out elsewhere. This can be as simple as just forwarding data, or quite complex, with pre-aggregation or sampling. In the end it depends on how many moving parts one likes to have inside and/or outside of telemetry.
Yes, the proper place would be a reporter if one wants to integrate with Telemetry.Metrics, which provides a nice abstraction over how events should be aggregated.
The InfluxDB reporter mentioned above takes a different approach where all events are pushed directly to InfluxDB, something that @rawkode suggested. The benefit of doing that instead of pre-aggregating is that we don’t need to know in advance which aggregations we’re going to run in order to analyze the data. As always, there are tradeoffs: pushing every event might consume a lot of bandwidth, but aggregating in-process may consume a considerable amount of memory.
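To make “pushing events directly” concrete: a reporter in this style essentially just encodes each telemetry event as an InfluxDB line-protocol point and sends it. A minimal sketch of the encoding step (the module is hypothetical, and it omits escaping, type suffixes, and timestamps):

```elixir
# Sketch: turn a telemetry event into an InfluxDB line-protocol string,
# with no pre-aggregation. Names and structure are illustrative only.
defmodule LineProtocol do
  # event_name is a list of atoms like [:http, :request];
  # measurements become InfluxDB fields, tags become InfluxDB tags.
  def encode(event_name, measurements, tags) do
    measurement = Enum.join(event_name, ".")
    tag_str = Enum.map_join(tags, "", fn {k, v} -> ",#{k}=#{v}" end)
    field_str = Enum.map_join(measurements, ",", fn {k, v} -> "#{k}=#{v}" end)
    "#{measurement}#{tag_str} #{field_str}"
  end
end
```

A `:telemetry.attach/4` handler would call something like this for every event, which is exactly where the bandwidth tradeoff mentioned above comes from.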
As the author of the mentioned library, I can confirm what the others have said:
The InfluxDB reporter simply pushes the events, and it’s up to the InfluxDB user what kind of processing they’d like to apply on top of them. Sampling was actually planned as the next step in the library’s improvement.
If you have any questions or suggestions for the implementation, I’d be happy to help.
Is InfluxData involved at all with https://opentelemetry.io/? I believe the first draft of the metrics specification is being finished up now.
If you are involved in OpenTelemetry or plan to support it then I think the best way for InfluxData to contribute to the BEAM ecosystem is through the Erlang/Elixir libraries, https://github.com/open-telemetry/opentelemetry-erlang
I should also mention, for those looking for how to instrument and report their metrics: the idea is that you can use OpenTelemetry for recording metrics in your application and reporting them to the OpenTelemetry Collector, https://github.com/open-telemetry/opentelemetry-collector, which will then report to Influx. It should also be able to receive from Influx-instrumented code, so if you have projects in other languages already instrumented with some Influx library, they can report to the same agent/collector.
Thank you for describing this reporting scheme. This was the first time I’d heard about OpenTelemetry. The whole infrastructure looks very interesting, but maybe a bit far from my current needs. I will keep an eye on it, though.
OpenTelemetry is actually a pretty big project backed by the CNCF (also known for backing k8s, Prometheus, Fluentd, and a lot of other projects), so I would say it is worth trying (soon, as it is still in progress).
Yes, these tradeoffs are exactly the question. In any case, we shouldn’t send raw events to InfluxDB directly. Using Telegraf, we can send them over the loopback interface via UDP, so bandwidth isn’t much of a concern. It is then up to Telegraf to aggregate the metrics and resend them to the InfluxDB server.
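For reference, the Telegraf side of that setup is just a StatsD input listening on UDP plus an InfluxDB output. A minimal `telegraf.conf` sketch (addresses, interval, and database name are placeholders):

```toml
[agent]
  interval = "10s"                   # Telegraf aggregates/flushes on this cadence

[[inputs.statsd]]
  protocol = "udp"                   # receive over the loopback interface via UDP
  service_address = ":8125"

[[outputs.influxdb]]
  urls = ["http://127.0.0.1:8086"]
  database = "telemetry"             # placeholder database name
```

With this in place, the application only pays the cost of local UDP sends; aggregation and delivery to InfluxDB happen in Telegraf.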
Even in this case, collecting metrics with something like :ets.update_counter inside the reporter should be more efficient.
So we have two variants, each with its pros and cons.
The current solution I tried is to use StatsD through Telegraf. That works, kinda… You end up creating a measurement for everything instead of a combination of measurements with fields, tags, etc. This reduces the ability to query that data by a big margin.
If there is a plan to release a native reporter in the foreseeable future, that would be awesome. Otherwise we need to rethink our metrics storage.
Are you planning on releasing an official telemetry reporter for InfluxDB?
Wow, I wasn’t aware of the templates, thanks! I’ve managed to make it behave now, but it would still be nice to have a native reporter. Then there’s no need to maintain all the possible templates, and you can use it as-is without wondering whether it’s going to match the right templating rule in Telegraf.
IIRC the StatsD exporter for Telemetry supports DataDog-like tags, so with proper configuration Telegraf becomes just a translator from one syntax to the other.
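If that’s the route, the relevant knobs (to my knowledge) are the Telegraf StatsD input’s DataDog extensions and its name templates; a hedged fragment (the template pattern is only an example):

```toml
[[inputs.statsd]]
  service_address = ":8125"
  datadog_extensions = true   # parse DataDog-style "|#key:value" tags into InfluxDB tags

  # Fallback templates for plain StatsD names,
  # e.g. "phoenix.endpoint.stop.duration" -> measurement + field
  templates = [
    "*.*.* measurement.measurement.field"
  ]
```

With tags carried in the DataDog extension, most of the template maintenance mentioned above should become unnecessary.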
There is the EEF Observability WG, which is working on monitoring facilities in Erlang (and, naturally, Elixir). The current consensus is to use telemetry as a backend-agnostic event dispatcher. On top of that you can use any metrics gatherer you like, for example the telemetry_influxdb mentioned above. If you want a more “holistic” solution for monitoring your applications, you can check out the opentelemetry application, which will provide metrics and traces (in the future maybe even logs) together with tooling to dispatch that data to various storage and processing engines.