Observability ecosystem, aka what to choose / how to organise?

In short, I am rather confused about which direction to choose to set up observability for a Phoenix app I am working on (an API, actually), or rather which components are needed and how to connect the dots between them. And now for the longer version :slight_smile:

So far I have managed to set up an observability cluster like this:

  • otel collector
    • collects traces emitted from phoenix app
    • using otel/opentelemetry-collector-contrib:0.83.0 docker image
  • loki
    • grafana source
    • for collecting logs
    • using grafana/loki:2.8.4 docker image
    • not working at the moment
  • tempo
    • grafana source
    • receives traces from the otel collector and serves them to grafana
    • using grafana/tempo:2.2.1 docker image
  • prometheus
    • grafana source
    • scrapes metrics from phoenix app
    • using prom/prometheus:v2.37.9 docker image
  • grafana
    • for visualising all the metrics, traces, logs
    • using grafana/grafana:8.2.6 docker image
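
For reference, the collector side of this is wired roughly like the sketch below (trimmed; the endpoints and the `tempo` hostname are placeholders from my docker-compose setup, so adjust as needed):

```yaml
# otel-collector.yaml -- trimmed sketch; endpoints/hostnames are assumptions
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
```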

And from phoenix app perspective, it has several parts/options:

  • opentelemetry*
  • prom_ex*
  • telemetry*

As far as I can tell, opentelemetry* covers traces only. With only this in the app, plus otel collector + tempo + grafana, I can get the traces via tempo in grafana and inspect them, no issues. However, I see no metrics exposed; at least I could not find a way to do that. tempo also seems to accept OTLP directly, so maybe the collector is not needed and the app could send straight to tempo (the architecture above with both tempo and the otel collector exists only because we already use the collector for another service in our stack).
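For context, the app-side wiring I have for traces looks roughly like this (a sketch; the service name and collector endpoint are placeholders for my setup):

```elixir
# config/runtime.exs -- sketch; service name and endpoint are assumptions
config :opentelemetry, :resource, service: %{name: "my_phoenix_api"}

config :opentelemetry,
  span_processor: :batch,
  traces_exporter: :otlp

config :opentelemetry_exporter,
  otlp_protocol: :http_protobuf,
  otlp_endpoint: "http://localhost:4318"
```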

The prom_ex* part, along with prometheus + grafana, solves the metrics part of the system. I can see e.g. VM or Ecto metrics in auto-generated dashboards. It ships with a set of predefined metrics it exposes, which is nice. It does not have anything to do with the traces or the telemetry* part though. I found that prom_ex plans to work with the opentelemetry stack in the future, but we’re not there yet (unless I’m mistaken).
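My prom_ex setup is essentially the generated module with a few plugins enabled (a sketch; `MyApp`/`my_app` are placeholder names):

```elixir
# lib/my_app/prom_ex.ex -- sketch; module and app names are assumptions
defmodule MyApp.PromEx do
  use PromEx, otp_app: :my_app

  @impl true
  def plugins do
    [
      PromEx.Plugins.Application,
      PromEx.Plugins.Beam,
      {PromEx.Plugins.Phoenix, router: MyAppWeb.Router},
      PromEx.Plugins.Ecto
    ]
  end

  @impl true
  def dashboards do
    [
      {:prom_ex, "application.json"},
      {:prom_ex, "beam.json"},
      {:prom_ex, "phoenix.json"},
      {:prom_ex, "ecto.json"}
    ]
  end
end
```

plus `plug PromEx.Plug, prom_ex_module: MyApp.PromEx` in the endpoint so prometheus has something to scrape.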

The telemetry* part relates to Telemetry — Phoenix v1.7.7 and related exporters; e.g. one could use telemetry_metrics_prometheus_core | Hex to export metrics to prometheus. This looks like the preferred way of doing things in Phoenix (basing that on the presence of this section in the documentation). But it does mean I need to define the metrics by hand and prepare the related dashboards, something that comes out of the box with prom_ex*.
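What I mean by "by hand" is along these lines, a trimmed version of the Phoenix-generated telemetry module with a prometheus reporter as a child (a sketch; module names are placeholders, and the reporter exposes its own scrape endpoint, on port 9568 by default if I read the docs right):

```elixir
# lib/my_app_web/telemetry.ex -- trimmed sketch of the generated module
defmodule MyAppWeb.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    children = [
      # prometheus reporter; serves /metrics for scraping
      {TelemetryMetricsPrometheus, metrics: metrics()}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  def metrics do
    [
      # each metric has to be declared by hand
      counter("phoenix.router_dispatch.stop.duration"),
      last_value("vm.memory.total", unit: :byte)
    ]
  end
end
```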

I am not entirely sure that this is the most optimal way of doing things. It seems to me that we have some overlap going on. At least telemetry* and prom_ex seem to be doing the same thing. Any pointers? Maybe some combo is actually a good idea, e.g. do the defaults via prom_ex and the custom stuff via the telemetry approach (even though custom things can be done in both)? Has anyone had success in connecting the dots across logs/traces/metrics, making it possible to navigate from one to the other in grafana even though the sources are actually different?
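For illustration, the kind of wiring I imagine for the logs-to-traces link is a plug that puts the current otel trace id into the Logger metadata, so a Grafana derived field on the Loki datasource could link log lines to Tempo. Something like this (a sketch on my side; the module name is made up and I'm assuming `OpenTelemetry.Tracer.current_span_ctx/0` and `OpenTelemetry.Span.trace_id/1` behave as I think they do):

```elixir
# sketch of a plug putting the otel trace id into Logger metadata,
# so Loki log lines can link back to Tempo via a Grafana derived field
defmodule MyAppWeb.TraceContextPlug do
  require OpenTelemetry.Tracer

  def init(opts), do: opts

  def call(conn, _opts) do
    span_ctx = OpenTelemetry.Tracer.current_span_ctx()

    if span_ctx != :undefined do
      trace_id =
        span_ctx
        |> OpenTelemetry.Span.trace_id()
        |> Integer.to_string(16)
        |> String.downcase()
        # Tempo expects the 32-char hex form
        |> String.pad_leading(32, "0")

      Logger.metadata(trace_id: trace_id)
    end

    conn
  end
end
```

plugged into the router pipeline after the span has been started, with `trace_id` included in the logger format so it ends up in Loki.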

Thank you for your time! :bowing_man:


I think the standard at the moment would be to use OpenTelemetry for traces and Prometheus (via PromEx) for metrics.

Once OpenTelemetry metrics are complete and widely used, they will be compatible with Prometheus, so you should be fine with such a setup.

It seems to me that we have some overlap going on. At least telemetry* and prom_ex seem to be doing the same thing.

In my opinion the name of the telemetry library might be a bit misleading, as what it offers is actually instrumentation rather than telemetry (see the definitions here).

From the readme of telemetry:

Telemetry is a lightweight library for dynamic dispatching of events, with a focus on metrics and instrumentation. Any Erlang or Elixir library can use telemetry to emit events. Application code and other libraries can then hook into those events and run custom handlers.

So the telemetry library is basically a way for your code to expose hooks and to dynamically add/remove handlers for those hooks. It is used mostly to instrument code. Telemetry handlers are where the integration with the actual telemetry solution takes place (via OpenTelemetry, Prometheus, or whatever you’d like…). At least that’s the standard in the libraries, and it allows the ecosystem to evolve and be pluggable. In your own code you could skip telemetry and, for example, emit OpenTelemetry traces directly… or choose to emit telemetry events and implement the integration (for example with PromEx) as a telemetry handler. That’s how all the standard PromEx integrations are implemented.
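A minimal sketch of that emit/handle split (the event name, measurements, and handler are all made up for illustration):

```elixir
# a handler hooks into the event -- this is where a metrics backend
# (PromEx, a prometheus reporter, ...) would plug in instead of IO.inspect
:telemetry.attach(
  "log-job-done",
  [:my_app, :worker, :job_done],
  fn _event_name, measurements, metadata, _config ->
    IO.inspect({measurements, metadata}, label: "job_done")
  end,
  nil
)

# instrumented code just emits the event, with no idea who is listening
:telemetry.execute(
  [:my_app, :worker, :job_done],
  %{duration: 125},
  %{queue: "default"}
)
```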

Hope that helps a bit!


I think the standard at the moment would be to use OpenTelemetry for traces and Prometheus (via PromEx) for metrics.

Yeah, I thought that would be the case.

In my opinion the name of the telemetry library might be a bit misleading, as what it offers is actually instrumentation rather than telemetry

I was actually thinking of the entire set of libraries, including an appropriate exporter. For prometheus/grafana that might be, for example, GitHub - beam-telemetry/telemetry_metrics_prometheus: Complete Telemetry.Metrics Reporter solution for Prometheus. But yes, it boils down to metrics being gathered separately from opentelemetry.

In your own code you could skip telemetry and for example emit OpenTelemetry traces directly… or choose to emit telemetry events and implement integration (for example with PromEx) as a telemetry handler.

If I understand this correctly, it is possible to emit telemetry events via the Phoenix-style generated Telemetry supervisor, and then, instead of using the above-mentioned exporter, hook into them with PromEx? Metrics defined via the generated supervisor do not automatically show up on the metrics endpoint PromEx exposes. At least I could not find a way to do so (as in, I don’t see the supervisor’s metrics in Grafana). If that were possible, it would be nicer than having a separate mechanism for PromEx-exposed metrics vs the supervisor approach.

I assume these should somehow be connectable, given both use telemetry to register events, but I can’t find a way to nicely combine the two. I can of course have two mechanisms in place, but that seems like a very strange idea. And PromEx offers nice stuff like generated and maintained dashboards. On the other hand, the supervisor seems to be the default Phoenix way (since it is generated for a new project).
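The closest thing I could come up with is a custom PromEx plugin that reuses the `Telemetry.Metrics` definitions from the generated module, something along these lines (a sketch, untested on my side; module names are placeholders, and I'm not sure this is the intended approach):

```elixir
# sketch of a PromEx plugin reusing the metrics list from the
# Phoenix-generated MyAppWeb.Telemetry module, so everything ends up
# on the single PromEx /metrics endpoint
defmodule MyApp.PromEx.AppMetrics do
  use PromEx.Plugin

  @impl true
  def event_metrics(_opts) do
    PromEx.MetricTypes.Event.build(
      :my_app_telemetry_event_metrics,
      MyAppWeb.Telemetry.metrics()
    )
  end
end
```

with the plugin added to the `plugins/0` list of the PromEx module, and the separate prometheus reporter removed from the supervisor so the metrics aren't reported twice.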

Thanks for the input!

Apologies for shilling, but just making sure it’s on your radar: have you checked this all-in-one thingy?

I have already used it successfully for both OpenTelemetry spans (in Golang and Rust) and for logs (Golang). I haven’t open-sourced the Golang log converter / exporter yet, but I can very easily show it to you to give you an idea of how to use it – assuming you have the time budget to put the finishing touches on your setup and are not strictly looking for something 100% ready. (It’s pretty easy to use, but it has to be said that not everyone wants to write their own HTTP client for it; though that’s also very easy in Elixir.)

I have been impressed with the speed and reliability of OpenObserve, but I admit I haven’t looked into their aggregation abilities yet, so my recommendation could be misguided.

That’s an interesting project :+1: Unfortunately the stack described in the initial post is a given and cannot be changed at this time. Thanks for the tip!