Telemetry_metrics_logflare - Ship your Telemetry metrics to Logflare

For your review … telemetry_metrics_logflare

You can now easily ship your Telemetry events to Logflare. This setup is a bit different in that this reporter does not aggregate metrics in your app. It ships individual events to Logflare so we can store each metric event (plus the metadata) and let BigQuery do the aggregations after the fact. This keeps you from having to create hundreds, thousands, or millions of different metrics.
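For context, hooking a Telemetry.Metrics reporter into an app generally looks like the sketch below. The TelemetryMetricsLogflare child name and its options are my assumption for illustration, not the library's confirmed API; the metric definitions themselves are standard Telemetry.Metrics.

```elixir
# In your application supervisor. TelemetryMetricsLogflare and its options are
# assumptions for illustration -- check the package docs for the real names.
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      MyAppWeb.Endpoint,
      # Hypothetical reporter child; standard Telemetry.Metrics reporters
      # take a list of metric definitions like this.
      {TelemetryMetricsLogflare, metrics: metrics()}
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end

  defp metrics do
    import Telemetry.Metrics

    [
      # Each matching event is shipped individually; aggregation happens later in BigQuery.
      summary("phoenix.endpoint.stop.duration", unit: {:native, :millisecond}),
      last_value("vm.memory.total"),
      summary("my_app.repo.query.total_time", unit: {:native, :millisecond})
    ]
  end
end
```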

It also lets you do dynamic aggregations on historical data. Imagine sending over the whole Conn as metadata on a duration metric. That might be a bit excessive depending on what you stuff in your Conn, but it’s possible and would let you answer just about any unknown unknown in the future.
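To give a concrete (if less extreme) picture, the standard Telemetry.Metrics options already let you copy whatever you want out of the conn into the event before it ships; the particular fields below are just an example, not anything the reporter does by default:

```elixir
import Telemetry.Metrics

# The [:phoenix, :endpoint, :stop] event carries the conn in its metadata,
# so you can copy any of its fields into tags that ride along with the metric.
summary("phoenix.endpoint.stop.duration",
  unit: {:native, :millisecond},
  tags: [:method, :request_path, :status],
  tag_values: fn %{conn: conn} ->
    %{
      method: conn.method,
      request_path: conn.request_path,
      status: conn.status
    }
  end
)
```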

Examples

  • Avg response time of 4xx HTTP status code responses by second
  • Average memory per minute on a specific node
  • p99 of queries run against the properties table by second

1 Like

Looks great!

I recently started looking around for a SaaS to store Telemetry metrics.
What do you mean by Supported Metrics? Isn’t Telemetry general enough for a reporter to process an event without knowing what name it has? What happens if I want to collect custom events specific to my application?

1 Like

Yes, it is … we are just trying to shape some of these so that similar data ends up in the same place in the metadata. That way you could create an alert like m.tag:"error" and it would get you all errors for all metrics we’ve officially integrated. Definitely still playing with the shape of these payloads.

We do still need a generic one so we can support custom events. You can also just use our Logger backend to send over a log event with some custom metadata. That is actually how I’ve set up our entire dashboard.
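For example, once the Logflare backend is wired into your Logger config, a plain Logger call with structured metadata is all a custom event needs (the key names below are made up for illustration):

```elixir
# Assuming the Logflare Logger backend is configured for your app,
# the metadata you pass becomes structured fields you can query with LQL.
require Logger

Logger.info("inventory.product.sold",
  inventory: %{
    product: %{color: "blue", status: "sold", qty: 2}
  }
)
```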

This should land in the next week or so.

1 Like

Well, it is important to remember that these are still logs, not metrics. So while it’s a nice addition, in real systems I would beware of using such a setup, as it can easily fill your whole IO with “metrics” alone, leaving no bandwidth for the “useful work”. If you want something like that, then you should use a local aggregator so you do not overload the network by sending all of those events one by one.

2 Likes

I’ll preface this whole thing with the fact that I’m definitely not trying to start a fight with the guy who wrote the new structured Logger stuff. So if anything here comes off negatively, that’s not the tone I’m going for; maybe I’m being a bit sarcastic sometimes, but it’s all in jest really … I appreciate the discussion.

Anyways … how does this all relate to tracing then? This is exactly what tracing is, except each event is even larger than the metric events we’re generating. We will have an OpenTelemetry exporter too.

Mostly I’m tired of vendors trying to act like correlating metrics and logs via a timestamp is actually helping anyone.

At some point, with some data, you should aggregate in the client, but in practice, for probably 95% of apps out there, you won’t need to worry about that, and if you reach that scale it’s a great problem to have.

There are definitely some things we can do here to mitigate any issues resource-constrained production apps might have rolling this out. We are already batching events over to Logflare, so we’re NOT doing 1 outbound request per metric event.
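Roughly, the idea is the sketch below: buffer events and flush the whole batch in one request. The real reporter’s internals and the interval/size numbers here are assumptions, not its actual implementation.

```elixir
defmodule MetricBatcher do
  @moduledoc "Illustrative only: buffer metric events and flush them in one request."
  use GenServer

  @flush_interval :timer.seconds(5)   # assumed values, not the library's defaults
  @max_batch_size 250

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def push(event), do: GenServer.cast(__MODULE__, {:push, event})

  @impl true
  def init(_opts) do
    schedule_flush()
    {:ok, []}
  end

  @impl true
  def handle_cast({:push, event}, buffer) do
    buffer = [event | buffer]
    if length(buffer) >= @max_batch_size, do: {:noreply, flush(buffer)}, else: {:noreply, buffer}
  end

  @impl true
  def handle_info(:flush, buffer) do
    schedule_flush()
    {:noreply, flush(buffer)}
  end

  defp flush([]), do: []

  defp flush(buffer) do
    # One outbound request for the whole batch; ship_to_logflare/1 stands in
    # for whatever HTTP client call the real reporter makes.
    buffer |> Enum.reverse() |> ship_to_logflare()
    []
  end

  defp ship_to_logflare(_events), do: :ok

  defp schedule_flush, do: Process.send_after(self(), :flush, @flush_interval)
end
```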

We should probably provide a config option to sample metric events. Probably two configs … one to start sampling at N events, and one to set the sample rate when sampling (edit: planned).
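Something along these lines is what I have in mind; the option names and defaults below are placeholders, since none of it exists yet.

```elixir
defmodule MetricSampler do
  @moduledoc "Illustrative only: start sampling once event volume passes a threshold."

  # Hypothetical config keys; not actual telemetry_metrics_logflare options.
  @sample_threshold Application.compile_env(:my_app, :metrics_sample_threshold, 1_000)
  @sample_rate Application.compile_env(:my_app, :metrics_sample_rate, 0.1)

  @doc "Decide whether to keep an event, given how many were seen in the current window."
  def keep?(events_seen) when events_seen < @sample_threshold, do: true
  def keep?(_events_seen), do: :rand.uniform() < @sample_rate
end
```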

Ha, thanks? :wink:

What constitutes a “real system”? I’ve tested this up to 1000 requests a second and all seems good to me.

Understanding the production characteristics of what you’ve built is really second only to what you’ve built itself. And when what you’ve built isn’t working, understanding why becomes even more important.

1 Like

It is like a pyramid: you measure what you can (metrics), you log messages where you need some debugging (logs), and you trace at the high level. So you end up with a lot of metrics, some logs, and a few traces. Often you even implement sampling on traces to reduce their volume, because, as you noticed, some traces can be huge.

Each of these is also useful in different situations:

  • metrics are used for “early warning”: we want to know what is going on and be able to react before there is a problem
  • logs are used to find out where the problem is, to check for bugs, and sometimes to find malicious parties (fail2ban)
  • traces are used for profiling applications, finding bottlenecks, and monitoring how services interact with each other

So you see why it is often important to differentiate between them.

In short:

  • Metrics tell you when something is happening in your application
  • Logs tell you what is happening in your application
  • Traces tell you why and how something is happening in your application

Sometimes it helps, sometimes it doesn’t.

I would say that it depends on the amount of data you want to gather. Often, with broad monitoring, it will come to you sooner rather than later, even before “reaching scale”.

Great.

I meant systems with many metrics and heavy traffic. If we are monitoring only HTTP requests then it will often be enough; however, as in the article I linked, even if you batch them you are limited by the size of each log entry. Note that each log entry will contain about 28 bytes of data that isn’t really needed there (the timestamp), since we are more interested in the rate of the events than in the exact time at which they happen.

Yeah, for starting projects it may be a useful and interesting solution; however, if you grow even a little you may encounter some problems (AFAIK, please correct me if I am wrong):

  • The Logflare UI does not support comparing and looking for correlations between graphs of the metrics
  • There is no way to do more complex analysis, like computing derivatives, trends, etc., in Logflare queries
  • There is no alerting mechanism built into Logflare, which is one of the main reasons for using metrics
  • There is only one graph in the Logflare UI, a bar graph, which shows only the rate of events. That is useful, but sometimes you need other graphs (heat maps, gauges, etc.). I do not really see how you would check CPU or memory usage with such a UI

So as I said, it is useful, but you will very quickly grow out of it and will need a “real” metrics-gathering setup.

2 Likes

Sure, sure, yeah, and there’s no reason why you can’t aggregate metrics on the client and then send them over when you need to.
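If someone does get to that point, a local aggregator can be as small as rolling up counts and sums per window and shipping one summary event, roughly like this sketch (every name in it is made up for illustration):

```elixir
defmodule LocalAggregator do
  @moduledoc "Illustrative only: aggregate counts/sums locally, ship one summary per window."
  use GenServer

  @window :timer.seconds(10)  # assumed window size

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # e.g. record({:http_status, 404}, 1) or record(:request_duration_ms, 87)
  def record(key, value), do: GenServer.cast(__MODULE__, {:record, key, value})

  @impl true
  def init(_opts) do
    Process.send_after(self(), :ship, @window)
    {:ok, %{}}
  end

  @impl true
  def handle_cast({:record, key, value}, acc) do
    {:noreply,
     Map.update(acc, key, %{count: 1, sum: value}, fn %{count: c, sum: s} ->
       %{count: c + 1, sum: s + value}
     end)}
  end

  @impl true
  def handle_info(:ship, acc) do
    Process.send_after(self(), :ship, @window)
    # ship_summary/1 stands in for sending one aggregated event to Logflare.
    ship_summary(acc)
    {:noreply, %{}}
  end

  defp ship_summary(acc) when map_size(acc) == 0, do: :ok
  defp ship_summary(_acc), do: :ok
end
```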

We have solutions for your examples, I think. There is definitely work to do here on the features themselves and on surfacing the more advanced use cases.

With BigQuery and the way we implemented it, you get Google Data Studio for free. While not perfect, it does let you build out any kind of dashboard you can think of. We will be adding some initial dashboarding features of our own too, though.

Mostly true, but because we’ve set up BigQuery the way we have, if you’re on a paid account you can query your data directly with SQL and include that in your Data Studio reports.

With LQL … you can do something like m.inventory.product.color:~"blue|black" and it’ll give you a bar chart of events where the product is blue or black via a regex match. You can then do something like m.inventory.product.color:~"blue|black" m.inventory.product.status:"sold" c:sum(m.inventory.product.qty) and the chart will be a timeseries of the sum of blue or black sold products moving through your inventory.

You can also use the same LQL queries to route logs to another source and alert when that source gets events. Some examples would be:

  • m.vm.memory.last_values.total:>8000000000 to get alerts when your RAM usage is over 8GB
  • m.phx.status:>499 to get alerted about all your 5xx responses
  • m.level:"error" on a LogflareLogger source to get alerted about all your error logs
  • m.phx.request_path:~"/signup" to regex match paths with /signup in them for signup alerts
  • m.phx.endpoint.stop.every.duration:>5000000000 to get all requests over 5 seconds
  • m.repo_name.repo.query.every.total_time:>5000000000 to get all Ecto queries over 5 seconds

2 Likes