Aggregating telemetry metrics for each dynamically spawned job

Hi there,

We’re experimenting with the telemetry library and using telemetry_metrics to define metrics. Our use case is that we dynamically spawn jobs that typically run for a while and terminate once their work is done. We’d like to collect metrics from these jobs, such as bytes processed, chunks of data processed, and the number of occurrences of certain events, aggregated per job. To achieve that, we currently tag the metrics with :job_id.

The problem is that I don’t see a way to tell telemetry that a job has terminated and that the corresponding tag value should be forgotten. Because of that, reporters like telemetry_metrics_prometheus keep reporting that value forever, which leads to a memory leak and growing overhead when thousands of jobs are spawned and terminated. According to the answer to this issue, Prometheus is not suitable for aggregating this kind of data, but that seems to be a limitation of telemetry itself rather than of Prometheus. We could try OpenTelemetry instead, but AFAIK it’s mostly focused on traces and its metrics support is still in alpha, while metrics are exactly what we need.
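
To make the setup concrete, here is roughly what our metric definitions and event emission look like. The event and metric names are just placeholders for illustration; the important part is the :job_id tag.

```elixir
defmodule MyApp.Metrics do
  import Telemetry.Metrics

  # Metric definitions handed to a reporter (e.g. telemetry_metrics_prometheus).
  # Each metric is tagged with :job_id, so every job produces its own series.
  def metrics do
    [
      sum("my_app.job.chunk.bytes", tags: [:job_id]),
      counter("my_app.job.chunk.count", tags: [:job_id]),
      counter("my_app.job.event.count", tags: [:job_id, :kind])
    ]
  end
end

# Inside a job process, each processed chunk emits an event carrying the job id:
:telemetry.execute(
  [:my_app, :job, :chunk],
  %{bytes: byte_size(chunk), count: 1},
  %{job_id: job_id}
)
```

With thousands of short-lived jobs, this creates a new time series per job_id that the reporter never drops.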

I believe there must be a correct way of approaching this problem. Any suggestions appreciated :wink:


I don’t have an answer, but I wanted to share that in the past I faced the same issue with assigning unique tags to metrics (very high cardinality). We were using DataDog, and eventually their support notified us that if we continued that way our bills would become enormous. They suggested we stop using metrics for those kinds of tags and switch to APM traces instead, which provide “infinite-cardinality” attributes.
