We’re experimenting with the telemetry library, using telemetry_metrics to define metrics. Our use case is that we dynamically spawn jobs that usually run for a while and terminate once their work is done. We’d like to collect metrics from these jobs, such as bytes processed, chunks of data processed, and the number of occurrences of certain events. What we want from telemetry is this data aggregated per job. To achieve that, we’re currently using a :job_id tag.
The problem is that I don’t see a way to tell telemetry that a job has terminated and that it should forget the given tag value. Because of that, libraries like telemetry_metrics_prometheus keep reporting that value forever, which leads to a memory leak and growing overhead when thousands of jobs are spawned and terminated. According to the answer to this issue, Prometheus is not suitable for aggregating this kind of data, but it seems to me that this is rather a limitation of telemetry itself. We could try OpenTelemetry instead, but as far as I know it’s mostly about traces and its metrics support is still in alpha, whereas metrics are what we need first and foremost.
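For concreteness, here is a minimal sketch of the kind of setup I mean. The module name, event name, and metric names are made up for illustration; only the `Telemetry.Metrics` functions and `:telemetry.execute/3` are the real library calls.

```elixir
defmodule MyApp.JobMetrics do
  # Hypothetical module; event and metric names are illustrative only.
  import Telemetry.Metrics

  # Metric definitions handed to a reporter such as telemetry_metrics_prometheus.
  def metrics do
    [
      # Sums the :bytes measurement of [:my_app, :job, :chunk] events per job_id.
      sum("my_app.job.chunk.bytes", tags: [:job_id]),
      # Counts [:my_app, :job, :chunk] events per job_id.
      counter("my_app.job.chunk.count", tags: [:job_id])
    ]
  end

  # Called by a job each time it finishes processing a chunk.
  def chunk_processed(job_id, bytes) do
    :telemetry.execute([:my_app, :job, :chunk], %{bytes: bytes}, %{job_id: job_id})
  end
end
```

Every distinct job_id passed to `chunk_processed/2` produces a new tag value for the reporter to track, and as far as I can tell nothing in these definitions lets a job say "I'm done, drop my series".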
I believe there must be a correct way of approaching this problem. Any suggestions are appreciated.