Aggregating data: store in process, or periodically send asynchronously?

In my library https://github.com/zachdaniel/spandex, I aggregate latency data in the form of spans using the process dictionary. As the process does its work, it can generate hundreds or even thousands of spans, depending on usage. Currently it just keeps track of them in a list and sends them when the entire trace/span is complete.
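For concreteness, here is a minimal sketch of that pattern, assuming nothing about Spandex's actual internals (the key name and both functions are illustrative): spans accumulate in a list under a process-dictionary key and are handed off in one batch when the trace finishes.

```elixir
defmodule TraceSketch do
  @key :trace_spans

  # Prepend each completed span (a small map) to the list stored in the
  # process dictionary of the traced process itself.
  def record_span(span) when is_map(span) do
    Process.put(@key, [span | Process.get(@key, [])])
    :ok
  end

  # At the end of the trace, take everything out and hand it to a sender.
  def finish_trace(send_fun) do
    spans = Process.get(@key, [])
    Process.delete(@key)
    send_fun.(Enum.reverse(spans))
  end
end
```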

I’d like to get an intuition for how these spans (relatively small maps, with an occasional large string value) affect the memory footprint of a process. Additionally, what kind of tradeoff should I expect if I were to periodically ship spans in the middle of the process? The naive approach would be a threshold, say 20, of completed spans that warrants sending: whenever 20 or more spans are complete, I ship the completed ones. This lowers the memory footprint of the process, but shipping periodically requires sending the data to another process: I can’t block the current process while the spans are sent (it would be a horrible tracing library if the trace blocked on network requests in the middle, I imagine), so I have to start another process to do it. Additionally, if sending periodically is the sensible solution, is spawn(fn -> do_work() end) enough?
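To make the threshold idea concrete, here is a rough sketch; the threshold value, the key name, and ship_spans/1 are hypothetical stand-ins, not Spandex API. One BEAM detail relevant to the memory question: the closure passed to spawn/1 copies the captured span list into the new process's heap, but binaries over 64 bytes are stored off-heap and shared by reference, so the occasional large string should not be copied wholesale on handoff.

```elixir
defmodule PeriodicFlushSketch do
  @key :completed_spans
  # Hypothetical threshold; tune against real memory measurements.
  @flush_threshold 20

  def record_span(span) do
    spans = [span | Process.get(@key, [])]

    if length(spans) >= @flush_threshold do
      to_ship = Enum.reverse(spans)
      # Hand off to a throwaway process so the traced process never
      # blocks on the network; the list is copied into the new heap.
      spawn(fn -> ship_spans(to_ship) end)
      Process.put(@key, [])
    else
      Process.put(@key, spans)
    end
  end

  # Stand-in for the real network send.
  defp ship_spans(spans), do: IO.inspect(spans, label: "shipping")
end
```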

Seems fine to me, though wrappers around Task are more traditionally Elixir. You can put them under a supervisor too, so you can introspect them with tools.
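A sketch of that suggestion using Task.Supervisor; the supervisor name, the sample span, and the inline sender are assumptions for illustration:

```elixir
# Started once, e.g. in your application's supervision tree:
{:ok, _sup} = Task.Supervisor.start_link(name: MyApp.TaskSupervisor)

ship_spans = fn spans -> IO.inspect(spans, label: "shipping") end
completed_spans = [%{name: "db.query", duration_us: 120}]

# Fire-and-forget send, supervised and visible to tools like :observer:
Task.Supervisor.start_child(MyApp.TaskSupervisor, fn ->
  ship_spans.(completed_spans)
end)
```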

Why not store it all in ETS, though? Then you can have an external process dump the ETS table out somewhere on occasion.
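A sketch of what that could look like; the table name, options, and drain strategy are assumptions rather than a worked-out design:

```elixir
defmodule EtsBufferSketch do
  @table :span_buffer

  # Create a public table once (e.g. at application start) so any traced
  # process can write to it without owning state.
  def init do
    :ets.new(@table, [:named_table, :public, :duplicate_bag, write_concurrency: true])
  end

  # Traced processes insert completed spans keyed by trace id.
  def record_span(trace_id, span) do
    :ets.insert(@table, {trace_id, span})
  end

  # An external process (e.g. a GenServer on a timer) drains the table.
  # Note: tab2list + delete_all_objects races with concurrent writers; a
  # real implementation might drain per trace id, e.g. with :ets.take/2.
  def drain(ship_fun) do
    spans = :ets.tab2list(@table)
    :ets.delete_all_objects(@table)
    ship_fun.(spans)
  end
end
```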


What is the reason behind storing the spans in the process dictionary rather than directly as the GenServer state (or, as @OvermindDL1 suggested, in an ETS table)?

Sorry all, had a very busy weekend! The main reason for not using a separate process to keep state is that many very short-lived processes would want to start, log traces, and then die, and I was advised against requiring each of those processes to start a gen_server alongside it. Think 20-30 processes that each live for only a second or less.