Use Flow for data aggregation

I am thinking of using Flow for data aggregation.

I have a large amount of data. Let's say the CPU usage statistics across 100+ servers: every minute, each core on each server has some CPU metrics collected, such as

     server1, cpu0, 00:01, 0.6, 0.4, 0.0
     server1, cpu1, 00:01, 0.6, 0.4, 0.0 
     server1, cpu2, 00:01, 0.6, 0.4, 0.0
     server1, cpu3, 00:01, 0.6, 0.4, 0.0
     server2, cpu0, 00:01, 0.6, 0.4, 0.0
     server2, cpu1, 00:01, 0.6, 0.4, 0.0 
     server2, cpu2, 00:01, 0.6, 0.4, 0.0
     server2, cpu3, 00:01, 0.6, 0.4, 0.0

     server1, cpu0, 00:02, 0.6, 0.4, 0.0
     server1, cpu1, 00:02, 0.6, 0.4, 0.0 
     server1, cpu2, 00:02, 0.6, 0.4, 0.0
     server1, cpu3, 00:02, 0.6, 0.4, 0.0
     server2, cpu0, 00:02, 0.6, 0.4, 0.0
     server2, cpu1, 00:02, 0.6, 0.4, 0.0 
     server2, cpu2, 00:02, 0.6, 0.4, 0.0
     server2, cpu3, 00:02, 0.6, 0.4, 0.0

I want to use Flow to aggregate the above data into averages over a configurable interval (say 5 min, 10 min, …):

  • Partition with a custom hash function, using (server name + CPU number) as the hash key
  • Reduce into arrays
  • Next, with another custom hash, partition the data into the time ranges 0-5, 5-10, 10-15, … 50-55, 55-60
  • Reduce each array by summing the values up, remembering the total count
  • Calculate the average from the sum and the count
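Not the poster's actual code, but a minimal sketch of how those steps could look in Flow (the `{server, cpu, minute, usage}` tuple shape is an assumption; also note that the time bucket can simply be folded into the reduce key, which collapses steps 3-4 into the first reduce):

```elixir
# Sketch only: assumes samples are already parsed into
# {server, cpu, minute, usage} tuples, e.g. {"server1", "cpu0", 1, 0.6}.
window = 5

samples
|> Flow.from_enumerable()
# hash by {server, cpu} so all samples for one core land in one partition
|> Flow.partition(key: fn {server, cpu, _min, _usage} -> {server, cpu} end)
# keep a running {sum, count} per {server, cpu, time_bucket}
|> Flow.reduce(fn -> %{} end, fn {server, cpu, min, usage}, acc ->
  bucket = div(min, window)
  Map.update(acc, {server, cpu, bucket}, {usage, 1}, fn {sum, n} ->
    {sum + usage, n + 1}
  end)
end)
# each accumulator entry is emitted as {{server, cpu, bucket}, {sum, count}}
|> Enum.to_list()
|> Enum.map(fn {key, {sum, n}} -> {key, sum / n} end)
```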

But it seems like I am still slurping all the data into memory first. Is there a better/more efficient way of doing this?
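On the memory question: Flow pulls from its producer on demand, so if the source is a `Stream` (e.g. `File.stream!/1` over a log file), the raw samples are never all in memory at once; only the per-partition accumulators are held, one small `{sum, count}` tuple per key. A sketch, assuming a hypothetical `metrics.csv` in the row format above:

```elixir
# Sketch: "metrics.csv" and the column layout are assumptions.
# File.stream!/1 is lazy; Flow demands lines in batches with backpressure,
# so memory use is bounded by the accumulators, not by the input size.
"metrics.csv"
|> File.stream!()
|> Flow.from_enumerable()
|> Flow.map(fn line ->
  [server, cpu, _time, user | _rest] =
    line |> String.trim() |> String.split(~r/,\s*/)
  {server, cpu, String.to_float(user)}
end)
|> Flow.partition(key: fn {server, cpu, _u} -> {server, cpu} end)
|> Flow.reduce(fn -> %{} end, fn {server, cpu, u}, acc ->
  Map.update(acc, {server, cpu}, {u, 1}, fn {sum, n} -> {sum + u, n + 1} end)
end)
|> Enum.to_list()
```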

I am not sure if there is a limit on the number of partitions; in my case I would need 100 * 4 partitions. If the number of servers increases or the core count changes, I would need to update the total number of partitions. Is there a way to update the total number of partitions dynamically?
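On the partition-count worry: as far as I understand, the number of partitions (Flow's `:stages` option, defaulting to `System.schedulers_online/0`) is independent of the number of distinct keys. Flow hashes each key onto the fixed set of stages, so 400 cores, or 4,000, still map onto the same handful of partitions, and nothing needs resizing as servers come and go. Roughly what the hashing amounts to (using `:erlang.phash2/2`; the exact internal mechanics are an assumption):

```elixir
# Keys are hashed onto a fixed number of stages; the distinct-key count
# does not matter. This only illustrates the idea behind Flow's default.
stages = System.schedulers_online()

partition_for = fn {server, cpu} ->
  # :erlang.phash2(term, range) returns a value in 0..range-1
  :erlang.phash2({server, cpu}, stages)
end

# Every sample for a given core deterministically lands in one stage:
partition_for.({"server1", "cpu0"})
```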

How should I construct the Flow logic using the two different partitions described above (one to group by server/CPU, one to group by time range)?
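Two `Flow.partition/2` calls can simply be chained: whatever the first reduce emits becomes the event stream that the second partition re-hashes. A sketch under the same assumed `{server, cpu, minute, usage}` tuple shape (though folding the time bucket into the first key, word-count style, would make the second partition unnecessary):

```elixir
window = 5

samples                                     # [{server, cpu, minute, usage}, ...]
|> Flow.from_enumerable()
# first partition: group by core
|> Flow.partition(key: fn {s, c, _m, _u} -> {s, c} end)
|> Flow.reduce(fn -> %{} end, fn {s, c, m, u}, acc ->
  Map.update(acc, {s, c, div(m, window)}, {u, 1}, fn {sum, n} ->
    {sum + u, n + 1}
  end)
end)
# the reduced map is emitted as {{server, cpu, bucket}, {sum, count}} events;
# second partition: re-hash those events by time bucket
|> Flow.partition(key: fn {{_s, _c, bucket}, _sum_count} -> bucket end)
|> Flow.map(fn {{s, c, bucket}, {sum, n}} -> {s, c, bucket, sum / n} end)
|> Enum.to_list()
```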

Thanks.

Be aware that, at this time, Flow is single-server. Use GenStage if you want distribution.


Nice catch: Flow is concurrent, not distributed. From what I understand, @josevalim's plan for Flow was an easy way to go from lazy (Streams) to concurrent processing of lists. Going from concurrent to distributed is the next step.