Use Flow for data aggregation

I am thinking of using Flow for data aggregation.

I have a large amount of data. Let's say the CPU usage statistics across 100+ servers: every minute, each core on each server has some CPU metrics collected, such as

     server1, cpu0, 00:01, 0.6, 0.4, 0.0
     server1, cpu1, 00:01, 0.6, 0.4, 0.0 
     server1, cpu2, 00:01, 0.6, 0.4, 0.0
     server1, cpu3, 00:01, 0.6, 0.4, 0.0
     server2, cpu0, 00:01, 0.6, 0.4, 0.0
     server2, cpu1, 00:01, 0.6, 0.4, 0.0 
     server2, cpu2, 00:01, 0.6, 0.4, 0.0
     server2, cpu3, 00:01, 0.6, 0.4, 0.0

     server1, cpu0, 00:02, 0.6, 0.4, 0.0
     server1, cpu1, 00:02, 0.6, 0.4, 0.0 
     server1, cpu2, 00:02, 0.6, 0.4, 0.0
     server1, cpu3, 00:02, 0.6, 0.4, 0.0
     server2, cpu0, 00:02, 0.6, 0.4, 0.0
     server2, cpu1, 00:02, 0.6, 0.4, 0.0 
     server2, cpu2, 00:02, 0.6, 0.4, 0.0
     server2, cpu3, 00:02, 0.6, 0.4, 0.0

I want to use Flow to aggregate the above data into averages over a configurable interval (say 5 min, 10 min, …):

  • Partition with a custom hash function, using (server name + CPU number) as the hash key
  • Reduce into arrays
  • Next, with another custom hash, partition the data into the time ranges 0-5, 5-10, 10-15, … 50-55, 55-60
  • Reduce each array by summing the values up, remembering the total count
  • Calculate the average from the sum and the count
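Not the poster's actual code, but a minimal sketch of how those steps could look in Flow (the `{server, cpu, minute, usage}` tuple shape is an assumption; also note that the time bucket can simply be folded into the reduce key, which collapses steps 3-4 into the first reduce):

```elixir
# Sketch only: assumes samples are already parsed into
# {server, cpu, minute, usage} tuples, e.g. {"server1", "cpu0", 1, 0.6}.
window = 5

samples
|> Flow.from_enumerable()
# hash by {server, cpu} so all samples for one core land in one partition
|> Flow.partition(key: fn {server, cpu, _min, _usage} -> {server, cpu} end)
# keep a running {sum, count} per {server, cpu, time_bucket}
|> Flow.reduce(fn -> %{} end, fn {server, cpu, min, usage}, acc ->
  bucket = div(min, window)
  Map.update(acc, {server, cpu, bucket}, {usage, 1}, fn {sum, n} ->
    {sum + usage, n + 1}
  end)
end)
# each accumulator entry is emitted as {{server, cpu, bucket}, {sum, count}}
|> Enum.to_list()
|> Enum.map(fn {key, {sum, n}} -> {key, sum / n} end)
```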

But it seems like I am still slurping all the data into memory first. Is there a better/more efficient way of doing this?
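On the memory question: Flow pulls from its producer on demand, so if the source is a `Stream` (e.g. `File.stream!/1` over a log file), the raw samples are never all in memory at once; only the per-partition accumulators are held, one small `{sum, count}` tuple per key. A sketch, assuming a hypothetical `metrics.csv` in the row format above:

```elixir
# Sketch: "metrics.csv" and the column layout are assumptions.
# File.stream!/1 is lazy; Flow demands lines in batches with backpressure,
# so memory use is bounded by the accumulators, not by the input size.
"metrics.csv"
|> File.stream!()
|> Flow.from_enumerable()
|> Flow.map(fn line ->
  [server, cpu, _time, user | _rest] =
    line |> String.trim() |> String.split(~r/,\s*/)
  {server, cpu, String.to_float(user)}
end)
|> Flow.partition(key: fn {server, cpu, _u} -> {server, cpu} end)
|> Flow.reduce(fn -> %{} end, fn {server, cpu, u}, acc ->
  Map.update(acc, {server, cpu}, {u, 1}, fn {sum, n} -> {sum + u, n + 1} end)
end)
|> Enum.to_list()
```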

I am not sure if there is a limit on the number of partitions; in my case I would need 100 * 4 partitions. If the number of servers increases or the core count changes, I would need to update the total number of partitions. Is there a way to update the total number of partitions dynamically?
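On the partition-count worry: as far as I understand, the number of partitions (Flow's `:stages` option, defaulting to `System.schedulers_online/0`) is independent of the number of distinct keys. Flow hashes each key onto the fixed set of stages, so 400 cores, or 4,000, still map onto the same handful of partitions, and nothing needs resizing as servers come and go. Roughly what the hashing amounts to (using `:erlang.phash2/2`; the exact internal mechanics are an assumption):

```elixir
# Keys are hashed onto a fixed number of stages; the distinct-key count
# does not matter. This only illustrates the idea behind Flow's default.
stages = System.schedulers_online()

partition_for = fn {server, cpu} ->
  # :erlang.phash2(term, range) returns a value in 0..range-1
  :erlang.phash2({server, cpu}, stages)
end

# Every sample for a given core deterministically lands in one stage:
partition_for.({"server1", "cpu0"})
```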

How should I construct the Flow logic using the two different partitions described above (one to group by server/CPU, one to group by time range)?
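Two `Flow.partition/2` calls can simply be chained: whatever the first reduce emits becomes the event stream that the second partition re-hashes. A sketch under the same assumed `{server, cpu, minute, usage}` tuple shape (though folding the time bucket into the first key, word-count style, would make the second partition unnecessary):

```elixir
window = 5

samples                                     # [{server, cpu, minute, usage}, ...]
|> Flow.from_enumerable()
# first partition: group by core
|> Flow.partition(key: fn {s, c, _m, _u} -> {s, c} end)
|> Flow.reduce(fn -> %{} end, fn {s, c, m, u}, acc ->
  Map.update(acc, {s, c, div(m, window)}, {u, 1}, fn {sum, n} ->
    {sum + u, n + 1}
  end)
end)
# the reduced map is emitted as {{server, cpu, bucket}, {sum, count}} events;
# second partition: re-hash those events by time bucket
|> Flow.partition(key: fn {{_s, _c, bucket}, _sum_count} -> bucket end)
|> Flow.map(fn {{s, c, bucket}, {sum, n}} -> {s, c, bucket, sum / n} end)
|> Enum.to_list()
```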

Thanks.

Be aware that, at this time, Flow is single-server. Use GenStage if you want distribution.


Nice catch: Flow is concurrent, not distributed. From what I understand, @josevalim's plan for Flow was an easy way to go from lazy (Streams) to concurrent processing of lists. Going from concurrent to distributed is the next step.