I am thinking of using Flow for data aggregation.
I have a large amount of data; let's say CPU usage statistics across 100+ servers. Every minute, each core on each server has some CPU metrics collected, such as:
server1, cpu0, 00:01, 0.6, 0.4, 0.0
server1, cpu1, 00:01, 0.6, 0.4, 0.0
server1, cpu2, 00:01, 0.6, 0.4, 0.0
server1, cpu3, 00:01, 0.6, 0.4, 0.0
server2, cpu0, 00:01, 0.6, 0.4, 0.0
server2, cpu1, 00:01, 0.6, 0.4, 0.0
server2, cpu2, 00:01, 0.6, 0.4, 0.0
server2, cpu3, 00:01, 0.6, 0.4, 0.0
server1, cpu0, 00:02, 0.6, 0.4, 0.0
server1, cpu1, 00:02, 0.6, 0.4, 0.0
server1, cpu2, 00:02, 0.6, 0.4, 0.0
server1, cpu3, 00:02, 0.6, 0.4, 0.0
server2, cpu0, 00:02, 0.6, 0.4, 0.0
server2, cpu1, 00:02, 0.6, 0.4, 0.0
server2, cpu2, 00:02, 0.6, 0.4, 0.0
server2, cpu3, 00:02, 0.6, 0.4, 0.0
I want to use Flow to aggregate the data above into averages over a configurable interval (say 5 min, 10 min, …):
- Partition with a custom hash function, using (server name + CPU number) as the hash key
- Reduce into arrays
- Next, use another custom hash to partition the data into the time ranges 0-5, 5-10, 10-15, …, 50-55, 55-60
- Reduce each array by summing the values up, remembering the total count
- Calculate the average from the sum and count
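The steps above could be sketched roughly like this. This is only a minimal sketch of what I have in mind: `stream_rows/0`, the 6-element tuple shape, `minute` being an integer, and averaging just the `user` column are all assumptions for illustration.

```elixir
interval = 5  # aggregation window in minutes (configurable)

stream_rows()                               # assumed: a lazy stream of parsed rows
|> Flow.from_enumerable()
# Route rows so all samples for one (server, cpu) land in the same partition.
|> Flow.partition(key: fn {server, cpu, _min, _user, _sys, _idle} -> {server, cpu} end)
# Reduce each partition into a map of {server, cpu, time_bucket} -> {sum, count}.
|> Flow.reduce(fn -> %{} end, fn {server, cpu, minute, user, _sys, _idle}, acc ->
  bucket = div(minute, interval)
  Map.update(acc, {server, cpu, bucket}, {user, 1}, fn {sum, n} -> {sum + user, n + 1} end)
end)
# When the window triggers, emit the averages computed from sum and count.
|> Flow.on_trigger(fn acc ->
  {Enum.map(acc, fn {key, {sum, n}} -> {key, sum / n} end), acc}
end)
|> Enum.to_list()
```

Here I folded the time-range grouping into the reduce key rather than doing a second `Flow.partition` for the time buckets, but I'm not sure that's the right structure.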
But it seems like I am still slurping all the data into memory first. Is there a better/more efficient way of doing this?
I'm not sure if there is a limit on the number of partitions; in my case I would need 100*4 partitions. If the number of servers increases or the core count changes, I need to update the total partitions. Is there a way to update the total number of partitions dynamically?
How should I construct the Flow logic using the two different partitions above (one to group by server/CPU, one to group by time range)?
Thanks.