Creating persistent real-time analytics of time-series data?

iwasaki-kenta · November 29, 2017, 5:21pm

Hello,

I am looking to create a system which has to be able to sustain and persist approx. 1000 events per second from Phoenix channels, and compute statistics of the events over different time windows for an IoT project I’m doing.

The system for example given a bunch of temperature sensors scattered around a room would over windows of 1 minute, 1 hour, 6 hours, and per day compute the minimum mean temperatures, maximum mean temperatures, and opening/closing mean temperatures of the time windows and be able to report them to a web dashboard in real-time as events come by.

The statistics reported cannot be approximate (so techniques like average moving windows/rolling windows cannot be used for these aggregated statistics).

The events right now are aggregated onto a Postgres table, and I was thinking upon receiving the event I could recompute the means/mins/openings/closings using a Postgres aggregation query, though that would obviously be very tasking on the system as this aggregation query would have to go through all events thousands of times a second.

Another method I was thinking of was keeping a running count of the number of events (total count), and having a counter accumulating the temperature values over time on a separate Postgres table (to then compute the mean). I’m worried however that this accumulation would go past the data type limit after awhile and become a pain to handle in the database.

Timing accuracy and persistence if I were to do everything in memory (Redis, ETS) is also a major concern. I hopefully don’t want to touch a time-series database which holds continuous queries, though if anyone knows a very simple way to plug one in then I’d be very interested.

Anyone have ideas on how to architect such a system?

christhekeele · November 29, 2017, 6:27pm

The sort of system you are describing with these challenging requirements is the stuff of Enterprise in-house software with data pipelines that a lot of hardware has been thrown at, especially if you are avoiding time series databases based on the no-continuous-query criteria. Perhaps if one of your requirements was loosened you’d have more luck finding supporting software?

Just based on your light description, it sounds like a column-oriented database might be able to perform the calculations you are interested in faster than postgres, in exchange for longer insert times. Whether or not it is fast enough or worth the trade-off depends greatly on your data I suspect.

iwasaki-kenta · November 29, 2017, 7:02pm

The write rate to the database can be loosened definitely, though so long as the system is able to efficiently aggregate and store the statistics.

Definitely re-querying the table would be too naive, so I’m looking to solutions to do it perhaps in-memory or if there is a nice way to do it w/ Postgres in mind.

For the column-oriented database by the way, is there any you would personally recommend? Or would you know how to perhaps orient Postgres into having one of its tables be a column-oriented database?

andre1sk · November 29, 2017, 9:42pm

you don’t need column oriented DB you can use stock PG or timscaleDB (basically postgres extension) for storage. Out of curiosity why can’t numbers be approximate? I doubt a human operator can absorb data at 1000 refreshes per second so what is the point exactly?

iwasaki-kenta · November 29, 2017, 10:00pm

Right so that’s another thing, there perhaps might be a really small buffer of 100ms windows for the statistics aggregation before publishing it out to clients.

The reason why the numbers cannot be approximated is because there are certain sensors in which the average value is crucial for publishing some events in the system.

Have you had any experience with TimescaleDB by the way?

entone · November 29, 2017, 10:17pm

I highly recommend InfluxDB with a Grafana frontend

Definitely check out the continuous queries for rolling up data. I generally do 30 day retention for realtime data, 1y retention 15min rollup, and infinite retention for 1hr rollup data.

I generally use the Docker image to run influxdb as a service.

You can see my MQTT based backend here, Brood

iwasaki-kenta · November 29, 2017, 10:24pm

Is there any real-time output available in InfluxDB which I could broadcast out through Phoenix channels? It seems there is only a HTTP API which I would have to consistently query on the server-end.

entone · November 29, 2017, 10:50pm

You would probably want to check out Kapacitor

Here’s a great example of outputting a JSON result. https://docs.influxdata.com/kapacitor/v1.3/guides/live_leaderboard/

You could also have it POST to an http endpoint and multiplex the message from a cowboy handler.

iwasaki-kenta · November 29, 2017, 11:00pm

Guess in this case it may be better to just write my own server, monitor a buffer every 100ms and place updates in a Postgres table + broadcast them.

On every buffer clear, the server would query the last average + its timestamp and compute aggregations (open, close, high, low, averages) if the present timestamp compared to the last timestamp has passed the interval for a certain time window.

Forgot my math and realized that there are formulas which are not approximations for the average hahah.

new_average = average * (n - length(items)) / n + Enum.sum(items) / n

entone · November 29, 2017, 11:19pm

You could also use a subscription and write data out to a UDP socket. https://docs.influxdata.com/influxdb/v1.3/query_language/spec/#create-subscription

I think losing the robustness of InfluxDB query language would be hard.

I would imagine a system that creates several subscriptions on continuous queries tables, and pipes that data to a UDP server that pushes the data to your websocket clients.

entone · November 30, 2017, 2:35am

If you are thinking of rolling your own, you probably want to look at Flow

andre1sk · November 30, 2017, 2:53am

Not in production just played around with it but looks solid and there are very large production deployments

RikLempens · November 30, 2017, 4:04pm

You might want to check out SiriDB (http://siridb.net)

SiriDB is a 100% open source time series database and should definitely meet your needs.
One of our customers has a 6 node setup consisting of 3 replicated pools which handles a load of 200.000 events per second quite easily.

Currently we have connectors for C, NodeJS, Go and Python as well as a universal http connector.
Adding an elixir connector would be really great so feel free to contribute we sure will help where we can!

iwasaki-kenta · December 1, 2017, 6:20am

Is there any interface for reporting updates to continuous queries in real-time? If there is, I’d love to work on an Elixir connector for it.

RikLempens · December 6, 2017, 7:40am

At the moment we don’t have support for continues queries yet.
We are currently looking how we can implement this, what timeframe are you looking at?

iwasaki-kenta · December 6, 2017, 3:50pm

Preferably within the next week or two; they would save computing power by a lot.

Right now I have all time series data aggregated on Postgres, with a simple continuous query executor I did in Elixir keeping track of running amounts for different time windows.

kip · December 7, 2017, 8:42am

If you’re using Postgres for your standard DB, https://www.pipelinedb.com might be worth investigating. (I have no personal knowledge, I’ve stored this link for later investigation for my own stuff)