Persistent real-time statistics for time series data

I’m working on a high-availability real-time application. Currently, it’s composed of a number of “bots”, each of which listens on WebSocket endpoints for data, responds to the data in various ways, and stores important events in CSV files. The system works great so far, and is able to run all day without problems.

I’d like to add the ability to compute statistics for these events, as they come in, to determine the performance of each “bot”.

I also have another, totally unrelated Phoenix web application which relies on very similar statistics. It stores the time series events in a Postgres database and re-calculates the needed statistics whenever someone visits a page. There aren’t any performance issues there – yet.

Ideally I’d like both of these applications to rely on the same solution. I don’t believe re-calculating on every request is the most appropriate solution for the real-time app, though. Please correct me if I’m wrong!

I’ve read through the post on Creating Persistent Real-Time Analytics of Time Series Data, but that post is a few years old, plus my use case is much less intense – for now.

Here are the important data points for the real-time app:

  • fewer than 100 events per day
  • fewer than 20 bots running at a given time
  • the calculations themselves aren’t computationally intense (for now)

Here are the features I’d like the statistics-tracking application to have:

  • persistent: if the system crashes, I should be able to load in the last known stats or compute the stats from events generated so far.
  • lightweight: I’d like this to be easy to include in other projects that deal with similar data, so I’d like to keep dependencies to a minimum. I’m not opposed to using a database, though, if it proves to be the most appropriate solution.
  • configurable: I’d like it to be relatively easy to support new statistics for the system as needed. I’d also like to be able to choose a subset of the statistics to track if I want.
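To make the "configurable" requirement concrete, here is a minimal sketch in plain Elixir. The `StatConfig` module, its function names, and the three example statistics are all hypothetical, not part of any library. Each statistic is assumed to be an initial value plus an update function, and only the chosen subset is tracked:

```elixir
defmodule StatConfig do
  # Hypothetical registry: each statistic is an {initial_state, update_fun} pair.
  def available do
    %{
      count: {0, fn _value, n -> n + 1 end},
      sum: {0, fn value, total -> total + value end},
      max: {nil, fn value, best -> if(is_nil(best) or value > best, do: value, else: best) end}
    }
  end

  # Builds the initial state for the chosen subset of statistics.
  def init(names) do
    Map.new(names, fn name ->
      {initial, _fun} = Map.fetch!(available(), name)
      {name, initial}
    end)
  end

  # Feeds one event value into every tracked statistic.
  def update(state, value) do
    Map.new(state, fn {name, acc} ->
      {_initial, fun} = Map.fetch!(available(), name)
      {name, fun.(value, acc)}
    end)
  end
end

# Track only :count and :max, then feed in three example event values.
state = Enum.reduce([3, 7, 5], StatConfig.init([:count, :max]), &StatConfig.update(&2, &1))
```

Because the state is a plain map, it could be written to whatever storage ends up being used and reloaded after a crash.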

Right now, my idea is to use a Supervisor with an Agent for each statistic I want to calculate and track. I’m a bit lost when it comes to the persistence part, though. I’m also new to time-series calculations and tools in general.
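A rough sketch of that idea, assuming one named Agent per statistic under a `:one_for_one` Supervisor. The `RunningStat` module and the `:profit` and `:latency` names are made up for illustration:

```elixir
defmodule RunningStat do
  # Hypothetical module: one Agent process per tracked statistic.
  use Agent

  # The state holds the count, sum, min and max of the values seen so far.
  def start_link(opts) do
    name = Keyword.fetch!(opts, :name)
    Agent.start_link(fn -> %{count: 0, sum: 0, min: nil, max: nil} end, name: name)
  end

  # Folds one new value into the running state.
  def record(name, value) when is_number(value) do
    Agent.update(name, fn s ->
      %{
        count: s.count + 1,
        sum: s.sum + value,
        min: if(is_nil(s.min), do: value, else: min(s.min, value)),
        max: if(is_nil(s.max), do: value, else: max(s.max, value))
      }
    end)
  end

  # Returns the state plus the mean (nil when nothing has been recorded).
  def summary(name) do
    Agent.get(name, fn s ->
      mean = if s.count == 0, do: nil, else: s.sum / s.count
      Map.put(s, :mean, mean)
    end)
  end
end

# Each child needs a distinct id because all of them use the same module.
children = [
  Supervisor.child_spec({RunningStat, name: :profit}, id: :profit),
  Supervisor.child_spec({RunningStat, name: :latency}, id: :latency)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

RunningStat.record(:profit, 10)
RunningStat.record(:profit, 20)
```

Note that an Agent restarted by the Supervisor starts again from the empty state, which is exactly why the persistence question matters.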

Do you have any suggestions for libraries or built-in tools I can use to solve my problem without too much overkill? Am I overthinking this? Thanks in advance!

Since you are already using PostgreSQL, TimescaleDB might fit your needs (it is a PostgreSQL extension).

I don’t know enough about your case to determine whether you are looking for a centralized store like Postgres (where multiple app instances would use the same database) or a local one (each app instance having its own storage). If you need a centralized one, you can disregard the rest of my post :)

If you are looking for local storage, you could check out CubDB (disclaimer: I am the author of the library). It is an embedded persistent store written in Elixir that supports key/value access and sorted ranges, kind of like a durable Map where you can select ranges of entries sorted by key. Storing time series and selecting ranges is quite easy and performant. You could follow the approach explained here, but using timestamps instead of integer IDs:

{:ok, db} = CubDB.start_link(data_dir: "some/dir")

# Save a measurement (temperature for the
# sake of the example):
key = {:temperature, :os.system_time(:millisecond)}
value = measure_temperature()
:ok = CubDB.put(db, key, value)

# Get measurements for a time range:
start_time = ...
end_time = ...
{:ok, measurements} = CubDB.select(db,
  min_key: {:temperature, start_time},
  max_key: {:temperature, end_time}
)
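To connect this back to the original question, statistics could be recomputed after a restart by folding over the selected entries. The sketch below is plain Elixir with no CubDB calls. It assumes the entries arrive as `{key, value}` tuples with the `{metric, timestamp}` keys used above, sorted by key, and the `StatsFromEntries` module and the sample data are hypothetical:

```elixir
defmodule StatsFromEntries do
  # Hypothetical helper: folds {{metric, timestamp}, value} entries into a summary.
  def summarize(entries) do
    initial = %{count: 0, sum: 0, first_ts: nil, last_ts: nil}

    acc =
      Enum.reduce(entries, initial, fn {{_metric, ts}, value}, acc ->
        %{
          count: acc.count + 1,
          sum: acc.sum + value,
          # Entries are assumed sorted by key, so the first one seen is the earliest.
          first_ts: acc.first_ts || ts,
          last_ts: ts
        }
      end)

    mean = if acc.count == 0, do: nil, else: acc.sum / acc.count
    Map.put(acc, :mean, mean)
  end
end

# Made-up sample data standing in for the result of a range selection.
entries = [
  {{:temperature, 1_000}, 20.0},
  {{:temperature, 2_000}, 22.0},
  {{:temperature, 3_000}, 21.0}
]

summary = StatsFromEntries.summarize(entries)
```

With fewer than 100 events per day, a full recomputation like this on startup should stay cheap.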

Of course, there are several other possible solutions out there, like DETS (which does not support sorted collections, making the time series case tricky), Mnesia, or embedded databases like LevelDB, LMDB, SQLite, etc. Here you can find a quick comparison with CubDB. I think CubDB would be a good choice if you are looking for something lightweight and idiomatic in Elixir.

Thanks for the suggestions. CubDB is pretty close to what I was looking for. I may decide to move to something more centralized like Timescale with Postgres at some point, but this should work for now. Thanks @lucaong!
