Storage on embedded (Nerves) device. CubDB limits?

Hi,

I am building a Nerves firmware that will collect PMU (Phasor Measurement Unit) measurements and handle other necessary things like connectivity over LTE, storage, retrieval of data on request, etc.

Data will be collected at a 50 Hz rate, 24/7/365, so this will not be a small amount of data.
I would like to keep 30 days of data on the device. If that turns out to be too problematic I would go for less, but these will be remote locations and the decision was pretty much made without me :wink:

I will be using a reComputer (based on the RPi CM4) as hardware, together with an LTE module.

For storing data locally, there are a few options on my table and in my head right now.

  1. CubDB. My first choice for my POC. I developed logic to benchmark it, so I am writing a lot of data in; my keys are timestamps, since I will only query by time range when data needs to be retrieved later (e.g. if an issue is detected). I can run my own queries, and currently I have about 2 million records in it; by my calculations I might end up with 110-150 million records. I have also built logic to delete data based on my retention, etc. For now it works just OK, but I am not sure yet about compaction (every time it runs, the filename changes), nor about behaviour with a much bigger number of records. I simply don't know the library well enough yet to be aware of its limits and gotchas. (A rough sketch of what my benchmark does is right after this list.)


  2. Files like Parquet, with something like a “daily index” approach. Easy to drop files older than 30 days, easy to query the files not currently being written to, BUT it is hard (or impossible) to access data in the current file while it is open for writing.

  3. DuckDB? I really like the idea behind it, but the Elixir client is not fully supported (yet). I would also like partitioning, but it is not available through the Elixir library.

  4. Something else?
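
For reference, the rough shape of what I am benchmarking with CubDB is below. Paths and values are made up, and this assumes CubDB v2, where select/2 returns a lazy stream:

# One record per sample, keyed by millisecond timestamp; reads are range scans.
{:ok, db} = CubDB.start_link(data_dir: "/data/pmu_db")

ts = System.os_time(:millisecond)
:ok = CubDB.put(db, ts, %{freq: 49.98, voltage: 231.2})

# All samples from the last 5 minutes (keys are ordered, so this is a range scan).
samples =
  CubDB.select(db, min_key: ts - :timer.minutes(5), max_key: ts)
  |> Enum.to_list()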

Does anyone have any experience with CubDB and numbers around those above? Will it perform OK, and are there issues to be expected? Would it be better to use SQLite? I am preparing a simulation environment for myself, but I am not there yet. The final setup will run on a Nerves device. So if anyone can share their own experiences, or what to be careful about, I would really appreciate it. And if there is some other interesting approach that you used and that worked well for you, I would really like to hear about it.

Thanks,
Tomaz

Hi @tomazbracic ,
author of CubDB here :slight_smile: While I and others have used CubDB for many data-logging tasks, your case may deserve special care. Most cases I am familiar with had either a slower continuous rate, a higher rate but in bursts, or a shorter retention. In principle I don’t see any hard limit, and the write speed is not concerning, but due to the long retention and the continuous writes there are, in practice, a couple of things to be mindful of.

The main one, as you correctly note, is compaction. Compaction progresses concurrently with reads and writes, but it can take a while for large files, and it then needs to catch up with the writes that happened while it was running. Since your write load is a continuous 50 writes per second, with no pause, there can be a lot to catch up with. So while it’s possible that everything would work just fine, compaction is one area where I would recommend some preliminary testing. Since compaction only reclaims disk space, and does not noticeably affect read or write performance, if you have enough disk space you could disable auto compaction and perform it programmatically at chosen moments of downtime (if there are any).
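
To illustrate, a minimal sketch of that setup (the path is just an example; see the CubDB docs for the exact options):

# Disable automatic compaction and trigger it yourself during a quiet moment.
{:ok, db} =
  CubDB.start_link(
    data_dir: "/data/pmu_db",  # example path
    auto_compact: false
  )

# ...later, at a chosen moment of downtime...
CubDB.compact(db)  # returns :ok, or {:error, reason} e.g. if a compaction is already running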

Do you need to query old data live, or are you simply logging it? In the latter case, you could even disable auto compaction completely and start a new database in a new data directory each day, removing old databases and their data directories after the retention period.

In order to help you decide if CubDB is the right choice for you, I have a few other questions:

  • How important is durability of data or atomicity of transactions in presence of sudden unexpected shutdowns? This is one of the situations that CubDB is designed to handle.
  • Is it helpful to use native Elixir data structures in your data, or could they just be serialized strings? In the first case CubDB can help a lot; in the second, you have more freedom in what to use to store the data (SQLite, for example).
  • I assume the write pattern is that of a data logger (mostly only appends). What about your read pattern? Do you need to query your data live?
  • Do you need the same granularity for the whole retention period, or is it possible to summarize or coarse-grain data older than, say, a few days?

Hi @lucaong

  • How important is durability of data or atomicity of transactions in presence of sudden unexpected shutdowns? This is one of the situations that CubDB is designed to handle.

→ Durability and atomicity are very important, since the measurements are the foundation for further actions by the grid operator.

  • Is it helpful to use native Elixir data structures in your data, or could they just be serialized strings? In the first case CubDB can help a lot; in the second, you have more freedom in what to use to store the data (SQLite, for example).

→ This isn’t really super important. I get data through a proprietary protocol, and once it is deserialized I can do pretty much whatever I want with it.

  • I assume the write pattern is that of a data logger (mostly only appends). What about your read pattern? Do you need to query your data live?

→ Yes, writing could be done in append-only mode. I won’t need any other CRUD operations on this data, so it is just collecting data in a log fashion. The only other thing I need is to be able to “drop”/delete data older than the retention period. That’s why I was thinking of a file-based approach, since that would make it easier. With CubDB, I assume that after a daily drop of data older than 30 days I would have to do a manual/automatic compaction, right? And that could be a bit challenging with 120 million records in the DB. I don’t have an estimate of the sizes yet, but I guess > 10 GB. (A rough sketch of the kind of daily drop I have in mind follows after these answers.)

  • Do you need the same granularity for the whole retention period, or is it possible to summarize or coarse-grain data older than, say, a few days?

Yes, for sure I could aggregate older data (if that is what you’re asking), like InfluxDB or some other DBs do. I don’t know CubDB that well yet. I was looking for something like partitioning; I know DuckDB has it, but the Elixir library doesn’t support it yet. Partitioning by time would be a great thing to have.
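
Roughly what I have in mind for that daily drop, assuming CubDB v2 (where select/2 returns a lazy stream); db is the CubDB process:

# Delete everything older than the retention cutoff, in batches of keys.
retention_ms = 30 * 24 * 60 * 60 * 1000
cutoff = System.os_time(:millisecond) - retention_ms

CubDB.select(db, max_key: cutoff)
|> Stream.map(fn {key, _value} -> key end)
|> Stream.chunk_every(10_000)
|> Enum.each(fn keys -> CubDB.delete_multi(db, keys) end)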

In addition to this, is there any way to keep the filename the same? Right now, every time compaction runs (I assume that is the reason) the filename changes. So I was thinking I could perhaps accomplish this “file structure” with CubDB as well, by creating a new DB file for each day. Then I would just need to take care of the supervision, perhaps with a DynamicSupervisor where each worker would be the “definition” for creating a new daily DB file/process, plus the writing logic around it. There is no “peak/off-peak” time; it will always be writing at a 50 Hz rate, so the logic around the file switch / new DB process creation would have to be done perfectly so that no data is lost. I do have some ideas already (roughly the sketch below), but I will have to test them.
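
Something along these lines; module and path names are made up, a DynamicSupervisor and a Registry are assumed to be started elsewhere in the supervision tree, and I still need to verify that CubDB accepts a :via name and works as a child spec like this:

defmodule Pmu.DailyDb do
  # Start one CubDB process per day, each with its own data directory.
  def start_for(date) do
    dir = Path.join("/data/pmu", Date.to_iso8601(date))

    DynamicSupervisor.start_child(
      Pmu.DbSupervisor,
      {CubDB, [data_dir: dir, auto_compact: false, name: via(date)]}
    )
  end

  # Registry-based name, so the writer can always look up "today's" DB process.
  def via(date), do: {:via, Registry, {Pmu.DbRegistry, Date.to_iso8601(date)}}
end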

Hope I answered your questions, and perhaps this sparks some great suggestions :wink:

Thank you for your great first reply.

If you really only append new logs, and basically never need to query those logs live, I think that CubDB might not be needed. The best features of CubDB are atomic transactions, extreme robustness in the face of unexpected shutdowns and power loss, the possibility to use native Elixir data structures, and ease of performing both point queries and range selections. In your case, as far as I can see, none of that is really needed. You do need durability, but given the append-only write pattern, you can achieve it in other ways.

In light of what you wrote, probably the easiest option is to simply write plain flat log files, one per day, using a naming scheme such as year-month-day.log, and perform log rotation by removing the old ones after the retention period. If your data is such that there is a simple way to separate each entry (say, one per line), and you really never need to query that data, this would work and be simple to maintain and understand. If you need durability, you will have to make sure to fsync each write (slow, but safe), or, if it’s acceptable to lose, say, the last few seconds of data, fsync every N seconds or every N writes (more performant). Using a human-readable format for your data would have the added benefit of making your logs readable just by opening them in a text editor, or with command-line tools like awk (although you would probably need to load the logs into some system for deeper analysis).
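
A minimal sketch of what I mean, assuming one line per entry, a new file per day, and an fsync every N writes (all names are examples, error handling omitted):

defmodule PmuLogger do
  @moduledoc "Appends one line per sample to a daily log file, fsync-ing every N writes."
  use GenServer

  @fsync_every 500  # at 50 writes/s this bounds data loss to roughly 10 seconds

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def log(line), do: GenServer.cast(__MODULE__, {:log, line})

  @impl true
  def init(opts) do
    {:ok, %{dir: Keyword.fetch!(opts, :dir), date: nil, fd: nil, pending: 0}}
  end

  @impl true
  def handle_cast({:log, line}, state) do
    state = rotate_if_needed(state)
    :ok = :file.write(state.fd, [line, ?\n])
    state = %{state | pending: state.pending + 1}

    if state.pending >= @fsync_every do
      :ok = :file.datasync(state.fd)
      {:noreply, %{state | pending: 0}}
    else
      {:noreply, state}
    end
  end

  # Open a new file named after the current UTC date whenever the day changes.
  defp rotate_if_needed(%{date: date, fd: fd} = state) do
    today = Date.utc_today()

    if date == today do
      state
    else
      if fd, do: :file.close(fd)
      path = Path.join(state.dir, Date.to_iso8601(today) <> ".log")
      {:ok, new_fd} = :file.open(path, [:append, :raw, :binary])
      %{state | date: today, fd: new_fd, pending: 0}
    end
  end
end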

If you instead need to query this data live, you could follow a hybrid approach, where you log raw un-aggregated data in plain log files as explained above, but also store aggregated data in a database for live querying. In a few projects, for example, my system allowed live queries such as “what’s the average value of X between last Tuesday 3pm and Saturday 4pm”, which were served by CubDB using data aggregated by minute, while operators could also access the device and download the un-aggregated raw log files for offline investigation.
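
For example (just a sketch, not code from those projects): keep one aggregate per minute, keyed by the minute’s Unix timestamp, and answer range queries from those aggregates.

defmodule Aggregates do
  # Writer: fold each sample into its minute's aggregate. This get + put is only
  # safe if a single process owns these writes; otherwise it must be made atomic.
  def add_sample(db, ts_seconds, value) do
    minute = div(ts_seconds, 60) * 60
    agg = CubDB.get(db, minute, %{sum: 0, count: 0})
    CubDB.put(db, minute, %{sum: agg.sum + value, count: agg.count + 1})
  end

  # Reader: average of all samples between two Unix timestamps.
  # Assumes CubDB v2, where select/2 returns a lazy stream.
  def average(db, from_ts, to_ts) do
    {sum, count} =
      CubDB.select(db, min_key: div(from_ts, 60) * 60, max_key: to_ts)
      |> Enum.reduce({0, 0}, fn {_minute, agg}, {s, c} -> {s + agg.sum, c + agg.count} end)

    if count > 0, do: sum / count, else: nil
  end
end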

If you do need to query the data live, at arbitrary granularity, but typically only for the current day, you could do more or less the same as the first suggestion, but using CubDB instead of plain log files. You could restart CubDB on a different data directory at the beginning of each day. At that point you can disable compaction completely, which also ensures that the DB file never changes name, and remove old data directories after the retention period, just like you would do with plain files. This way you gain the possibility to query your data live, but on the other hand your logs cannot be read in a plain text editor. Alternatively, you can use SQLite, which is a great option, but it would also very likely need some tweaking of the configuration to support the write pattern and performance that you need.
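
Removing the expired data can then be as simple as something like this (assuming the data directories are named after ISO dates, as sketched above):

# Remove daily data directories (named like "2024-10-11") older than 30 days.
base_dir = "/data/pmu"  # example path
cutoff = Date.add(Date.utc_today(), -30)

base_dir
|> File.ls!()
|> Enum.each(fn name ->
  with {:ok, date} <- Date.from_iso8601(name),
       :lt <- Date.compare(date, cutoff) do
    File.rm_rf!(Path.join(base_dir, name))
  else
    _ -> :ok  # not a dated directory, or still within retention
  end
end)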

I hope this helps, happy to answer further questions if needed :slight_smile:


If you decide to use plain log files, note that the standard Elixir Logger with its default logger_std_h handler already supports log rotation, as well as delayed_write and filesync_repeat_interval options for tweaking durability and write performance (documented in the file module). Unfortunately, as far as I know, in Elixir and Erlang it is not generally possible to configure a special purpose logger distinct from the global logger.
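
For reference, the rotation and sync options look roughly like this on recent Elixir versions (this tweaks the global default handler, which is exactly the limitation I mentioned; the exact keys are documented in logger_std_h, and the values below are only examples):

# config/config.exs
config :logger, :default_handler,
  config: [
    file: ~c"/data/log/system.log",
    max_no_bytes: 10_000_000,         # rotate when the file reaches ~10 MB...
    max_no_files: 30,                 # ...keeping up to 30 rotated files
    filesync_repeat_interval: 5_000   # periodic fsync every 5 seconds
  ]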

It is possible to use filters to direct specific logs to a different handler based on level or metadata, but I find it obscures the intent a bit, when what you really want is to treat your data logs differently from all the rest.

One possibility, then, is to bypass Logger and handlers completely and use the Erlang disk_log module directly, which offers log rotation and other useful options. One limitation is that it rotates files by size, not by date, which makes it harder to implement exactly the 30-day retention you want.
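
A rough sketch of using disk_log directly (a size-based wrap log, so retention is bounded by total size rather than by date; sizes and names are only examples):

# Open (or create) a wrap log of 30 files x 64 MB, in external (plain bytes) format.
{:ok, log} =
  :disk_log.open(
    name: :pmu_log,
    file: ~c"/data/log/pmu",  # disk_log expects a charlist path
    type: :wrap,
    format: :external,
    size: {64 * 1024 * 1024, 30}
  )

# External format requires the b* functions (bytes in, bytes out).
:ok = :disk_log.blog(log, "1728663802,49.98,231.2\n")

# Flush outstanding writes to disk when needed.
:ok = :disk_log.sync(log)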


I think you can use DuckDB transparently under exqlite/EctoSqlite. I haven’t tried it, but I have swapped the SQLite library for the SQLCipher variant and that works; DuckDB seems to offer something similar.

I don’t think that amount of data should be challenging for DuckDB on a good storage medium.

There are checkpoint compactions you would need to test for reclaiming disk space. Deleting records only marks them for deletion.

Something worth looking at if DuckDB is of interest anyway.

Here’s another idea, mixing suggestions from above, which might be a good solution if you only ever append timestamped values, but also need live read access to data for a range of timestamps (e.g. “show me all values between timestamp A and B”).

You could use flat append-only log files, rotated each day or so and removed after 30 days, as described above. On top of that, to allow for querying a range of values, you could store in a database like CubDB or SQLite a reference to the file and offset every N log entries.

Your log files would contain all entries (possibly in a human-readable text format, separated by a suitable separator, e.g. a newline). The database would instead contain data indexed by timestamp, pointing at the file and byte offset of the log entry written at that timestamp. New entries would not be written to the database for every log entry, but rather every N entries (say every 3000 entries, corresponding to 1 minute in your case). It’s also important that the file names are not changed by the log rotation process, and that they have a definite order (for example, the filename is the year-month-day, or an incrementing integer, or a combination of the two). The mapping in the database would be something like:

# timestamp => { filename, byte_offset }
1728663802 => {"2024-10-11.log", 21345}

If you need to retrieve logs in a range of timestamps, you can look up in the database the closest timestamps to the start and end of your range, and read the corresponding file(s) from/until the byte offset. Every day, a scheduled process could remove log files, as well as database entries, older than the retention period.
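
A sketch of the lookup side (assuming CubDB v2 and the mapping above, with the writer storing {filename, byte_offset} every N entries; handling ranges that span multiple files is left out for brevity):

defmodule LogIndex do
  # Index entry at or just before `ts`: where to start reading.
  def floor_entry(db, ts) do
    entries = CubDB.select(db, max_key: ts, reverse: true) |> Enum.take(1)

    case entries do
      [{_ts, {file, offset}}] -> {file, offset}
      [] -> nil
    end
  end

  # Stream lines of one log file from a byte offset; the caller filters entries
  # by timestamp and stops once past the end of the requested range.
  def stream_from(log_dir, file, offset) do
    Stream.resource(
      fn ->
        {:ok, fd} = :file.open(Path.join(log_dir, file), [:read, :raw, :binary, :read_ahead])
        {:ok, _} = :file.position(fd, offset)
        fd
      end,
      fn fd ->
        case :file.read_line(fd) do
          {:ok, line} -> {[line], fd}
          :eof -> {:halt, fd}
        end
      end,
      fn fd -> :file.close(fd) end
    )
  end
end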

Since the database would then receive a much lower write load, you can then safely use any solution that suits your needs, and definitely both CubDB and SQLite would work just fine, even with their default options. You also still get the option of downloading the raw log files for offline inspection.