Concurrent reads w/ postgres

rm-rf-etc · May 27, 2020, 9:21am

I’m storing time-series data and need to repeat an analysis process across thousands of series of records, in 1 day chunks. I’m planning on querying my data using a timestamp range to fetch each day. If I run this process over 4 years of data, it’s potentially 1,200,000 separate reads.

I want to know if all of my postgres queries will run fully in parallel if each is made from it’s own spawned process, and what other things I should consider to get the best possible completion time.

This app is only for personal use and doesn’t need to serve public requests.

hauleth · May 27, 2020, 10:50am

No, there will be at most N queries in parallel where N is size of the pool. However if you want do so much reads then maybe it would be better to process that data within DB instead of doing so many parallel queries.

rm-rf-etc · May 27, 2020, 3:12pm

Like a stored procedure? Am I able to do what you suggest using Elixir code?

hauleth · May 27, 2020, 3:23pm

I have no idea what you are trying to achieve so I cannot tell what would be the best solution.

bglusman · May 27, 2020, 3:25pm

I don’t know of any way to write stored procedures in Elixir, but there’s also ways to do quite a lot of aggregation and processing in most databases without stored procedures. I think to help guide you more context may be needed though. You could potentially have a very large pool size for your database/have read replicas and have perhaps hundreds of connections per replica, and do all the processing in Elixir, but it may not be the most efficient path. Is what you’re doing a one time process or recurring regularly at some interval? Are you using Postgres or some other database? I have some similar per-day analysis I do in my Elixir app, and make fairly heavy use of Ecto group_by and aggregation to have the database do much of the processing. You said the app is only for personal use, is it an educational project whose primary purpose is to learn things, or you just want to get things done simplest way and don’t care much about how/what you learn along the way?

rm-rf-etc · May 27, 2020, 5:45pm

Yes, I’m using postgres, like in the title

This is just a personal research project that I’m doing with a friend. It’s like sensor data for weather conditions. The concern is being able to change our analysis techniques and be able to see the results quickly.

Some example use cases:
Each record has 6 numeric values. For every morning, get the sum of value a for all records between the 5am and 8am. Also, find the highest value for b, and the lowest value for c. Or, get the delta between d at 5am and the high of the morning for d, as well as get the time of the high for d for that time period.

So we want to do these kinds of calculations across thousands of sensors, going back 4 years, and as we learn from each analysis, we’d like to be able to try new calculations and get results back in a timely manner.

The time interval is in minutes, meaning there’s 1 record for every minute.

hauleth · May 27, 2020, 6:16pm

You can use TimescaleDB (PostgreSQL extension) to make it snappier and all stuff you have described are possible to achieve with SQL, so instead fetching and calculating them in the application, move as much as possible to the DB in form of query.

brightball · May 27, 2020, 7:00pm

This might belong in a separate thread, but I’d be interested to get your impression of Timescale. They made big news recently by making the distributed version free so I’ve been planning to give them a look when I get some time.

rm-rf-etc · May 28, 2020, 8:40pm

I probably won’t be using a cluster.

Overall, getting it setup was pretty darn easy. What was a little more trouble was figuring out how to get a one-to-many relationship to work with a compound index, in ecto and timescale, and it was complicated by me trying to use unix timestamps as part of the index instead of utc_datetime, but I eventually gave up on that and went with utc_datetime, which made everything work.