First of all, I realize that both Flow and Broadway are built on top of GenStage.
I find Flow a bit confusing compared to how Broadway is presented, probably because Flow is trying to be applicable to many more uses than just data pipelines.
My Use Case:
I’m fetching time-series data for a set of sensors from an API, and I need to run many requests in parallel, as fast as possible. First I fetch the list of sensors (~1,000 of these). Then, for each sensor, I fetch the time-series data in one-day chunks, between now and some selected date in the past. Finally, I pipe all the chunks into my Postgres + TimescaleDB database (calling Repo.insert_all/3 once per batch).
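To make the shape of the job concrete, here is a minimal sketch of what I have in mind, written with only Task.async_stream/3 from the standard library rather than Flow or Broadway. The module name SensorBackfill, the fetch_fun and insert_fun callbacks, and the default concurrency of 20 are placeholders I made up for illustration; in the real thing fetch_fun would call the API and insert_fun would wrap Repo.insert_all/3.

```elixir
defmodule SensorBackfill do
  # Hypothetical sketch: fetch_fun and insert_fun are caller-supplied
  # callbacks standing in for the real API client and Repo.insert_all/3.

  # Split the span from `from` up to `to` into one-day {day_start, day_end}
  # windows. Returns [] when `to` is not after `from`.
  def day_chunks(%Date{} = from, %Date{} = to) do
    case Date.diff(to, from) do
      days when days > 0 ->
        for offset <- 0..(days - 1) do
          {Date.add(from, offset), Date.add(from, offset + 1)}
        end

      _ ->
        []
    end
  end

  # Fetch every {sensor, day} combination with bounded parallelism, then
  # hand each fetched batch to insert_fun from the calling process.
  def run(sensors, from, to, fetch_fun, insert_fun, opts \\ []) do
    # Assumed default; tune to what the API tolerates.
    max_concurrency = Keyword.get(opts, :max_concurrency, 20)
    chunks = day_chunks(from, to)

    jobs = for sensor <- sensors, chunk <- chunks, do: {sensor, chunk}

    jobs
    |> Task.async_stream(
      fn {sensor, {day_start, day_end}} ->
        fetch_fun.(sensor, day_start, day_end)
      end,
      max_concurrency: max_concurrency,
      timeout: 30_000
    )
    |> Enum.each(fn {:ok, rows} -> insert_fun.(rows) end)
  end
end
```

In this sketch the fetches run in parallel but the inserts happen one batch at a time in the calling process, which is the part I’m least sure about.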
Based on what I’ve read and watched, Broadway sounds ideal, but it seems that writing my own producer, producer-consumer, or consumer is not (or was not) what the authors originally intended. I’m saying this based only on how the README comes across to me. Please correct me if I’m wrong.
Flow doesn’t seem as nicely tailored to processing data like I’m doing, and also doesn’t seem as approachable.
Could somebody please advise me as to which of these I should make my focus?
Also, does it make sense to do Postgres bulk inserts from parallel consumers? My guess is that it does not. The folks at TimescaleDB recommend a RAID 0 array of a small number of drives, with the WAL on a separate disk from the data, to get better bulk insert performance, but I assume parallel writes are still not supported. Could somebody tell me how this works with Elixir? Does Elixir run a single process for all DB queries, regardless of how many processes I have sending DB write requests?