Hello Guys,
Is Elixir a good choice for doing ETL stuff ?
We need to sync few tables and have to transfer about 200 million rows in first iteration of ETL.
We need to do this between Oracle DB and Postgres.
GenStage is perfect for this kind of workload, as it allows the program to be throttled using back-pressure based on what your target system can take (as well as how fast your data source(s) can provide). If you can partition the source data and your transform steps are a bottleneck (as opposed to the extract or load being the bottlenecks due to capacity at either source or destination), then you can also build your application to be clustered, allowing for horizontal scaling of the ETL workload.
Moreover, Elixir makes it trivial to run multiple work orders at the same time, spreading the work out across however many cores you have in the machine running your ETL application.
That said: I would expect that writing this application in Elixir would not provide for the absolute fastest execution times for a single-core, single-threaded approach when compared to other languages available out there (C++, Java, …), but for concurrency (local multi-core and/or multi-system distribution), durability (fault tollerance), and developer productivity it is really hard to beat for these sorts of applications IME.
The Stream module is your friend for moving huge volumes of data. You will need to use native tools provided by postgresql and oracle. I did a screencast on how to fast inserts into postgresql using postgrex https://www.youtube.com/watch?v=YQyKRXCtq4s
Stream is great, but of course does not bring parallelism along with it. So if one is dealing with small amounts of data, or processes that absolutely must be strictly serialized (and even then there are sometimes tricks that can be played), then Stream makes lots of sense. For larger data sets where you have many cores (in one or more machines) to throw at it, then GenStage and/or Flow can often produce better results.
Presumably they have some transforms to do, perhaps also some targeted extraction. That’s usually what makes most ETL setups more complex than a simple dump/restore …
@andre1sk, @aseigo for initial setup we just take dump of oracle db and then restore it in postgres with copy command.
There are about 2.5 * 10^7 rows. And there are 12 tables more or less like this one.
We do need to do some transformation in the pipeline. And then we need to do some aggregation to make data fast enough for analytics tools(metabase in our case).
We want to use jamdb_oracle for fetching data from oracle but it’s very slow!!
And then we transform it and push that data to postgres with postgrex.
For transformation part we are learning about GenStage and Flow. And we are not sure if we can make select queries faster with help of GenStage or Asnc tasks!!! Currently those select queries are the slowest thing in entire operation.
Well if you are going to postgresql you could have the postgresql server itself access the oracle server via a postgre FDW then do the translations and copying all in-SQL.
Was also planning on using genstage for etl but in kind of a dilema. My producer is currently pulling down the data by the minute to provide near real time data. But after some time, this minute data is irrelevant and I want to roll it up to an hourly one like a sanity check.
Hmm do I need a separate genstage producer for this? I want to share the same code with the minute producer but just pass in some parameter to tell it to do hourly but not sure if I need 2 Process.send_after one by the minute and one by the hour. hmmm…
@mbuhot Hey! Thanks for the input! I didnt think about putting it on the producer consumer thinking its still a producer in itself because it will roll it up using sql.
I wonder if this is the setup.
P -> PC(daily) schedule every minute -> PC(hourly) schedule every hour -> C
@mbuhot Hi there! I was trying to figure this out. I haven’t seen any examples on multiple producer consumers though. I think there are multiple consumers or multiple producers but I’m not sure how the multiple producer/consumer works. Will the hourly subscribe to the minute producer consumer?
I need the per minute data to still persist to the db. so maybe i need a separate p-pc-c stage for hourly?
p(per minute) -> pc (transform) -> c (db)
p(per hour) -> pc( transform) -> c (db)
hmm I thought I can just reuse the existing p and just have it have 2 scheduled functions one for minute and one for hourly.