Elixir for ETL Tasks!


Hello Guys,
Is Elixir a good choice for doing ETL work?
We need to sync a few tables and transfer about 200 million rows in the first iteration of the ETL.
We need to do this between an Oracle DB and Postgres.

Can gen_stage be used for this kind of task?

Appreciate any guidance and pointers here :slight_smile:


GenStage is perfect for this kind of workload, as it allows the program to be throttled using back-pressure based on what your target system can absorb (and on how fast your data source(s) can deliver). If you can partition the source data and your transform steps are the bottleneck (as opposed to the extract or load being the bottleneck due to capacity at the source or destination), then you can also build your application to be clustered, allowing for horizontal scaling of the ETL workload.
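
To make the back-pressure idea concrete, here is a minimal GenStage producer/consumer sketch. The extract, transform, and load steps are stubbed out with placeholders (a real version would read from Oracle and bulk-insert into Postgres), so treat it as a shape to fill in rather than working ETL code:

```elixir
defmodule ETL.Producer do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts, name: __MODULE__)

  # State is simply the next source offset to read from.
  def init(opts), do: {:producer, Keyword.get(opts, :offset, 0)}

  # `demand` comes from the consumers: the producer is only ever asked
  # for as many rows as the downstream stages can currently absorb.
  def handle_demand(demand, offset) do
    # Placeholder extract step: a real version would read `demand` rows
    # from the source starting at `offset` (e.g. via jamdb_oracle).
    rows = Enum.to_list(offset..(offset + demand - 1))
    {:noreply, rows, offset + demand}
  end
end

defmodule ETL.Consumer do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts)

  def init(_opts), do: {:consumer, :ok, subscribe_to: [{ETL.Producer, max_demand: 500}]}

  def handle_events(rows, _from, state) do
    # Placeholder transform + load step: a slow load here automatically
    # slows the producer down, because demand only flows upstream once
    # the consumer is done with the current batch.
    IO.puts("loaded #{length(rows)} rows")
    {:noreply, [], state}
  end
end
```

Start one producer and a handful of consumers under a supervisor and you get a pipeline where slow inserts automatically throttle the reads.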

Moreover, Elixir makes it trivial to run multiple work orders at the same time, spreading the work out across however many cores you have in the machine running your ETL application.

That said: I would not expect an Elixir implementation to deliver the absolute fastest single-core, single-threaded execution times compared to some other languages out there (C++, Java, …), but for concurrency (local multi-core and/or multi-system distribution), durability (fault tolerance), and developer productivity it is really hard to beat for these sorts of applications IME.


The Stream module is your friend for moving huge volumes of data. You will also need to use the native tools provided by PostgreSQL and Oracle. I did a screencast on how to do fast inserts into PostgreSQL using Postgrex: https://www.youtube.com/watch?v=YQyKRXCtq4s
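
One way to get that kind of fast insert with Postgrex is to stream data into a COPY command inside a transaction. A rough sketch, where the database name, table, and generated rows are all placeholders:

```elixir
{:ok, conn} = Postgrex.start_link(database: "warehouse")

# Stand-in for rows streamed out of the source system.
source_rows = Stream.map(1..1_000_000, fn id -> {id, "name-#{id}"} end)

# COPY ... FROM STDIN expects tab-separated text lines by default.
copy_data = Stream.map(source_rows, fn {id, name} -> "#{id}\t#{name}\n" end)

# Postgrex streams must run inside a transaction; Enum.into/2 feeds the
# data into the COPY stream, which is far faster than row-by-row INSERTs.
Postgrex.transaction(
  conn,
  fn conn ->
    stream = Postgrex.stream(conn, "COPY items (id, name) FROM STDIN", [])
    Enum.into(copy_data, stream)
  end,
  timeout: :infinity
)
```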


Stream is great, but of course it does not bring parallelism along with it. So if you are dealing with small amounts of data, or with processes that absolutely must be strictly serialized (and even then there are sometimes tricks that can be played), then Stream makes lots of sense. For larger data sets where you have many cores (in one or more machines) to throw at the problem, GenStage and/or Flow can often produce better results.
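
For instance, a parallel transform step with Flow can be as small as the sketch below; the source, transform, and load functions are stand-ins for the real extract/transform/load code:

```elixir
# Placeholders for the real extract, transform, and load steps.
batched_source = Stream.chunk_every(1..1_000_000, 1_000)
transform_batch = fn batch -> Enum.map(batch, &(&1 * 2)) end
load_batch = fn batch -> IO.puts("loaded #{length(batch)} rows") end

batched_source
|> Flow.from_enumerable(max_demand: 4) # back-pressure: few batches in flight
|> Flow.map(transform_batch)           # transform runs across all schedulers
|> Flow.each(load_batch)               # side-effecting load step
|> Flow.run()
```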


Is there something that prevents you from dumping the data and then using COPY to load it on the PG side?


Presumably they have some transforms to do, perhaps also some targeted extraction. That’s usually what makes most ETL setups more complex than a simple dump/restore …


@andre1sk, @aseigo for the initial setup we just take a dump of the Oracle DB and then restore it in Postgres with the COPY command.
There are about 2.5 * 10^7 rows, and there are 12 tables more or less like this one.
We do need to do some transformation in the pipeline, and then some aggregation to make the data fast enough for analytics tools (Metabase in our case).

We want to use jamdb_oracle for fetching data from Oracle, but it's very slow!
We then transform the data and push it to Postgres with Postgrex.

For the transformation part we are learning about GenStage and Flow, and we are not sure whether we can make the SELECT queries faster with the help of GenStage or async tasks. Currently those SELECT queries are the slowest part of the entire operation.
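
For reference, one pattern for parallelising source SELECTs is to partition the table on its primary key and run the range queries with Task.async_stream. A rough sketch, where the chunk size, concurrency, and the `fetch_range/2` query stub are all assumptions, and which only helps if the Oracle server can actually serve several queries at once:

```elixir
defmodule ETL.ParallelFetch do
  @chunk_size 100_000

  # Splits [min_id, max_id] into fixed-size ID ranges and fetches them
  # concurrently; tune chunk size and concurrency to what Oracle can take.
  def run(min_id, max_id, concurrency \\ 8) do
    min_id
    |> Stream.iterate(&(&1 + @chunk_size))
    |> Stream.take_while(&(&1 <= max_id))
    |> Task.async_stream(
      fn lo -> fetch_range(lo, min(lo + @chunk_size - 1, max_id)) end,
      max_concurrency: concurrency,
      timeout: :infinity,
      ordered: false
    )
    |> Stream.flat_map(fn {:ok, rows} -> rows end)
  end

  # Placeholder for the real query, e.g. a jamdb_oracle SELECT with
  # "WHERE id BETWEEN :lo AND :hi".
  defp fetch_range(lo, hi), do: Enum.to_list(lo..hi)
end
```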


Well, if you are going to PostgreSQL, you could have the PostgreSQL server itself access the Oracle server via a Postgres foreign data wrapper (FDW, e.g. oracle_fdw) and then do the transformations and copying all in SQL.