collegeimprovements
Elixir for ETL Tasks!
Hello Guys,
Is Elixir a good choice for ETL work?
We need to sync a few tables and transfer about 200 million rows in the first iteration of the ETL.
We need to do this between an Oracle DB and Postgres.
Can gen_stage be used for this kind of task?
I'd appreciate any guidance and pointers here.
Most Liked
aseigo
GenStage is perfect for this kind of workload, as it allows the program to be throttled using back-pressure based on what your target system can take (as well as how fast your data sources can provide data). If you can partition the source data and your transform steps are the bottleneck (as opposed to the extract or load being limited by capacity at the source or destination), then you can also build your application to be clustered, allowing for horizontal scaling of the ETL workload.
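To make the back-pressure point concrete, here is a minimal sketch of a two-stage pipeline. The module names (`ETL.Extract`, `ETL.Load`), the integer "rows", and the demand limits are all made up for illustration, and it assumes the `gen_stage` package can be fetched from Hex via `Mix.install`. A real producer would page through the Oracle table and a real consumer would write to Postgres.

```elixir
# Assumption: the gen_stage package is available from Hex (Mix.install needs Elixir 1.12+).
Mix.install([{:gen_stage, "~> 1.2"}])

defmodule ETL.Extract do
  use GenStage

  # State is {next_id, last_id}, a stand-in for a cursor over the source table.
  def start_link(bounds), do: GenStage.start_link(__MODULE__, bounds)

  def init({first, last}), do: {:producer, {first, last}}

  # Emit at most as many rows as the consumer asked for: this is the back-pressure.
  def handle_demand(demand, {next, last}) when demand > 0 do
    upto = min(next + demand - 1, last)
    rows = if next <= upto, do: Enum.to_list(next..upto), else: []
    {:noreply, rows, {upto + 1, last}}
  end
end

defmodule ETL.Load do
  use GenStage

  def start_link(opts), do: GenStage.start_link(__MODULE__, opts)

  def init(opts) do
    producer = Keyword.fetch!(opts, :producer)
    parent = Keyword.fetch!(opts, :parent)
    # max_demand caps how many rows can be in flight towards this consumer.
    {:consumer, parent, subscribe_to: [{producer, max_demand: 1_000, min_demand: 500}]}
  end

  def handle_events(rows, _from, parent) do
    # A real loader would bulk-insert `rows` here; this one only reports the batch size.
    send(parent, {:loaded, length(rows)})
    {:noreply, [], parent}
  end
end

{:ok, producer} = ETL.Extract.start_link({1, 10_000})
{:ok, _consumer} = ETL.Load.start_link(producer: producer, parent: self())

# Collect batch reports until every row is accounted for (or give up after 5 seconds).
count_loaded = fn count_loaded, total ->
  receive do
    {:loaded, n} when total + n >= 10_000 -> total + n
    {:loaded, n} -> count_loaded.(count_loaded, total + n)
  after
    5_000 -> total
  end
end

loaded = count_loaded.(count_loaded, 0)
IO.puts("loaded #{loaded} rows")
```

The producer never reads ahead of what the consumer has requested, so a slow destination automatically slows the extract side down instead of filling memory.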
Moreover, Elixir makes it trivial to run multiple work orders at the same time, spreading the work out across however many cores you have in the machine running your ETL application.
That said, I would not expect an Elixir application to give the absolute fastest execution times for a single-core, single-threaded approach when compared to other languages out there (C++, Java, ...). But for concurrency (local multi-core and/or multi-system distribution), durability (fault tolerance), and developer productivity, it is really hard to beat for these sorts of applications IME.
minhajuddin
The Stream module is your friend for moving huge volumes of data. You will need to use the native tools provided by PostgreSQL and Oracle. I did a screencast on how to do fast inserts into PostgreSQL using Postgrex: https://www.youtube.com/watch?v=YQyKRXCtq4s
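As a rough illustration of the Stream approach (not the code from the screencast), the sketch below lazily transforms rows and groups them into batches, so only one batch is held in memory at a time. `StreamETL`, `to_csv_line/1`, and the fake `{id, name}` rows are hypothetical; a real `load_batch` function would pass each batch to the database's bulk-load mechanism, such as Postgres COPY.

```elixir
defmodule StreamETL do
  # Hypothetical transform: turn a source row into a CSV line for a bulk loader.
  def to_csv_line({id, name}), do: "#{id},#{name}\n"

  # Lazily walks `rows`, transforms and batches them, and calls `load_batch`
  # once per batch. Returns the number of batches handed to the loader.
  def run(rows, batch_size, load_batch) do
    rows
    |> Stream.map(&to_csv_line/1)
    |> Stream.chunk_every(batch_size)
    |> Stream.each(load_batch)
    |> Enum.count()
  end
end

# Stand-in source: a lazy stream of 10_000 fake rows instead of a database cursor.
source = Stream.map(1..10_000, fn id -> {id, "name_#{id}"} end)

# Stand-in loader: prints the batch size instead of writing to a database.
batches =
  StreamETL.run(source, 1_000, fn batch ->
    IO.puts("loading #{length(batch)} lines")
  end)
```

Nothing is read from the source until `Enum.count/1` starts pulling, so the same shape works whether the source has ten thousand rows or two hundred million.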
aseigo
Stream is great, but of course does not bring parallelism along with it. So if one is dealing with small amounts of data, or processes that absolutely must be strictly serialized (and even then there are sometimes tricks that can be played), then Stream makes lots of sense. For larger data sets where you have many cores (in one or more machines) to throw at it, then GenStage and/or Flow can often produce better results.
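For a feel of what adding parallelism to a stream looks like, here is a small sketch that uses `Task.async_stream/3` from the standard library as a stand-in; it is not GenStage or Flow, just the simplest built-in way to spread batches across cores. `ParallelTransform` and its doubling transform are invented for the example.

```elixir
defmodule ParallelTransform do
  # Hypothetical CPU-bound transform applied to one batch of rows.
  def transform(batch), do: Enum.map(batch, &(&1 * 2))

  # Splits `rows` into batches, transforms the batches concurrently,
  # and sums the transformed values.
  def run(rows, batch_size) do
    rows
    |> Stream.chunk_every(batch_size)
    |> Task.async_stream(&transform/1,
      # One concurrent task per scheduler (roughly one per core).
      max_concurrency: System.schedulers_online(),
      # Batches are independent here, so results may arrive in any order.
      ordered: false,
      timeout: 30_000
    )
    |> Stream.flat_map(fn {:ok, batch} -> batch end)
    |> Enum.sum()
  end
end

total = ParallelTransform.run(1..100_000, 5_000)
IO.puts("sum of transformed rows: #{total}")
```

If the steps must run strictly in order, dropping the `Task.async_stream/3` stage and mapping `transform/1` directly gives the plain serialized Stream version.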








